Closed by ponweist 10 years ago
With the variable substitutions ik_a=ik, ik_b=nnlist(ik,nn), ns_a=num_states(ik), ns_b=num_states(nnlist(ik,nn)), wm_a=winmin_q, wm_b=winmin_qb, the Wannier-gauge overlap matrix is now calculated as:
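allocate(tmp(ns_b,num_wann))
! tmp = S_o^H . V(ik_a), shape ns_b x num_wann
call gemm(S_o(wm_a:wm_a+ns_a-1, wm_b:wm_b+ns_b-1), &
          v_matrix(1:ns_a, 1:num_wann, ik_a), &
          tmp, 'C', 'N')
! S = tmp^H . V(ik_b) = V(ik_a)^H . S_o . V(ik_b), shape num_wann x num_wann
call gemm(tmp, &
          v_matrix(1:ns_b, 1:num_wann, ik_b), &
          S, 'C', 'N')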
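As a reading aid, the two calls can be written out as matrix products. This assumes that the 'C' flag conjugate-transposes the first operand, as in the usual BLAS convention; the gemm wrapper itself is not shown here. The symbols are shorthand introduced only for this note: V_a and V_b are the v_matrix slices at ik_a and ik_b, S_o is the windowed block of S_o, n_a = ns_a, n_b = ns_b and N_w = num_wann.

```latex
\begin{align}
  T &= S_o^{\dagger} V_a,
      & T &\in \mathbb{C}^{n_b \times N_w}, \\
  S &= T^{\dagger} V_b = V_a^{\dagger} S_o V_b,
      & S &\in \mathbb{C}^{N_w \times N_w}.
\end{align}
```

Under these assumptions the first product costs about n_a n_b N_w multiply-adds and the second about n_b N_w^2, whereas a direct four-index summation over both band indices for every element of S would need about n_a n_b N_w^2.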
For the 32sm case, the calculation was sped up from 3.05e11 to 5.9e8 CPU cycles (~ factor 610). The trace now looks like:
New trace after parallelizing kpath #5 and some cleanup of _getoper.F90 #7: total execution time dropped from 6 minutes originally to less than 2 minutes.
In the initialization phase, the master rank spends a great deal of time in get_aa_r, while the other ranks wait in comms_bcast_cmplx (see trace).
The runtime-critical loop is in _getoper.F90, lines 426ff.:
This seems to calculate the product of three matrices. By exploiting associativity, i.e. A.B.C=(A.B).C, one loop nesting level could be saved. Moreover, BLAS should be used here.
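The associativity argument can be illustrated with a small stand-alone sketch. It is not the code from _getoper.F90: the matrix names, the sizes ns and nw, and the helper fill_random are all hypothetical, and the intrinsic matmul stands in for a BLAS gemm call so that the program needs no external library.

```fortran
program triple_product_demo
  ! Sketch only: compares a naive four-index loop for S = A.B.C with the
  ! two-step evaluation tmp = A.B, S = tmp.C.
  ! All names and sizes are hypothetical, not taken from _getoper.F90.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ns = 6, nw = 4          ! hypothetical sizes
  complex(dp) :: a(nw, ns), b(ns, ns), c(ns, nw)
  complex(dp) :: s_naive(nw, nw), s_fast(nw, nw), tmp(nw, ns)
  integer :: i, j, m, n

  call fill_random(a)
  call fill_random(b)
  call fill_random(c)

  ! Naive form: four nested loops, about nw**2 * ns**2 multiply-adds.
  s_naive = (0.0_dp, 0.0_dp)
  do j = 1, nw
    do i = 1, nw
      do n = 1, ns
        do m = 1, ns
          s_naive(i, j) = s_naive(i, j) + a(i, m) * b(m, n) * c(n, j)
        end do
      end do
    end do
  end do

  ! Associative form: two matrix products, each only three loops deep,
  ! about nw * ns**2 + nw**2 * ns multiply-adds in total.
  tmp = matmul(a, b)          ! (A.B), shape nw x ns
  s_fast = matmul(tmp, c)     ! (A.B).C, shape nw x nw

  ! Both forms should agree up to floating-point rounding.
  print *, 'max abs difference:', maxval(abs(s_naive - s_fast))

contains

  subroutine fill_random(x)
    ! Fill a complex matrix with pseudo-random entries.
    complex(dp), intent(out) :: x(:, :)
    real(dp) :: re(size(x, 1), size(x, 2)), im(size(x, 1), size(x, 2))
    call random_number(re)
    call random_number(im)
    x = cmplx(re, im, kind=dp)
  end subroutine fill_random

end program triple_product_demo
```

In production code the two matmul calls would typically be replaced by zgemm (or a wrapper around it), which keeps the same two-step structure and adds the benefit of an optimized library kernel.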