ponweist / Wannier90-PRACE

Optimizations for Wannier90 (fork repository - see http://wannier.org for the official version).

GNU General Public License v2.0

Improve get_AA_R #2

Closed by ponweist 10 years ago

ponweist commented 10 years ago

During the initialization phase, the master rank spends an excessive amount of time in get_aa_r while the other ranks wait in comms_bcast_cmplx (see trace: trace-iss2).

The runtime-critical loop is in get_oper.F90, lines 426ff.:

          ! Wannier-gauge overlap matrix S in the projected subspace
          !
          call get_win_min(ik,winmin_q)
          call get_win_min(nnlist(ik,nn),winmin_qb)
          S=cmplx_0
          do m=1,num_wann
             do n=1,num_wann
                do i=1,num_states(ik)
                   ii=winmin_q+i-1
                   do j=1,num_states(nnlist(ik,nn))
                      jj=winmin_qb+j-1
                      S(n,m)=S(n,m)&
                           +conjg(v_matrix(i,n,ik))*S_o(ii,jj)&
                           *v_matrix(j,m,nnlist(ik,nn))
                   enddo
                enddo
             enddo
          enddo

This appears to compute the product of three matrices. By exploiting associativity, i.e. A.B.C=(A.B).C, and storing the intermediate product A.B, one loop nesting level could be saved. Moreover, BLAS should be used here.
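
Written out, the loop above appears to evaluate the expression below. The symbols are shorthand introduced here for illustration, not names from the code: $V^{(q)}$ and $V^{(q+b)}$ stand for v_matrix(:,:,ik) and v_matrix(:,:,nnlist(ik,nn)), $\tilde S_o$ for the windowed block of S_o, $T$ for the intermediate product, and $N_w$, $n_q$, $n_{q+b}$ for num_wann, num_states(ik) and num_states(nnlist(ik,nn)).

```latex
\begin{align*}
S_{nm} &= \sum_{i=1}^{n_q} \sum_{j=1}^{n_{q+b}}
          \overline{V^{(q)}_{in}} \, (\tilde S_o)_{ij} \, V^{(q+b)}_{jm}
\quad\Longleftrightarrow\quad
S = V^{(q)\dagger} \, \tilde S_o \, V^{(q+b)} \\[4pt]
% two-step evaluation using associativity
T &= V^{(q)\dagger} \, \tilde S_o \in \mathbb{C}^{N_w \times n_{q+b}},
\qquad S = T \, V^{(q+b)} \\[4pt]
% number of innermost multiply-add iterations
\underbrace{N_w^2 \, n_q \, n_{q+b}}_{\text{four nested loops}}
\quad &\text{vs.} \quad
\underbrace{N_w \, n_q \, n_{q+b} + N_w^2 \, n_{q+b}}_{\text{two matrix products}}
\end{align*}
```

The ratio of the two counts is $N_w n_q / (n_q + N_w)$, which approaches $N_w$ when $n_q$ is much larger than $N_w$.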

ponweist commented 10 years ago

With the variable substitutions

ik_a=ik, ik_b=nnlist(ik,nn)
ns_a=num_states(ik), ns_b=num_states(nnlist(ik,nn))
wm_a=winmin_q, wm_b=winmin_qb,

the Wannier-gauge overlap matrix is now calculated as:

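    allocate(tmp(ns_b,num_wann))

    ! Assuming the BLAS95-style generic gemm(a, b, c, transa, transb),
    ! which computes c = op(a)*op(b):
    ! tmp = S_o(window)^H * V_a                      (ns_b x num_wann)
    call gemm(S_o(wm_a:wm_a+ns_a-1, wm_b:wm_b+ns_b-1), &
              v_matrix(1:ns_a, 1:num_wann, ik_a), &
              tmp, 'C', 'N')
    ! S = tmp^H * V_b = V_a^H * S_o(window) * V_b    (num_wann x num_wann)
    call gemm(tmp, &
              v_matrix(1:ns_b,1:num_wann,ik_b), &
              S, 'C', 'N')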

For the 32sm case, the calculation was sped up from 3.05e11 to 5.9e8 CPU cycles (~ factor 520). The trace now looks like: trace-iss2-fix
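
The equivalence of the two-step product with the original four-level loop can be checked in isolation. The following is a minimal standalone sketch, not code from the repository: it uses the matmul intrinsic in place of the gemm calls so that it builds without a BLAS library, and the array sizes, window offsets, random test data and the names v_a, v_b, S_loop, S_mm and fill_random are assumptions made up for this check.

```fortran
program check_overlap
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical small sizes and window offsets, chosen only for this check
  integer, parameter :: num_wann = 3, ns_a = 5, ns_b = 4
  integer, parameter :: wm_a = 2, wm_b = 3, nb = 8
  complex(dp) :: S_o(nb, nb)
  complex(dp) :: v_a(ns_a, num_wann), v_b(ns_b, num_wann)
  complex(dp) :: S_loop(num_wann, num_wann), S_mm(num_wann, num_wann)
  complex(dp) :: tmp(ns_b, num_wann)
  integer :: m, n, i, j

  call fill_random(S_o)
  call fill_random(v_a)
  call fill_random(v_b)

  ! Reference: the original four-level loop
  S_loop = (0.0_dp, 0.0_dp)
  do m = 1, num_wann
    do n = 1, num_wann
      do i = 1, ns_a
        do j = 1, ns_b
          S_loop(n, m) = S_loop(n, m) &
               + conjg(v_a(i, n)) * S_o(wm_a+i-1, wm_b+j-1) * v_b(j, m)
        end do
      end do
    end do
  end do

  ! Two-step product, mirroring the 'C','N' arguments of the gemm calls:
  ! tmp = S_o(window)^H * V_a, then S = tmp^H * V_b
  tmp  = matmul(conjg(transpose(S_o(wm_a:wm_a+ns_a-1, wm_b:wm_b+ns_b-1))), v_a)
  S_mm = matmul(conjg(transpose(tmp)), v_b)

  ! Should be at the level of floating-point rounding error
  print '(a, es10.3)', 'max |S_loop - S_mm| = ', maxval(abs(S_loop - S_mm))

contains

  ! Fill a complex matrix with uniform random real and imaginary parts
  subroutine fill_random(a)
    complex(dp), intent(out) :: a(:, :)
    real(dp) :: re(size(a, 1), size(a, 2)), im(size(a, 1), size(a, 2))
    call random_number(re)
    call random_number(im)
    a = cmplx(re, im, kind=dp)
  end subroutine fill_random

end program check_overlap
```

Taking the conjugate transpose of tmp in the second step is what lets both products reuse v_matrix in its stored layout, without forming an explicit transposed copy of it.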

ponweist commented 10 years ago

New trace after parallelizing kpath (#5) and some cleanup of get_oper.F90 (#7): trace-iss7-fix. Total execution time improved from the original 6 minutes to less than 2 minutes.