ponweist / Wannier90-PRACE

Optimizations for Wannier90 (fork repository - see http://wannier.org for the official version).
GNU General Public License v2.0

Improve get_morb_R #3

Closed ponweist closed 10 years ago

ponweist commented 10 years ago

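In get_CC_R (_getoper.F90, lines 781ff.), the following triple matrix product is computed inefficiently with four nested loops, which costs on the order of num_wann^2 * num_states(qb1) * num_states(qb2) operations: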

                ! Transform to projected subspace, Wannier gauge
                !
                H_qb1_q_qb2(:,:)=cmplx_0
                do m=1,num_wann
                   do n=1,num_wann
                      do i=1,num_states(qb1)
                         ii=winmin_qb1+i-1
                         do j=1,num_states(qb2)
                            jj=winmin_qb2+j-1
                            H_qb1_q_qb2(n,m)=H_qb1_q_qb2(n,m)&
                                 +conjg(v_matrix(i,n,qb1))&
                                 *Ho_qb1_q_qb2(ii,jj)&
                                 *v_matrix(j,m,qb2)
                         enddo
                      enddo
                   enddo
                enddo

An improvement similar to the one made for get_AA_R (see #2) is needed here.
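For illustration, the quadruple loop above can be regrouped into two consecutive matrix products, which lowers the cost from roughly num_wann^2 * num_states^2 to roughly num_states^2 * num_wann + num_states * num_wann^2. The following is only a minimal sketch under stated assumptions, not the change that was actually committed: the array sizes, the names v1, v2, ns1, ns2, winmin1 and winmin2, and the helper fill_random are hypothetical stand-ins for v_matrix(:,:,qb1), v_matrix(:,:,qb2), num_states(qb1), num_states(qb2), winmin_qb1 and winmin_qb2, and the intrinsic matmul is used instead of any BLAS call.

```fortran
program triple_product_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical sizes, chosen only so that the example runs quickly
  integer, parameter :: num_wann = 4, num_bands = 10
  integer, parameter :: ns1 = 6, ns2 = 7          ! stand-ins for num_states(qb1), num_states(qb2)
  integer, parameter :: winmin1 = 3, winmin2 = 2  ! stand-ins for winmin_qb1, winmin_qb2
  complex(dp), parameter :: cmplx_0 = (0.0_dp, 0.0_dp)
  complex(dp) :: Ho(num_bands, num_bands)
  complex(dp) :: v1(ns1, num_wann), v2(ns2, num_wann)
  complex(dp) :: H_loop(num_wann, num_wann), H_mm(num_wann, num_wann)
  complex(dp) :: tmp(ns1, num_wann)
  integer :: m, n, i, j

  call fill_random(Ho)
  call fill_random(v1)
  call fill_random(v2)

  ! Reference: the four nested loops quoted in the issue
  H_loop = cmplx_0
  do m = 1, num_wann
    do n = 1, num_wann
      do i = 1, ns1
        do j = 1, ns2
          H_loop(n, m) = H_loop(n, m) &
               + conjg(v1(i, n)) * Ho(winmin1 + i - 1, winmin2 + j - 1) * v2(j, m)
        end do
      end do
    end do
  end do

  ! Regrouped: tmp = Ho(window) * V2, then H = V1^dagger * tmp
  tmp = matmul(Ho(winmin1:winmin1 + ns1 - 1, winmin2:winmin2 + ns2 - 1), v2)
  H_mm = matmul(conjg(transpose(v1)), tmp)

  ! Both results should agree up to floating-point rounding
  print *, 'max |H_loop - H_mm| =', maxval(abs(H_loop - H_mm))

contains

  ! Hypothetical helper: fill a complex matrix with random entries
  subroutine fill_random(a)
    complex(dp), intent(out) :: a(:, :)
    real(dp) :: re(size(a, 1), size(a, 2)), im(size(a, 1), size(a, 2))
    call random_number(re)
    call random_number(im)
    a = cmplx(re, im, kind=dp)
  end subroutine fill_random

end program triple_product_sketch
```

The program should compile with any Fortran 95 or later compiler (for example gfortran) and prints the largest element-wise difference between the two results. In production code the two matmul calls could presumably be replaced by BLAS routines, but that choice is outside this sketch.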

ponweist commented 10 years ago

Note that the critical code section is duplicated in get_morb_R (_getoper.F90, lines 1006ff.).

This is the current trace (16sm case, 32 processes, all berry tasks enabled, kpath and kslice disabled): trace-iss3

ponweist commented 10 years ago

Performance analysis for 16sm case running on 32 processes with the following parameters:

kpath = F
kslice = F

berry = T
berry_task = ahc,morb,kubo
berry_kmesh = 32 32 32

New trace: trace-iss3-fix

Performance improvement (in CPU cycles) relative to the previous code version:

Routine       Previous   Current   Speedup factor
berry_main    1.1e13     7.8e12    ~ 1.4
get_morb_R    3.8e12     5.6e11    ~ 6.8

ponweist commented 10 years ago

The next bottleneck in get_morb_R appeared in lines 854ff:

          ! Wannier-gauge overlap matrix S in the projected subspace
          !
          call get_win_min(ik,winmin_q)
          call get_win_min(nnlist(ik,nn),winmin_qb)
          S=cmplx_0
          H_q_qb(:,:)=cmplx_0
          do m=1,num_wann
             do n=1,num_wann
                do i=1,num_states(ik)
                   ii=winmin_q+i-1
                   do j=1,num_states(nnlist(ik,nn))
                      jj=winmin_qb+j-1
                      x = conjg(v_matrix(i,n,ik))*S_o(ii,jj)&
                           *v_matrix(j,m,nnlist(ik,nn))
                      S(n,m)=S(n,m) + x
                      H_q_qb(n,m)=H_q_qb(n,m) + x*eigval(ii,ik)
                   enddo
                enddo
             enddo
          enddo

Check if an extended version of get_gauge_overlap_matrix with an optional output parameter for H_q_qb can be used here.
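A rough sketch of what such an extension might look like follows. The real signature of get_gauge_overlap_matrix is not shown in this thread, so the routine overlap_sketch, its argument list, the array sizes and the helper fill_random are all hypothetical. The idea is to compute the intermediate product S_o(window) * V(qb) once, reuse it for S, and, only if the optional argument is present, scale its rows by the eigenvalues before forming H_q_qb.

```fortran
program optional_output_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical sizes and window offsets, chosen only for the example
  integer, parameter :: num_wann = 3, num_bands = 8
  integer, parameter :: ns_q = 5, ns_qb = 6
  integer, parameter :: winmin_q = 2, winmin_qb = 3
  complex(dp) :: S_o(num_bands, num_bands), x
  complex(dp) :: v_q(ns_q, num_wann), v_qb(ns_qb, num_wann)
  complex(dp) :: S(num_wann, num_wann), H(num_wann, num_wann)
  complex(dp) :: S_ref(num_wann, num_wann), H_ref(num_wann, num_wann)
  real(dp) :: eig(num_bands)
  integer :: m, n, i, j

  call fill_random(S_o)
  call fill_random(v_q)
  call fill_random(v_qb)
  call random_number(eig)

  ! Reference: the nested loops quoted in the issue
  S_ref = (0.0_dp, 0.0_dp)
  H_ref = (0.0_dp, 0.0_dp)
  do m = 1, num_wann
    do n = 1, num_wann
      do i = 1, ns_q
        do j = 1, ns_qb
          x = conjg(v_q(i, n)) * S_o(winmin_q + i - 1, winmin_qb + j - 1) * v_qb(j, m)
          S_ref(n, m) = S_ref(n, m) + x
          H_ref(n, m) = H_ref(n, m) + x * eig(winmin_q + i - 1)
        end do
      end do
    end do
  end do

  call overlap_sketch(S_o, v_q, v_qb, eig, winmin_q, winmin_qb, S, H)
  print *, 'max |S - S_ref| =', maxval(abs(S - S_ref))
  print *, 'max |H - H_ref| =', maxval(abs(H - H_ref))

contains

  ! Hypothetical routine: S is always computed, H only when requested
  subroutine overlap_sketch(So, vq, vqb, eig_q, wq, wqb, S_out, H_out)
    complex(dp), intent(in) :: So(:, :), vq(:, :), vqb(:, :)
    real(dp), intent(in) :: eig_q(:)
    integer, intent(in) :: wq, wqb
    complex(dp), intent(out) :: S_out(:, :)
    complex(dp), intent(out), optional :: H_out(:, :)
    complex(dp), allocatable :: tmp(:, :)
    integer :: k, n1, n2

    n1 = size(vq, 1)
    n2 = size(vqb, 1)
    allocate (tmp(n1, size(vqb, 2)))
    ! Shared intermediate: S_o(window) * V(qb)
    tmp = matmul(So(wq:wq + n1 - 1, wqb:wqb + n2 - 1), vqb)
    S_out = matmul(conjg(transpose(vq)), tmp)
    if (present(H_out)) then
      ! Scale row k by the eigenvalue of band wq + k - 1, then project
      do k = 1, n1
        tmp(k, :) = eig_q(wq + k - 1) * tmp(k, :)
      end do
      H_out = matmul(conjg(transpose(vq)), tmp)
    end if
  end subroutine overlap_sketch

  ! Hypothetical helper: fill a complex matrix with random entries
  subroutine fill_random(a)
    complex(dp), intent(out) :: a(:, :)
    real(dp) :: re(size(a, 1), size(a, 2)), im(size(a, 1), size(a, 2))
    call random_number(re)
    call random_number(im)
    a = cmplx(re, im, kind=dp)
  end subroutine fill_random

end program optional_output_sketch
```

Because overlap_sketch is an internal procedure, its interface is explicit, which Fortran requires for optional dummy arguments. Callers that only need S can simply omit the last argument and skip the extra product.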

ponweist commented 10 years ago

get_morb_R now uses the extended routine get_gauge_overlap_matrix, which has an optional output parameter for H_q_qb.

New trace: trace-iss3-fix2

New performance analysis:

Routine       Previous   Current   Speedup factor
berry_main    1.1e13     7.4e12    ~ 1.5
get_morb_R    3.8e12     1.8e11    ~ 21

Initialization time is down from ~53 s to ~3 s (!).