ponweist / Wannier90-PRACE

Optimizations for Wannier90 (fork repository - see http://wannier.org for the official version).
GNU General Public License v2.0
1 stars 0 forks source link

More efficient evaluation for trace formula #1

Closed ponweist closed 10 years ago

ponweist commented 10 years ago

In berry.F90, lines 811ff., the "trace formula" is evaluated inefficiently:

    ! Trace formula, Eq.(51) LVTS12
    !
    do if=1,nfermi
       do i=1,3
          !
          ! J0 term (Omega_bar term of WYSV06)
          mdum=matmul(f_list(:,:,if),OOmega(:,:,i))
          imf_k_list(1,i,if)=utility_re_tr(mdum)
          !
          ! J1 term (DA term of WYSV06)
          mdum =matmul(AA(:,:,alpha_A(i)),JJp_list(:,:,if,beta_A(i)))&
               +matmul(JJm_list(:,:,if,alpha_A(i)),AA(:,:,beta_A(i)))
          imf_k_list(2,i,if)=-2.0_dp*utility_im_tr(mdum)
          !
          ! J2 term (DD of WYSV06)
          mdum=matmul(JJm_list(:,:,if,alpha_A(i)),JJp_list(:,:,if,beta_A(i)))
          imf_k_list(3,i,if)=-2.0_dp*utility_im_tr(mdum)
          !
       end do
    enddo

For the 32sm case (berry_kmesh = 48 48 48), currently 30% of the runtime are spent in this calculation.

Note that complete matrix products are calculated although only the diagonal elements of these products are needed.

ponweist commented 10 years ago

Added methods utility_re_tr_prod and utility_im_tr_prod.

  function utility_im_tr_prod(a,b)
     !====================================================!
     !                                                    !
     ! Return Im(tr(a.b)), i.e. the imaginary part of the !
     ! trace of the matrix product of a and b.            !
     !                                                    !
     !====================================================!
    use w90_constants, only  : dp,cmplx_0,cmplx_i

    complex(kind=dp), dimension(:,:), intent(in) :: a, b

    real(kind=dp) :: utility_im_tr_prod
    real(kind=dp) :: s
    integer       :: i, j, n, m

    n = min(size(a,1),size(b,2))
    m = min(size(a,2),size(b,1))

    s = 0
    do i=1,n
      do j=1,m
        s = s + aimag(a(i,j) * b(j,i))
      end do
    end do
    utility_im_tr_prod = s
  end function
ponweist commented 10 years ago

The above "trace formula" is evaluated three more times in get_imfgh_k_list (berry.F90, lines 838ff.)

Similar improvement and code cleanup needed. Remove routine get_imf_k_list (duplicated code), make output parameters img_k_list and imh_k_list optional and use solely get_imfgh_k_list.

ponweist commented 10 years ago

Related commits:

Performance analysis for 32sm case running on 64 processes with the following parameters:

kpath = T
kpath_task = curv+morb+bands
kpath_num_points = 500
kpath_bands_colour = spin

begin kpoint_path
G  0.0   0.0   0.0  H  0.5  0.5  -0.5
H  0.5   0.5  -0.5  N  0.5  0.0  0.0
N  0.5   0.0   0.0  G  0.0  0.0  0.0
G  0.0   0.0   0.0  P  0.25  0.25  0.25
P  0.25   0.25   0.25  H  0.5  0.5  -0.5
N  0.5   0.0   0.0  P  0.25  0.25  0.25
end kpoint_path

kslice = F

berry = T
berry_task = ahc,morb,kubo
berry_kmesh = 48 48 48

Performance (in CPU cycles) improvement relative to original code version:

Routine Original Current Speedup factor
(overall) 7.9e14 4.4e14 ~ 1.8
berry_main 5.2e14 1.8e14 ~ 3
get_imf[gh]_k_list 1.3e14 + 5.9e13 4.1e13 ~4.6
ponweist commented 10 years ago

Comparison of the number of matrix products in the innermost loop of get_imfgh_k_list:

Original code Optimized product traces Smarter products
-2Im[f] 4 0 0
-2Im[g] 9 6 2
-2Im[h] 12 9 3
Overall 25 15 5 (+2)*

*) Two matrix products are independent on the innermost loop variable.

ponweist commented 10 years ago

Update for the above performance analysis (32sm case running on 64 processes):

Routine Original Current Speedup factor
berry_main (excluding init) 2.6e14 1.0e14 ~ 2.5
get_imf[gh]_k_list 1.3e14 + 5.9e13 3.6e13 ~ 5.3
fermi energy loop 8.3e13 + 1.5e13 7.5e11 + 7.8e12 ~11.5