Closed ponweist closed 10 years ago
Added methods utility_re_tr_prod
and utility_im_tr_prod
.
num_wann
^3) to O(num_wann
^2). function utility_im_tr_prod(a,b)
!====================================================!
! !
! Return Im(tr(a.b)), i.e. the imaginary part of the !
! trace of the matrix product of a and b. !
! !
!====================================================!
use w90_constants, only : dp,cmplx_0,cmplx_i
complex(kind=dp), dimension(:,:), intent(in) :: a, b
real(kind=dp) :: utility_im_tr_prod
real(kind=dp) :: s
integer :: i, j, n, m
n = min(size(a,1),size(b,2))
m = min(size(a,2),size(b,1))
s = 0
do i=1,n
do j=1,m
s = s + aimag(a(i,j) * b(j,i))
end do
end do
utility_im_tr_prod = s
end function
The above "trace formula" is evaluated three more times in get_imfgh_k_list
(berry.F90, lines 838ff.)
Similar improvement and code cleanup needed. Remove routine get_imf_k_list
(duplicated code), make output parameters img_k_list
and imh_k_list
optional and use solely get_imfgh_k_list
.
Related commits:
Performance analysis for 32sm case running on 64 processes with the following parameters:
kpath = T
kpath_task = curv+morb+bands
kpath_num_points = 500
kpath_bands_colour = spin
begin kpoint_path
G 0.0 0.0 0.0 H 0.5 0.5 -0.5
H 0.5 0.5 -0.5 N 0.5 0.0 0.0
N 0.5 0.0 0.0 G 0.0 0.0 0.0
G 0.0 0.0 0.0 P 0.25 0.25 0.25
P 0.25 0.25 0.25 H 0.5 0.5 -0.5
N 0.5 0.0 0.0 P 0.25 0.25 0.25
end kpoint_path
kslice = F
berry = T
berry_task = ahc,morb,kubo
berry_kmesh = 48 48 48
Performance (in CPU cycles) improvement relative to original code version:
Routine | Original | Current | Speedup factor |
---|---|---|---|
(overall) | 7.9e14 | 4.4e14 | ~ 1.8 |
berry_main |
5.2e14 | 1.8e14 | ~ 3 |
get_imf[gh]_k_list |
1.3e14 + 5.9e13 | 4.1e13 | ~4.6 |
Comparison of the number of matrix products in the innermost loop of get_imfgh_k_list
:
Original code | Optimized product traces | Smarter products | |
---|---|---|---|
-2Im[f] | 4 | 0 | 0 |
-2Im[g] | 9 | 6 | 2 |
-2Im[h] | 12 | 9 | 3 |
Overall | 25 | 15 | 5 (+2)* |
*) Two matrix products are independent on the innermost loop variable.
Update for the above performance analysis (32sm case running on 64 processes):
Routine | Original | Current | Speedup factor |
---|---|---|---|
berry_main (excluding init) |
2.6e14 | 1.0e14 | ~ 2.5 |
get_imf[gh]_k_list |
1.3e14 + 5.9e13 | 3.6e13 | ~ 5.3 |
fermi energy loop | 8.3e13 + 1.5e13 | 7.5e11 + 7.8e12 | ~11.5 |
In
berry.F90
, lines 811ff., the "trace formula" is evaluated inefficiently:For the 32sm case (
berry_kmesh = 48 48 48
), currently 30% of the runtime are spent in this calculation.Note that complete matrix products are calculated although only the diagonal elements of these products are needed.