[Feature suggestion] OpenMP Parallelization

sthartman commented 2 years ago

Hello, I've been working on parallelizing the KPROJ code using OpenMP, I have a minimal working version by now. I've gotten around 10x speedup by running on all 36 cores of a compute node - would you be interested in merging that once I get all the details sorted out and pushed to my fork?

jinxin68 commented 2 years ago

Hi, @sthartman I'm very interested in your version ! When will you update your version?

sthartman commented 2 years ago

I've just updated. I've tested the bands_wt_layer to check that it reproduces the same results as the serial version within rounding error. I don't remember if I ever got the bands_wt_all subroutine working correctly in parallel, since I didn't need it for my particular case. It looks like I tried to parallelize it, it shouldn't be that different than the other one. Unfortunately I don't think I have my input files any more or I'd test its accuracy myself.

mxchen-2020 commented 2 years ago

@sthartman Hi, Steven, I am sorry for the delayed reply. I almost forgot this depository. Many thanks for your work on the program. Have you already updated the subroutine of the program? In fact, my student recently worked out interfaces to the LCAO code ABCUS and Phonopy. In addition, he replaced the FFT part with that from the intel lib, which can also speed up calculations significantly. We will update the program soon.

sthartman commented 2 years ago

the bands_wt_layer in mod_kproj.f of my fork definitely works, i think bands_wt_all also works but it's been 4 months since I did it so I don't remember the details.

jinxin68 commented 2 years ago

Hi, @sthartman . Many thanks for your work on the program. It reproduces the same bandunfolding results as before version. 2×2 graphene was used. Plane-wave energy cutoffs of 400eV were used for the electronic structure calculations. But it's seems to be something wrong with your code when use 6×6 graphene and plane-wave energy cutoffs was 400eV . I try to fix it but failed .Prompt information : Segmentation fault (core dumped).

jinxin68 commented 2 years ago

Hi, @sthartman ! I am very glad to tell you that I have fix this problem . You should use "!$omp parallel" before "!$omp parallel do"（359 row in mod_kproj.f）. But the program was slower than serials version . I think it's because the "!$omp parallel" will create NKPTS*NSPIN times .

sthartman commented 2 years ago

That's interesting. It's possible that the segfault could be a memory problem, the big array is shared between threads but some of the other variables are duplicated. What value of OMP_NUM_THREADS are you using?

jinxin68 commented 2 years ago

@sthartman I didn't set OMP_NUM_THREADS ，so it will use all threads .For my computer , it's value is equal to 16 . Can you test it as I said before？

sthartman commented 2 years ago

The test you're asking about, do you mean putting the $omp parallel in? I can try to do that later today, if I can find any of my input files.

jinxin68 commented 2 years ago

Yes,it is. In addition, I don't think large arrays are shared between threads，because the largest array is “q_pw” . But it's private . Thank you very much for your help！

sthartman commented 2 years ago

Going back and doing some tests, it seems that I have to put export OMP_STACKSIZE=1024m in my batch script, to set the environment variable, or else it segfaults. The default OMP_STACKSIZE is very small, only a few MB, is there a chance that it's overflowing the default stack when you run the larger 6x6 graphene example? The system I'm testing is a big heterointerface supercell with INKPROJ: LZLAYER = .TRUE. zlay1 = 0.13 zlay2 = 0.29 MAT_P2S = 7 3 0 \ 4 7 0 \ 0 0 1 LSORBIT = .TRUE. the WAVECAR is about 49 GB but kproj is only using about 6.1 GB total across 36 threads, according to the top command, so the memory consumption is not terrible overall. I was able to reproduce the 10x speedup.

I tried adding an !$OMP PARALLEL on the line before my $OMP PARALLEL DO, but the compiler (ifort) threw an error ifort -O3 -free -xhost -mkl -qopenmp -CB -traceback -c mod_kproj.f mod_kproj.f(491): error #7918: A RETURN statement is invalid as an exit from a parallel region. return ------^ mod_kproj.f(498): error #7653: There are unterminated OpenMP directive block[s]. WRITE(IUO,'(5X,A)')'ERR: while reading coef. from WAVECAR' ^ mod_kproj.f(493): error #7677: This label has been branched to from outside this PPD or OpenMP block. [220] 220 CONTINUE --^ compilation aborted for mod_kproj.f (code 1) make: *** [mod_kproj.o] Error 1

I don't think it liked the syntax of having two commands to begin the parallel section, but only one OMP END PARALLEL DO. I also tried adding another OMP END PARALLEL after OMP END PARALLEL DO, but it errored with an abort trap signal - I think it considers that to be nested parallelism, but with the wrong syntax.

jinxin68 commented 2 years ago

I got faster speed during using "ulimit -s unlimited" and "export OMP_STACKSIZE=1024m" . As you said ,it's a memory problem . But usually users don't take the initiative to set these. I will try to find a better solution. Thank you for your detailed and useful suggestions !

jinxin68 commented 2 years ago

I try to move "fft_allocate and allocate(q_tmp)" and add "deallocate(q_pw,q_tmp)" into parallel region to aovid memory problem . Though this solves the probelm ,it will slow the parallel speed by 30 % . Because large arrays are allocated and deallocated repeatedly in every threads . Now ,I am quite sure that the memory problem is arised from "q_pw " and "q_tmp"

sthartman commented 2 years ago

That makes sense, the default stacksize is around 4 MB I think, it will be very hard to contain the thread within that size. What are the dimensions of q_pw? I couldn't tell from looking at the fft_allocate subroutine.

If the concern is only that less experienced users will not know to set omp_stacksize correctly, it might be easier just to leave the standard compiler options in the makefile, with a commented-out alternate line which has the -qopenmp option also. If kproj compiled without the -qopenmp option, the compiler should ignore the openmp directives in mod_kproj.f and compile serial code. That way, a user will only get the parallel version if they know what they are doing and choose it on purpose.

jinxin68 commented 2 years ago

The size of q_pw is usually about 100 100 100 * nspinor, which is related to the energy cutoff.I have added ”-qopenmp” in makefile.

We can estimate the size of q_pw, and then add the corresponding ompstack size in makefile.Do you mean this？

sthartman commented 2 years ago

I don't know if it's possible to set environment variables in the makefile, but if it is, that would be the simplest way to do it. Normally I set it as a runtime environment variable in my SLURM batch script, and I just set it to some large value like 1024m. As far as I know, there's no drawback to setting a large value as long as OMP_STACKSIZE x OMP_NUM_THREADS doesn't exceed the computer's memory.

mxchen-2020 / kproj

[Feature suggestion] OpenMP Parallelization #1