issp-center-dev / HPhi

Quantum Lattice Model Simulator Package
https://www.pasums.issp.u-tokyo.ac.jp/hphi/en/
GNU General Public License v3.0

Help with efficient parallel settings for benchmark ED calculations #101

Closed vamshimohank closed 3 weeks ago

vamshimohank commented 4 years ago

Dear Authors,

First of all thanks for making this very efficient code available for the community.

I am working on benchmarking our recently developed method based on FCIQMC for solving the J1-J2 Heisenberg model on a square lattice, and I want to use the HPhi ED solver as the benchmark.

I have been running calculations with HPhi for lattice sizes up to 36 sites to obtain the total energy, but I am not sure whether I am employing the best parallel mode of calculation. Since I want to compare not just the total energies but also the computational requirements, I want to make sure that I am running the ED calculations with HPhi in the most efficient fashion.

I would be grateful if you could share computational efficiency benchmarks if they have already been done. If not, I am willing to do them myself and would appreciate some help from the implementors. This way, we can be sure that the code is used in the best possible way.

We are also open to doing this in collaboration if there is interest from the developers.

Thanks and Regards,
Vamshi M Katukuri
Scientist, Max Planck Institute for Solid State Physics, Germany

tmisawa commented 4 years ago

Dear Dr. Katukuri

Thank you for having an interest in HPhi.

In our paper on HPhi (M. Kawamura et al., Comput. Phys. Commun. 2017, https://www.sciencedirect.com/science/article/pii/S0010465517301200?via%3Dihub ), we show benchmark results on the computational efficiency for the 18-site Hubbard model and the 36-site Heisenberg model on the kagome lattice (see Figs. 6 and 7 and Table 1).

From these benchmarks, on a standard Intel CPU, we find that process-major computation is faster than thread-major computation. Namely, for 1536 cores, a process-major run (256 processes with 6 threads each) is faster than a thread-major run (64 processes with 24 threads each).
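For concreteness, here is a minimal launcher sketch for such a process-major hybrid MPI/OpenMP run. This is not part of HPhi itself; the launcher name `mpiexec`, the binary path `./HPhi`, and the standard-mode input file `stan.in` are assumptions to be adapted to your MPI installation, build directory, and scheduler:

```python
import os
import subprocess

# Assumed paths and launcher; adjust for your cluster and scheduler.
HPHI_BINARY = "./HPhi"   # HPhi executable (standard mode: HPhi -s <input>)
INPUT_FILE = "stan.in"   # standard-mode input file
TOTAL_CORES = 1536

# Process-major split: many MPI processes, few OpenMP threads per process.
n_procs, n_threads = 256, 6
assert n_procs * n_threads == TOTAL_CORES

env = os.environ.copy()
env["OMP_NUM_THREADS"] = str(n_threads)

# Launch: 256 MPI processes, each spawning 6 OpenMP threads.
subprocess.run(
    ["mpiexec", "-np", str(n_procs), HPHI_BINARY, "-s", INPUT_FILE],
    env=env,
    check=True,
)
```

On a batch system the same decomposition is usually expressed through the scheduler's resource request (one MPI rank per 6 cores in this example); the point is simply to keep the number of MPI processes larger than the number of OpenMP threads per process.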

I hope that this information will be helpful for you.

Best, Takahiro Misawa

vamshimohank commented 4 years ago

Dear Dr. Misawa,

Thank you for your reply. The information you provided is certainly helpful. What I understand is that the available RAM per core is the limiting factor: nodes with rather small RAM can only use the additional cores as OpenMP threads.

I am particularly interested in the memory requirement and the time needed for computing the eigenvectors. I presume this requires a huge amount of memory. Is there a ballpark figure for the memory requirements?

Regards Vamshi

tmisawa commented 4 years ago

Dear Dr. Katukuri

I am sorry for the late reply.

Attached below is a slide with an estimate of the vectors needed for the Lanczos method and the CG method. In addition to those, HPhi uses one additional double-precision vector and one integer vector.

Thus, in summary, for the Lanczos method, 6.5 (3.5) double-precision vectors are necessary for obtaining the eigenvector (the eigenvalue). For the CG method, 7.5 vectors are necessary.

For a 36-site system with Sz = 0, the total Hilbert-space dimension is estimated as Binomial(36,18) ~ 9×10^9. So, for example, 9×10^9 × 8 (bytes) × 3.5 ~ 254 GB is necessary for obtaining the eigenvalue with the Lanczos method. If you use the CG method, about 544 GB is necessary.
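As a quick check of this arithmetic, the short Python sketch below reproduces the estimates, assuming 8 bytes per vector element and the vector counts quoted above (3.5 / 6.5 for Lanczos without/with the eigenvector, 7.5 for CG):

```python
from math import comb

# Hilbert-space dimension of the 36-site, Sz = 0 Heisenberg model.
dim = comb(36, 18)        # ~ 9.08e9

BYTES_PER_ELEMENT = 8     # double precision, as in the estimate above

# Vector counts quoted above.
for label, n_vec in [("Lanczos (eigenvalue)", 3.5),
                     ("Lanczos (eigenvector)", 6.5),
                     ("CG", 7.5)]:
    mem_gb = dim * BYTES_PER_ELEMENT * n_vec / 1e9
    print(f"{label:25s}: {mem_gb:6.1f} GB")

# Expected output (roughly):
# Lanczos (eigenvalue)     :  254.1 GB
# Lanczos (eigenvector)    :  471.9 GB
# CG                       :  544.5 GB
```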

Best, Takahiro Misawa

memory.pdf