phoebe-team / phoebe

A high-performance framework for solving phonon and electron Boltzmann equations
https://phoebe-team.github.io/phoebe/
MIT License

electronWannierTransport calculations crash #225

Open SiyuChen opened 1 month ago

SiyuChen commented 1 month ago

Hi, I am doing an electron transport calculation with Phoebe, but it unfortunately crashed.

The following is the screen output before it crashed:

Started parsing of el-ph interaction.
Allocating 0.08497238 (GB) (per MPI process) for the el-ph coupling matrix.
Finished parsing of el-ph interaction.

Computing electronic band structure.

Statistical parameters for the calculation
Fermi level: 8.30632019 (eV)
Index, temperature, chemical potential, doping concentration
iCalc = 0, T = 150.000000 (K), mu = 8.431200 (eV), n = -2.680797e+21 (cm^-3)

Applying a population window discarding states with df/dT < 1.000000e-10.
Window selection reduced electronic band structure from 1755000 to 421334 states.
Symmetries reduced electronic band structure from 421334 to 107650 states.
Done computing electronic band structure.

Snapshot of Phoebe's memory usage: VM: 106.0459 (GB). RSS: 51.3745 (GB)

Computing phonon band structure. Allocating 0.0981 GB (per MPI process).

After this, it crashed, throwing the following error:

terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_fill_insert
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_fill_insert

Do you have any clue regarding this error?

jcoulter12 commented 1 month ago

Hi Siyu,

Glad to hear you're using the code. Let's see if we can get to the bottom of this.

First, can you confirm for me that you have the most recent version of the code (just run git pull to be sure), and give me a little info about the resources you used to run this calculation?

Also, does this happen if you use a smaller k-point mesh? I just want to make sure nothing is overflowing or running out of memory first.

Thanks, Jenny

SiyuChen commented 1 month ago

Hi Jenny

Thank you! I confirm that I am using the most recent version of the code.

git log
commit 9b667516b70f1baaad33dce1ae86acc884a225a2 (HEAD -> develop, origin/develop, origin/HEAD)
Merge: 4f0e6195 0a85a21f
Author: Jenny Coulter <jcoulter@flatironinstitute.org>
Date:   Mon Aug 12 17:31:17 2024 -0400

    Merge pull request #221 from mir-group/sgplibCMakeFix

    Update spglib to use FetchContent in CMake

I also confirm that the calculation finishes properly if a smaller k-mesh is used. For example, with kMesh = [15,27,39], the job completes on 5 compute nodes (more specifically, 15 MPI processes x 18 OpenMP threads). I am using a cluster in which each node has 56 CPUs and 384 GiB of RAM, that is, 6840 MB per CPU.

However, with kMesh = [20,36,52], Phoebe always crashes with the above error, no matter how much I increase the memory. I have tried launching the job with 10 nodes (10 MPI processes x 56 OpenMP threads, maximizing the memory available to each MPI process), but it still does not work.
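(Illustrative aside: a minimal sketch of the per-rank memory arithmetic behind these two layouts, assuming MPI ranks are spread evenly across the nodes; the helper is hypothetical and not part of Phoebe.)

```cpp
#include <cstdio>

int main() {
  // Cluster figures quoted above: each node has 56 CPUs and 384 GiB of RAM.
  const double gibPerNode = 384.0;

  struct Layout { const char* label; int nodes; int mpiRanks; };
  const Layout layouts[] = {
      {"kMesh [15,27,39]: 5 nodes, 15 MPI x 18 OpenMP", 5, 15},
      {"kMesh [20,36,52]: 10 nodes, 10 MPI x 56 OpenMP", 10, 10},
  };

  for (const Layout& l : layouts) {
    // Memory available to one MPI process = node RAM / MPI ranks per node.
    const double ranksPerNode = static_cast<double>(l.mpiRanks) / l.nodes;
    std::printf("%s -> %.0f GiB per MPI rank\n",
                l.label, gibPerNode / ranksPerNode);
  }
  return 0;
}
```

Under that assumption, the first layout gives each rank roughly 128 GiB, while the second gives each rank a whole node (384 GiB).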

Happy to provide more information if needed.

Best wishes Siyu

jcoulter12 commented 1 month ago

Hi Siyu,

Indeed, the code should be able to scale that far in kMesh (as well as pretty far beyond that -- I've been able to run kMesh = [350,350,350] for some materials). The memory footprint of your job also doesn't seem very large.

I have some ideas about what might be happening, and likely it's going to be a super minor fix on my part. I can probably fix this in the next day or so.

I'm sure you don't want to share your data broadly, but if you are willing to let me look at your files, we can communicate by email. My email address is listed on my GitHub account page under my name and photo -- please write and I'll provide a place for you to upload the data.

Thanks for reporting this; we appreciate it when users let us know about these things.

Jenny

jcoulter12 commented 1 month ago

Hi Siyu,

Would you mind checking out the branch named activeBandsVelocitiesOverflowBug? You can do this by going to your phoebe directory and typing:

git pull
git checkout activeBandsVelocitiesOverflowBug
cd build
make phoebe

The last two lines rebuild the code in your build directory. Let me know if this somehow does not fix your issue, but I was able to reproduce the error and confirm the fix on my machine.

Basically, it was as I suspected -- your system is so big that you managed to overflow an integer variable storing the number of band velocities for the phonon band structure :). I just had to change it from int to size_t.
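(Illustrative aside, not Phoebe's actual code: a minimal sketch of the overflow pattern described above, using the figures quoted later in this thread, 180 phonon bands and 426465 q-points, and assuming one velocity entry is stored per pair of bands, Cartesian direction, and q-point.)

```cpp
#include <climits>
#include <cstdio>

int main() {
  // Figures quoted later in this thread.
  const long long numBands = 180;
  const long long numQPoints = 426465;

  // One velocity entry per band pair, per Cartesian direction, per q-point.
  const long long numElements = numBands * numBands * numQPoints * 3;

  // This count no longer fits in a 32-bit int, so squeezing it into an int
  // (e.g. when sizing a container) corrupts the value and leads to errors
  // like std::length_error; a size_t (or long long) holds it fine.
  std::printf("elements needed: %lld\n", numElements);  // ~4.1e10
  std::printf("INT_MAX:         %d\n", INT_MAX);        // ~2.1e9
  return 0;
}
```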

There is a chance you could encounter some other such error just because you have so many phonon bands. Let me know if something else fails; usually these issues are quite fast to find (and, ideally, fix).

Best, Jenny

SiyuChen commented 1 month ago

Hi Jenny,

I have done what you suggested. However, Phoebe now throws segmentation faults:

srun: error: cpu-p-252: task 10: Segmentation fault
srun: error: cpu-p-252: task 9: Segmentation fault
srun: error: cpu-p-252: task 11: Segmentation fault
srun: error: cpu-p-251: task 8: Segmentation fault
srun: error: cpu-p-597: task 12: Segmentation fault

Does your Phoebe output get as far as "started computing scattering matrix"? Mine still gets stuck at "Computing phonon band structure. Allocating 0.0502 GB (per MPI process)."

jcoulter12 commented 1 month ago

Hi Siyu,

Ok, I was able to reproduce this -- for me, it's not a seg fault but a very reasonable out-of-memory error. This is because you have 180 phonon bands and 426465 q-points, and each group velocity (3 dimensions) is complex (16 bytes), which means allocating a container to store the band velocities that comes to 663 gigabytes.
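(Illustrative aside: the 663 GB figure works out if one complex<double> is stored for every pair of bands, every Cartesian direction, and every q-point. A hypothetical back-of-the-envelope estimator under that assumption, not Phoebe's actual allocation code:)

```cpp
#include <cstdio>

// Rough container size, in GB, assuming one complex<double> (16 bytes)
// per band pair, per Cartesian direction, per q-point.
double velocityStorageGB(long long numBands, long long numQPoints) {
  const long long elements = numBands * numBands * numQPoints * 3;
  return static_cast<double>(elements) * 16.0 / 1e9;
}

int main() {
  // Figures from this calculation: 180 phonon bands, 426465 q-points.
  std::printf("estimated band-velocity storage: %.0f GB\n",
              velocityStorageGB(180, 426465));  // ~663 GB
  return 0;
}
```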

This takes a bit more work to get around, but it could possibly be done. A fast workaround may be to reduce the size of the population window, like this:

windowType = "population"
windowPopulationLimit = 1e-6

However, this can be dangerous if one wants to use the Wigner correction, as contributions to it can come from far away from the Fermi energy (I think this is noted in the tutorial as well), and in general one should then also converge with respect to the window population limit. It would still give you an idea of whether the RTA result is already converged here.

I think there is a workaround to this that I've wanted to implement anyway. Let me investigate the difficulty of that change and get back to you in ~ a day.

Jenny