materialsproject / atomate2

atomate2 is a library of computational materials science workflows
https://materialsproject.github.io/atomate2/

Dealing with large structures in the phonon workflow for forcefields #754

Open JaGeo opened 6 months ago

JaGeo commented 6 months ago

The current implementation of the phonon workflow works very well for crystalline structures and DFT. However, in the case of forcefields and larger structures without symmetry, a single job doing everything might be preferable. Saving all intermediate structures in any kind of database/file takes more time than the actual computation and might require a massive amount of memory.

I think it would be nice to have such a job in atomate2, as it could reuse many of the current implementations and functions.
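
As a very rough sketch of what such a single job could look like (EMT only stands in for a real forcefield here, nothing from the existing makers is reused yet, and all names and defaults below are placeholders):

```python
# Hypothetical sketch of a single combined phonon job (not existing atomate2 code).
# EMT is only a stand-in for a real forcefield calculator such as CHGNet or MACE.
import numpy as np
from ase import Atoms
from ase.calculators.emt import EMT
from jobflow import job
from phonopy import Phonopy
from pymatgen.core import Structure
from pymatgen.io.phonopy import get_phonopy_structure


@job
def phonons_in_one_job(structure: Structure, supercell_matrix, distance: float = 0.01):
    """Generate displacements, evaluate forces and build force constants in memory."""
    phonon = Phonopy(get_phonopy_structure(structure), supercell_matrix=supercell_matrix)
    phonon.generate_displacements(distance=distance)

    calc = EMT()  # placeholder: swap in any ASE calculator / ML potential
    forces = []
    for supercell in phonon.supercells_with_displacements:
        atoms = Atoms(
            symbols=supercell.symbols,
            cell=supercell.cell,
            scaled_positions=supercell.scaled_positions,
            pbc=True,
        )
        atoms.calc = calc
        forces.append(atoms.get_forces())

    phonon.forces = np.array(forces)
    phonon.produce_force_constants()

    # return only small, useful quantities instead of all displaced structures
    return {"force_constants": phonon.force_constants}
```

The point is simply that the displaced supercells never leave the job's memory and never hit the Store.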

Tagging @gpetretto , as we have recently discussed this in the context of jobflow-remote.

@utf: any comments?

JaGeo commented 6 months ago

Maybe also a more general question, though one more related to jobflow: can we implement functionality to automatically combine two jobs into one after they have been defined?

gpetretto commented 6 months ago

Thanks @JaGeo for opening this. I am linking the jobflow-remote issue, as there is a more detailed description of the problems encountered: https://github.com/Matgenix/jobflow-remote/issues/79. And I believe this is also linked to #515.

As mentioned in the other issues, when dealing with forcefields the size and the number of the structures involved are going to be much larger than what we are used to in DFT calculations. This implies a much larger memory footprint and having to deal with I/O for large data sets. So, while JSON serialization is a very practical choice for standard workflows, it quickly becomes a bottleneck as the size increases. In the case of the phonon workflow, just calling jsanitize on the list of 24000 structures may take hours. In addition, would it really be worth stuffing the output Store with GBs of data that are likely useless and very fast to regenerate if needed?
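
To get an idea of the order of magnitude without running the full workflow, the serialization overhead can be estimated on a smaller, artificial sample along these lines (the structure, supercell size and count below are arbitrary):

```python
# Rough timing sketch for the serialization overhead only; all sizes are arbitrary.
import time

from monty.json import jsanitize
from pymatgen.core import Lattice, Structure

structure = Structure(Lattice.cubic(3.6), ["Cu"], [[0, 0, 0]])
structure.make_supercell([10, 10, 10])  # ~1000-atom supercell
structures = [structure.copy() for _ in range(100)]  # stand-in for displaced cells

start = time.perf_counter()
jsanitize(structures, strict=True)  # roughly what happens when the output is stored
print(f"jsanitize on {len(structures)} structures: {time.perf_counter() - start:.1f} s")
```

Scaling the measured time up to 24000 structures already gives a feeling for whether "hours" is realistic on a given machine.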

In my opinion this calls for a different approach. The first and simplest solution is definitely to reduce the number of jobs, to minimize the amount of read/write operations on the DB. However, as @JaGeo pointed out in the jobflow-remote issue, each job may still take some time, depending on the kind of potential used. Another issue is that memory use will grow over time as forces for more structures are calculated, which means the job could run out of memory even after a long running time. Still, this might be a good solution for a large range of structure sizes.

However, I think another possibility would be to start treating this big data as we would treat a charge density or wavefunction file. I don't have a complete solution, but for the phonon use case we may, for example, consider something like this:
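
As a very rough illustration of what I mean (the .npz format, the file names and the job split below are arbitrary placeholder choices, not an existing atomate2 API):

```python
# Hypothetical sketch: keep large arrays out of the output Store and pass only a
# file path between jobs. Format, file names and shapes are placeholder choices.
from pathlib import Path

import numpy as np
from jobflow import job


@job
def compute_forces_to_file(displacement_file: str, workdir: str = ".") -> dict:
    """Read displaced supercells from disk and write the forces to an .npz file."""
    positions = np.load(displacement_file)["positions"]  # shape (n_disp, n_atoms, 3)

    # placeholder: a real job would evaluate the forcefield here
    forces = np.zeros_like(positions)

    forces_file = Path(workdir) / "forces.npz"
    np.savez_compressed(forces_file, forces=forces)
    # only the (small) file path goes through the output Store
    return {"forces_file": str(forces_file)}


@job
def postprocess_phonons(forces_file: str) -> dict:
    """Load the forces only where they are actually needed."""
    forces = np.load(forces_file)["forces"]
    # ... hand the array over to phonopy here ...
    return {"n_displacements": int(forces.shape[0])}
```

This of course assumes the jobs can see the same filesystem, exactly as we already assume when passing charge densities or wavefunctions via previous directories.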

I did not have time to test this kind of approach, but the key points are to avoid reading and dumping the full list of structures as much as possible, to use faster formats, and to avoid storing very large data in the DB if they are not really useful.

One more point that would be worth testing in the case of the phonon flow is how phonopy behaves in the postprocessing part when dealing with 24000 structures. Maybe in that case, too, memory could be a bottleneck. It may be worth running all the steps in a single job while leaving the last one aside, so that phonopy's requirements can be benchmarked separately. A very fast Calculator could be used for this (e.g. EMT). It would be convenient to know this before starting a real calculation with a proper potential. Did you maybe already do such a test, @JaGeo?
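
Just to sketch what I have in mind for such a benchmark: one could even go a step further than EMT and feed dummy forces, so that only phonopy's own bookkeeping is timed (all sizes and species below are arbitrary and only meant to produce a symmetry-free cell):

```python
# Rough benchmark sketch for phonopy's post-processing only; dummy random forces
# are used so that no calculator at all enters the timing. Sizes are arbitrary.
import time

import numpy as np
from phonopy import Phonopy
from phonopy.structure.atoms import PhonopyAtoms

rng = np.random.default_rng(42)

# a deliberately low-symmetry cell so that many displacements are generated
unitcell = PhonopyAtoms(
    symbols=["Al", "Cu", "Ni", "Pd"],
    cell=np.diag([4.05, 4.11, 4.21]),
    scaled_positions=[[0, 0, 0], [0.5, 0.5, 0.03], [0.5, 0.02, 0.5], [0.01, 0.5, 0.5]],
)
phonon = Phonopy(unitcell, supercell_matrix=np.diag([8, 8, 8]))
phonon.generate_displacements(distance=0.01)

n_disp = len(phonon.supercells_with_displacements)
n_atoms = len(phonon.supercell.symbols)
print(f"{n_disp} displacements of {n_atoms} atoms each")

phonon.forces = rng.normal(scale=0.1, size=(n_disp, n_atoms, 3))

start = time.perf_counter()
phonon.produce_force_constants()
phonon.run_mesh([20, 20, 20])
phonon.run_thermal_properties(t_min=0, t_max=1000, t_step=100)
print(f"post-processing took {time.perf_counter() - start:.1f} s")
```

Rerunning the same script with a larger supercell matrix (or more atoms in the unit cell) should show quite quickly where phonopy's time and memory start to blow up.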

JaGeo commented 6 months ago

@gpetretto Independently of this particular use case, implementing a job with phono3py would require similar considerations. Phono3py can definitely deal with such a large number of structures.

gpetretto commented 6 months ago

Good point. To me this is then an even more compelling reason to focus on a solution that can cover a larger number of cases. If you have 1000000 structures, running them sequentially will not be an option, even if each calculation requires only a few seconds. That is assuming phono3py can then deal with this number of structures in the postprocessing: it has been some time since I last used it, but it seemed to take quite some time to run even in cases with roughly one thousand structures.

JaGeo commented 6 months ago

@gpetretto I have seen posters where people did several tens of thousands of structures with DFT. I haven't tried it myself. It might still require a large amount of memory, though. And, agreed, we need to think about a different implementation.