Closed: @JonathanSchmidt1 closed this issue 6 months ago
Hi @JonathanSchmidt1, thanks very much for raising this issue.
For debugging purposes, would you be able to send me the `CONTCAR`, `vasprun.xml`, and `OUTCAR` files for a couple of the failed relaxations? Also, could you share which workflow you ran and whether you made any modifications to the default settings?
I used the `StaticMaker` (so no relaxations; I was just talking about electronic steps) with modified INCAR, POTCAR, and KPOINTS. I can share the files and FW.jsons with the parameter updates. How would you like me to share the files, as they are too large for GitHub or email? Google Drive?
In that case, I'm quite surprised that the task document is so large. I wonder if it is the orbital projections that are causing the task document to be so large. If you could upload them to Google Drive and share them with me over email (a [dot] ganose [at] imperial [dot] ac [dot] uk), that would be great!
I sent you an email with two examples. Thank you for taking a look. Checking the outputs of the successful calculations, the majority of the memory seems to be taken up by `['calcs_reversed'][0]['output']['outcar']['onsite_density_matrices']`. I assume the on-site density matrices are saved for every electronic step, so they become larger for the longer calculations; e.g., reading them from the OUTCAR of the failed example, they already take up 12 MB. Is there anything else in the output that scales with the number of electronic steps?
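For anyone hitting the same problem, the kind of per-field size check described above can be done generically. This is only a sketch: the nested dict below is a stand-in for a real task document, not the actual emmet schema.

```python
# Sketch: measure the JSON-serialised size of each top-level field of a
# task-document-like dict, to find which field dominates the total size.
import json

def field_sizes(doc: dict) -> dict:
    """Return the JSON-serialised size in bytes of each top-level field."""
    return {key: len(json.dumps(value)) for key, value in doc.items()}

# Mock task document; a real one would be fetched from the task store.
task_doc = {
    "task_id": "mp-0000",
    "output": {"energy": -123.45},
    "calcs_reversed": [
        {"output": {"outcar": {"onsite_density_matrices": [[0.0] * 100] * 500}}}
    ],
}

sizes = field_sizes(task_doc)
largest = max(sizes, key=sizes.get)
print(largest, sizes[largest])
```

The same trick can be applied recursively to drill down into `calcs_reversed`.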
I've seen this before. I think this field is not stored intentionally, but rather as a consequence of pymatgen parsing the data via the `Outcar` object, with everything then being stored by default since it is not otherwise specified in the schema.
I cannot find a reference to the data being used anywhere in the MP stack (atomate v1 or v2, emmet), except indeed to remove the fields if the document is too large (this looks like @tschaume's work). If someone can confirm these aren't being used and can't think of a motivating reason to keep them, perhaps a sensible solution is to disable storing them by default, and add a kwarg for the `VaspDrone` in emmet instead, for the users for whom it's important?
Otherwise, perhaps the `TaskDoc` could be slimmed down and more data stored via the `data_store` instead?
> Is there anything else in the output that scales with the number of electronic steps?
I like the idea of annotating the `TaskDoc` schema with how fields scale with system size / number of steps too; it might be helpful. We do also see issues when people run calculations with e.g. many ionic steps.
Thanks very much for sharing the files. Agreed that the issue is entirely due to `calcs_reversed.0.output.outcar.onsite_density_matrices`, which accounts for almost all of the document size.
The actual size of the remaining task doc is tiny in comparison (~0.4 MB in both cases). As @mkhorton said, `onsite_density_matrices` is not explicitly listed in the task document schema. Instead, it just enters because we store the entire serialised `Outcar` object.
The main options seem to be:

1. Store `onsite_density_matrices` in the data store (e.g., add it to the list of data objects).
2. Remove `onsite_density_matrices` from the `TaskDoc` entirely, since it seems like we never explicitly asked for it.

If there is no objection, I would prefer to go for option 2. This also raises the question of whether we should be blindly storing everything from the `Outcar` object, or whether we should explicitly select what we'd like to store. I'll note that the blind storage approach has been useful for me in the past: I was able to extract core state eigenvalues from the outcar field, which likely wouldn't have been explicitly selected since they aren't generally useful.
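Option 2 could look something like the following minimal sketch. The helper name and document layout are illustrative, not the real emmet/atomate2 API:

```python
# Illustrative sketch of option 2: strip `onsite_density_matrices` from the
# serialised Outcar data before the task document is stored. Both the helper
# and the document layout are hypothetical stand-ins.

def drop_onsite_density_matrices(task_doc: dict) -> dict:
    """Remove the on-site density matrices from every entry in calcs_reversed."""
    for calc in task_doc.get("calcs_reversed", []):
        outcar = calc.get("output", {}).get("outcar", {})
        outcar.pop("onsite_density_matrices", None)  # no error if absent
    return task_doc

doc = {
    "calcs_reversed": [
        {"output": {"outcar": {"onsite_density_matrices": [[0.1, 0.2]],
                               "core_state_eigen": [1.0]}}}
    ]
}
slimmed = drop_onsite_density_matrices(doc)
```

Other `Outcar` fields (such as the core state eigenvalues mentioned above) would be left untouched.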
@mkhorton I just looked at the list of fields that are removed if the task document is too large in the emmet CLI:

- `normalmode_eigenvecs`
- `force_constants`
- `outcar.onsite_density_matrices`

Of these, we already store the first two in the data store. See: https://github.com/materialsproject/atomate2/blob/af667d8385da42e532a72f77021c23c17c62ac80/src/atomate2/vasp/jobs/base.py#L39-L52

However, IMO the first two are more immediately useful for phonon workflows, whereas I'm not sure whether anyone uses `onsite_density_matrices`.
Yeah, we've been removing these 3 fields from the task docs for a few years now and have never had an issue come up where we thought that it would have been better to keep them :) Since the first two are already in the data store, it seems that it would make sense to put the `onsite_density_matrices` in there, too.
I agree with removing it (ideally, perhaps configurable).
In terms of the “blind storage” approach, I agree I’ve often found data useful that I didn’t initially intend to store. Perhaps a compromise is to explicitly specify these fields in the schema, such that if a new field is added later it will trigger a validation error and we can make an explicit decision whether to include it? This would avoid cases where we’re accidentally storing additional fields.
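A minimal illustration of that compromise, using a hand-rolled validator rather than the actual pydantic-based schema (the allowed field names are made up for the example):

```python
# Sketch of "explicit fields only" validation: any Outcar field not named in
# the schema raises, forcing a deliberate decision before new data is stored.
# The allowed-field list is illustrative, not the real TaskDoc schema.
ALLOWED_OUTCAR_FIELDS = {"efermi", "magnetization", "core_state_eigen"}

def validate_outcar_fields(outcar_data: dict) -> dict:
    unexpected = set(outcar_data) - ALLOWED_OUTCAR_FIELDS
    if unexpected:
        raise ValueError(f"unexpected Outcar fields: {sorted(unexpected)}")
    return outcar_data

validate_outcar_fields({"efermi": 1.5})  # passes silently
try:
    validate_outcar_fields({"efermi": 1.5, "onsite_density_matrices": []})
    caught = False
except ValueError:
    caught = True
```

With pydantic-style schemas the same effect is usually achieved by forbidding extra fields on the model.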
@JonathanSchmidt1, would you be willing to submit a PR to emmet making the storage of `onsite_density_matrices` optional and turned off by default?
**Describe the bug**
I ran around 1200 calculations of 160-atom cells; 3 of the calculations converged successfully but then fizzled due to the size of the result. The resulting objects were between 18 and 22 MB.
The only common factor that I could determine is that the calculations that fizzled took more than 378 electronic steps, while all the successful ones were below 305. I also confirmed that when I query the output of the successful calculations, it is significantly larger for calculations that took more steps. However, even the output of a successful calculation that took 300 steps only takes up around 10 MB as a text file (dictionary printed to file).
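As a rough order-of-magnitude check (assuming the document size grows roughly linearly with the number of electronic steps, which the numbers in this report only approximately support, since there is also a step-independent base size):

```python
# Back-of-envelope estimate: per-step cost from the observed ~10 MB at
# ~300 steps, and the step count at which a document would cross MongoDB's
# 16 MB BSON limit. Purely illustrative; real scaling also depends on the
# number of atoms with on-site density matrices.
observed_mb, observed_steps = 10, 300
mb_per_step = observed_mb / observed_steps   # ~0.033 MB per electronic step
limit_mb = 16
steps_at_limit = limit_mb / mb_per_step
print(round(steps_at_limit))
```

This lands in the same ballpark as the observed failures above ~378 steps, given that the failed objects were also somewhat larger per step than this linear fit predicts.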
**To Reproduce**
I would guess: run a collinear calculation of a structure with more than 160 atoms that takes more than 380 electronic steps. If anyone is interested in repeating some of the failed calculations, I can provide the input files.
**Expected behavior**
I would expect that if the object is larger than 16 MB, it should be saved in the data store instead of producing an error. But maybe there is a good reason not to have that behavior.
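The expected fallback could be sketched like this; the 16 MB figure is MongoDB's per-document BSON limit, while the store interfaces below are plain dicts standing in for the real maggma stores, and the `@data_store_key` reference format is invented for the example:

```python
# Sketch of the suggested fallback: if a serialised task document exceeds the
# 16 MB MongoDB BSON document limit, offload the oversized field to a separate
# data store instead of failing the job. Illustrative only.
import json

BSON_LIMIT = 16 * 1024 * 1024  # MongoDB's per-document limit, in bytes

def store_task_doc(doc: dict, doc_store: dict, data_store: dict) -> None:
    if len(json.dumps(doc).encode()) > BSON_LIMIT:
        # Offload the known-large field and keep only a reference in the doc.
        for calc in doc.get("calcs_reversed", []):
            outcar = calc.get("output", {}).get("outcar", {})
            matrices = outcar.pop("onsite_density_matrices", None)
            if matrices is not None:
                key = f"{doc['task_id']}:onsite_density_matrices"
                data_store[key] = matrices
                outcar["onsite_density_matrices"] = {"@data_store_key": key}
    doc_store[doc["task_id"]] = doc

# Usage: a doc whose on-site density matrices push it well past 16 MB.
big_doc = {
    "task_id": "t-1",
    "calcs_reversed": [
        {"output": {"outcar": {"onsite_density_matrices": [[0.0] * 100] * 60000}}}
    ],
}
doc_store, data_store = {}, {}
store_task_doc(big_doc, doc_store, data_store)
```

Small documents pass through untouched; only oversized ones get the field rerouted.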