materialsproject / atomate2

atomate2 is a library of computational materials science workflows
https://materialsproject.github.io/atomate2/

BUG: Static VASP job failing due to BSONObj size #671

Closed: JonathanSchmidt1 closed this issue 6 months ago

JonathanSchmidt1 commented 8 months ago

Describe the bug: I ran around 1200 calculations of 160-atom cells. Three of the calculations converged successfully but then fizzled due to the size of the result; the resulting objects were between 18 and 22 MB.

"_stacktrace": "Traceback (most recent call last):\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/maggma/stores/mongolike.py\", line 404, in update\n    self._collection.bulk_write(requests, ordered=False)\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/_csot.py\", line 106, in csot_wrapper\n    return func(self, *args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/collection.py\", line 548, in bulk_write\n    bulk_api_result = blk.execute(write_concern, session)\n                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/bulk.py\", line 514, in execute\n    return self.execute_command(generator, write_concern, session)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/bulk.py\", line 391, in execute_command\n    client._retry_with_session(self.is_retryable, retryable_bulk, s, self)\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/mongo_client.py\", line 1360, in _retry_with_session\n    return self._retry_internal(retryable, func, session, bulk)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/_csot.py\", line 106, in csot_wrapper\n    return func(self, *args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/mongo_client.py\", line 1401, in _retry_internal\n    return func(session, sock_info, retryable)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/bulk.py\", line 385, in retryable_bulk\n    self._execute_command(\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/bulk.py\", line 338, in _execute_command\n    result, to_send = bwc.execute(cmd, ops, client)\n                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/message.py\", line 841, in execute\n    result = self.write_command(cmd, request_id, msg, to_send)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/helpers.py\", line 279, in inner\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/message.py\", line 920, in write_command\n    reply = self.sock_info.write_command(request_id, msg, self.codec)\n            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/pool.py\", line 969, in write_command\n    helpers._check_command_response(result, self.max_wire_version)\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/pymongo/helpers.py\", line 194, in _check_command_response\n    raise OperationFailure(errmsg, code, response, max_wire_version)\npymongo.errors.OperationFailure: BSONObj size: 18424246 (0x11921B6) is invalid. 
Size must be between 0 and 16793600(16MB) First element: q: { uuid: \"19b27480-76df-45ba-8b66-d3c81e16c936\", index: 1 }, full error: {'ok': 0.0, 'errmsg': 'BSONObj size: 18424246 (0x11921B6) is invalid. Size must be between 0 and 16793600(16MB) First element: q: { uuid: \"19b27480-76df-45ba-8b66-d3c81e16c936\", index: 1 }', 'code': 10334, 'codeName': 'BSONObjectTooLarge'}\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/fireworks/core/rocket.py\", line 261, in run\n    m_action = t.run_task(my_spec)\n               ^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/jobflow/managers/fireworks.py\", line 160, in run_task\n    response = job.run(store=store)\n               ^^^^^^^^^^^^^^^^^^^^\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/jobflow/core/job.py\", line 607, in run\n    store.update(data, key=[\"uuid\", \"index\"], save=save)\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/jobflow/core/store.py\", line 328, in update\n    self.docs_store.update(dict_docs, key=key)\n  File \"/users/jschmidt/anaconda3/envs/fireworks/lib/python3.11/site-packages/maggma/stores/mongolike.py\", line 406, in update\n    if self.safe_update:\n       ^^^^^^^^^^^^^^^^\nAttributeError: 'MongoURIStore' object has no attribute 'safe_update'\n",

The only common factor I could determine is that the calculations that fizzled took more than 378 electronic steps, while all the successful ones were below 305. I also confirmed that if I query the output of the successful calculations, it is significantly larger for calculations that took more steps. However, even the output of a successful calculation that took 300 steps only takes up around 10 MB as a text file (dictionary printed to file).

To reproduce: I would guess, run a collinear calculation of a structure with more than 160 atoms that takes more than 380 electronic steps. If anyone is interested in repeating some of the failed calculations, I can provide the input files.

Expected behavior: I would expect that if the object is larger than 16 MB, it would be saved in the data store instead of producing an error. But maybe there is a good reason not to have that behavior.
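jobflow does already support routing selected outputs to an additional store; a minimal sketch of that mechanism (the store setup and the `big_array` output key are illustrative, not from this issue):

```python
from jobflow import Flow, JobStore, job, run_locally
from maggma.stores import MemoryStore

# A JobStore with an extra "data" store. With a backend such as GridFS or S3,
# objects routed there escape MongoDB's 16 MB per-document limit; MemoryStore
# is used here only to keep the sketch self-contained.
store = JobStore(MemoryStore(), additional_stores={"data": MemoryStore()})

@job(data="big_array")  # route the "big_array" output key to the "data" store
def make_big_output():
    return {"summary": "small metadata", "big_array": list(range(100_000))}

run_locally(Flow([make_big_output()]), store=store)
```

The catch, as this issue shows, is that a field only gets routed this way if the job declares it as a data object; anything undeclared lands in the docs store.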

utf commented 8 months ago

Hi @JonathanSchmidt1, thanks very much for raising this issue.

For debugging purposes, would you be able to send me the CONTCAR, vasprun.xml, and OUTCAR files for a couple of the failed relaxations? Also, could you share which workflow you ran and whether you made any modifications to the default settings?

JonathanSchmidt1 commented 8 months ago

I used the StaticMaker (so no relaxations; I was just talking about electronic steps) with modified INCAR, POTCAR, and KPOINTS. I can share the files and FW.jsons with the parameter updates. How would you like me to share the files, as they are too large for GitHub or email? gdrive?
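For reference, customising the static job in atomate2 looks roughly like this (the specific INCAR/KPOINTS/POTCAR values below are placeholders, not the settings used in this issue):

```python
from atomate2.vasp.jobs.core import StaticMaker
from atomate2.vasp.sets.core import StaticSetGenerator

# Placeholder settings for illustration only.
static_maker = StaticMaker(
    input_set_generator=StaticSetGenerator(
        user_incar_settings={"ENCUT": 520, "NELM": 500},
        user_kpoints_settings={"reciprocal_density": 100},
        user_potcar_settings={"Fe": "Fe_pv"},
    )
)
static_job = static_maker.make(structure)  # `structure` is a pymatgen Structure
```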

utf commented 8 months ago

In that case, I'm quite surprised that the task document is so large. I wonder if it is the orbital projections that are making the task document so big. If you could upload the files to gdrive and share them with me over email (a [dot] ganose [at] imperial [dot] ac [dot] uk), that would be great!

JonathanSchmidt1 commented 8 months ago

I sent you an email with two examples. Thank you for taking a look. Checking the outputs of the successful calculations, the majority of the memory seems to be taken up by `['calcs_reversed'][0]['output']['outcar']['onsite_density_matrices']`. I assume the onsite density matrices are saved for every electronic step, so they become larger for the longer calculations; e.g., reading them from the OUTCAR for the failed example, they already come to 12 MB. Is there anything else in the output that scales with the number of electronic steps?
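A quick way to confirm which fields dominate a stored task document (a rough sketch; the store and query details are illustrative):

```python
import json

def field_sizes(doc: dict) -> dict[str, int]:
    """Approximate serialised size in bytes of each top-level field."""
    return {key: len(json.dumps(value, default=str)) for key, value in doc.items()}

# e.g. with a jobflow JobStore called `store` and a known job uuid:
result = store.query_one({"uuid": "..."}, load=True)
for field, size in sorted(field_sizes(result["output"]).items(), key=lambda kv: -kv[1]):
    print(f"{field}: {size / 1e6:.2f} MB")
```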

mkhorton commented 8 months ago

I've seen this before. I think this field is not stored intentionally, but rather as a consequence of pymatgen parsing the data via the Outcar object and everything being stored by default, since it is not otherwise specified in the schema.

I cannot find a reference to the data being used anywhere in the MP stack (atomate v1 or v2, emmet), except indeed to remove the fields if the document is too large (this looks like @tschaume's work). If someone can confirm these aren't being used and can't think of a motivating reason to keep them, perhaps a sensible solution is to disable storing them by default and add a kwarg to the VaspDrone in emmet instead, for the users for whom it's important?

Otherwise, perhaps the TaskDoc could be slimmed down and more data stored via the data_store instead?

mkhorton commented 8 months ago

> Is there anything else in the output that scales with the number of electronic steps?

I like the idea of annotating the TaskDoc schema with how fields scale with system size / number of steps too; it might be helpful. It does happen that people see issues with e.g. many ionic steps too.

utf commented 8 months ago

Thanks very much for sharing the files. Agreed that the issue is entirely due to `calcs_reversed.0.output.outcar.onsite_density_matrices`, which accounts for almost all of the 18-22 MB document size.

The actual size of the remaining task doc is tiny in comparison (~0.4 MB in both cases). As @mkhorton said, the onsite_density_matrices field is not explicitly listed in the task document schema; it enters only because we store the entire serialised Outcar object.

The main options seem to be:

  1. Store onsite_density_matrices in the data store (i.e., add it to the list of data objects).
  2. Remove onsite_density_matrices from the TaskDoc entirely, since it seems like we never explicitly asked for it.

If there is no objection, I would prefer option 2. This also raises the question of whether we should be blindly storing everything from the Outcar object, or explicitly selecting what we'd like to store. I'll note that the blind storage approach has been useful to me in the past: I was able to extract core-state eigenvalues from the outcar field, which likely wouldn't have been explicitly selected since they aren't generally useful.
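An explicit-selection approach might look something like the sketch below (the allowlist is purely illustrative; deciding which Outcar fields to keep is exactly the open question):

```python
from pymatgen.io.vasp.outputs import Outcar

# Illustrative allowlist; the real selection would need agreement.
_OUTCAR_FIELDS_TO_KEEP = {"run_stats", "magnetization", "charge", "efermi"}

def slim_outcar_dict(outcar: Outcar) -> dict:
    """Serialise an Outcar, keeping only an explicit allowlist of fields."""
    full = outcar.as_dict()
    return {key: value for key, value in full.items() if key in _OUTCAR_FIELDS_TO_KEEP}
```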

utf commented 8 months ago

@mkhorton I just looked at the list of fields that are removed in the emmet CLI if the task document is too large:

- `normalmode_eigenvecs`
- `force_constants`
- `outcar.onsite_density_matrices`

Of these, we already store the first two in the data store. See: https://github.com/materialsproject/atomate2/blob/af667d8385da42e532a72f77021c23c17c62ac80/src/atomate2/vasp/jobs/base.py#L39-L52

However, IMO the first two are more immediately useful for phonon workflows, whereas I'm not sure whether anyone uses onsite_density_matrices.
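If option 1 were chosen instead, the change would presumably be a one-line addition to that list (existing entries paraphrased from the linked lines, not copied exactly):

```python
# src/atomate2/vasp/jobs/base.py (sketch): objects routed to the data store
# rather than the docs store when a VASP job's output is saved.
_DATA_OBJECTS = [
    BandStructure,
    BandStructureSymmLine,
    DOS,
    Dos,
    CompleteDos,
    Locpot,
    Chgcar,
    Wavecar,
    Trajectory,
    "force_constants",
    "normalmode_eigenvecs",
    "onsite_density_matrices",  # proposed addition
]
```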

tschaume commented 8 months ago

Yeah, we've been removing these 3 fields from the task docs for a few years now and have never had an issue come up where we thought that it would have been better to keep them :) Since the first two are already in the data store, it seems that it would make sense to put the onsite_density_matrices in there, too.

mkhorton commented 8 months ago

I agree with removing it (ideally, perhaps configurable).

In terms of the "blind storage" approach, I agree I've often found data useful that I didn't initially intend to store. Perhaps a compromise is to explicitly specify these fields in the schema, so that if a new field is added later it triggers a validation error and we can make an explicit decision about whether to include it. This would avoid cases where we're accidentally storing additional fields.
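Sketched with pydantic, that compromise could look like this (model and field names are illustrative, not emmet's actual schema):

```python
from pydantic import BaseModel, ConfigDict

class OutcarSummary(BaseModel):
    # extra="forbid" makes any undeclared field raise a ValidationError,
    # so adding a new field requires an explicit schema change.
    model_config = ConfigDict(extra="forbid")

    efermi: float | None = None
    run_stats: dict | None = None

OutcarSummary(efermi=1.0)  # fine
# OutcarSummary(efermi=1.0, onsite_density_matrices=[])  # ValidationError
```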

utf commented 8 months ago

@JonathanSchmidt1, would you be willing to submit a PR to emmet making the storage of onsite_density_matrices optional and turned off by default?
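Such a PR might expose a flag along these lines (names are hypothetical; the real change would live in emmet's VASP calculation schema):

```python
def outcar_to_doc(outcar, store_onsite_density_matrices: bool = False) -> dict:
    """Serialise an Outcar, dropping the bulky field unless explicitly requested."""
    doc = outcar.as_dict()
    if not store_onsite_density_matrices:
        doc.pop("onsite_density_matrices", None)
    return doc
```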