Open utf opened 9 months ago
@utf , we merged in some of @esoteric-ephemera's work on this in emmet (#971) yesterday as a checkpoint for me to start working on up-versioning the schema of all the task docs MP's existing tasks collection.
A couple outstanding questions on my end:
additional_json
field? included_objects
and vasp_objects
fields seem to duplicate info to some degree? Maybe I'm just missing the use case, but couldn't the enum field + list of entries field be combined into a single key, value field?The fields in question, for quick reference:
included_objects: Optional[List[VaspObject]] = Field(
None, description="List of VASP objects included with this task document"
)
vasp_objects: Optional[Dict[VaspObject, Any]] = Field(
None, description="Vasp objects associated with this task"
)
Hi @tsmathis. Great to hear this is being picked up.
additional_json
field can be used to store any additional json files that are present in the calculation directory. We already have specific fields for transformation and custodian json files, but this is a catch all field for anything else.included_object
was to make querying easier. E.g., {"included_objects": "dos"}
to query all tasks with density of states. I suppose this could also be achieved with {"vasp_objects.dos": {"$exists" : True}}
. But my suggestion is to leave this in place, since it has been in production in atomate2 for a while and people may rely on it for their workflows/analysis scripts.Okay, got it.
For 1., I am going to take a pretty hard stance against allowing users to inject arbitrary data into MP's base task doc. I would rather lean towards first pushing users towards creating a new derived document schema for their workflow. And if the workflow becomes pervasive enough to be a default, MP could then add the new fields explicitly to the base task doc schema (e.g., transformations, etc.). TLDR: I am open to adding fields, but I want them to be explicit.
For 2. I agree that ease of query-ability is important, so I am somewhat neutral. But @esoteric-ephemera I think has an idea of how he might want those fields populated/formatted to reduce duplication
For 1, that's perfectly fine for MP but I believe that should be handled during ingestion of task docs, rather than as limitation to atomate2. Just to note that ALL vasp jobs have the same schema, even for different calculation types, so creating a new schema just for one specific VASP calculation wouldn't be a realistic solution. This is already a feature I use for some of my own workflows, so I'd rather it not be removed.
A general point is that these task documents have been in use in atomate2 for quite some time. I would strongly recommend that you don't make any breaking changes to avoid disrupting existing workflows.
I would also like to emphasize that the atomate2 use cases go much beyond MP data creation, even though this is indeed a very important use case. Thus, I would also recommend that major breaking chances should only be done very carefully
Hi @utf and @JaGeo:
For (1), since the TaskDoc schema in emmet allows extra fields, the user could add an additional_json
top-level attr to TaskDoc
even if it's not present in the base schema. From the MP side, I think it'd be preferable to move additional_json
to atomate2.
No breaking changes then, but additional fields would be populated for atomate2 calcs that might later be stripped on ingestion by MP
For (2), I think it makes sense to have included_objects
be a __post_init__
that is populated by the vasp_objects
field. The queryability is nice but it doesn't necessarily make sense to input the former field by hand when the latter is populated by the drone
There's already similar logic implemented for atomate2's ForcefieldTaskDocument
Agreed about point 2, that makes a lot of sense.
My issue with point 1 is that we have a schemas for several reasons: i) to have documentation of what the fields contain, ii) to have a way of looking up what fields are present, and iii) to validate the task document. This suggestion breaks all of those reasons.
Can I ask the motivation for removing it?
On MP's side, we are trying 1) to standardize our schema and 2) transition to using TaskDoc for everything. Including updating the schema for the existing tasks collection on MP ahead of our full re-compute. The TaskDoc schema has downstream effects on how MP is now parsing and storing the task documents as well as using them in builders (hence why we are digging into this now).
A lot of the legacy tasks in MP's collection use outdated schemas with missing fields - none of these have additional_json
fields.
So, yes, we don't want to inhibit atomate2 users from extending the model for their own use cases, as was mentioned about the extra="allow"
in the base TaskDoc model, and so removing the additional_json
field won't break things on the atomate2 user side.
For your points (And my thoughts here are strictly from the emmet side, since the emmet base is TaskDoc):
i. "to have documentation of what the fields contain"
from the additional_json
doc string:
additional_json: Optional[Dict[str, Any]] = Field(
None, description="Additional json loaded from the calculation directory"
)
That doesn't actually explain what is contained in that field, since anything JSONable is allowed.
ii. "to have a way of looking up what fields are present".
I don't understand what the value of knowing that a catch-all additional_json
field is present provides. Are we supposed to manually check each document with and additional_json
, see what it's in there, then proceed based on the contents?
iii. "to validate the task document."
If anything is allowed in the additional_json
field, how can it be used to validate the task document?
And to be clear, I do agree MP should be handling this during ingestion, which is why I would push for the additional_json
field to be removed from emmet, since that is the MP side and we could then do what would amount to "populate MP's task doc with only the fields defined here, in emmet"
And correct me if I'm wrong, but the theme of this gh issue is to define a common code agnostic schema for atomate2 workflows? So then shouldn't that schema inherit from the emmet classes/base schemas?
Just to confirm, is the main reason you want to remove it for cosmetic reasons?
I take your point about validation but I disagree with the other reasons. The description describes what the field should contain, and I often navigate the schemas on emmet to find where a particular field lives.
A broader issue I have is that we have entrusted a lot of power to emmet by putting the task documents there. Atomate2 is already constantly breaking because someone changed something in pymatgen or emmet without considering the downstream consequences (and for this reason, I'm very appreciative that you've posted your questions on this thread). I'd really like to avoid any unnecessary changes to such a core part of the atomate2 code base. Modifying things for cosmetic reasons is not a strong justification in my opinion.
Hi @utf, the main reasons for removing the additional_json
field are:
forces
have a fixed scheme that permits easy validation, the additional_json
field has a very free-form scheme (anything JSON-compliant). That makes verifying the contents of additional_json
as being relevant and beneficial for some aspect of the calculation challenging. This is very relevant for MP as we prepare to accept outside data (pending validation etc.)Since the emmet BaseModel permits extra model fields, it seems mostly harmless to move the additional_json
field to a (code-agnostic) atomate2 TaskDoc
subclass of emmet's TaskDoc
.
I'll also note there are identical parse_additional_json
functions in atomate2 and emmet, so making a decision on this field also reduces code duplication.
A nice feature of atomate2 is the universal workflows which can be run using multiple DFT calculators (e.g., elastic, phonon, defect workflows). As the number of DFT codes grows (currently including VASP, CP2K, FHI-aims, Abinit, QChem), there is a need to standardise the output task documents to maintain interoperability.
I'm opening this PR so we can start a discussion about what fields should be included in the base schema. Obviously, we should not break the existing VASP Task Document, so I've listed the fields below as a starting point. This issue is related to #145.
Some open questions: