Design of a universal base DFT task document

utf commented 9 months ago

A nice feature of atomate2 is the universal workflows which can be run using multiple DFT calculators (e.g., elastic, phonon, defect workflows). As the number of DFT codes grows (currently including VASP, CP2K, FHI-aims, Abinit, QChem), there is a need to standardise the output task documents to maintain interoperability.

I'm opening this PR so we can start a discussion about what fields should be included in the base schema. Obviously, we should not break the existing VASP Task Document, so I've listed the fields below as a starting point. This issue is related to #145.

Some open questions:

How much should the schemas for molecular and crystal DFT packages overlap?
How do we handle useful output properties such as dielectric constants, and born effective charges (needed to standardise phonon workflow) which are currently in calcs reversed?

TaskDoc:
- tags
- dir_name
- state
- calcs_reversed: Calculation (VASP specific)
- structure
- task_type
- task_id
- orig_inputs
  - incar
  - poscar
  - kpoints
  - potcar
- input
  - structure
  - parameters
  - pseudo_potentials
  - potcar_spec
  - xc_override
  - is_lasph
  - is_hubbard
  - hubbards
- output
  - structure
  - density
  - energy
  - forces
  - stress
  - energy_per_atom
  - bandgap
- included_objects
- vasp_objects
- entry
- task_label
- icsd_id
- transformations
- additional_json
- custodian
  - corrections
  - job
- analysis
  - delta_volume
  - delta_volume_percent
  - max_force
  - warnings
  - errors
- last_updated

tsmathis commented 6 months ago

@utf , we merged in some of @esoteric-ephemera's work on this in emmet (#971) yesterday as a checkpoint for me to start working on up-versioning the schema of all the task docs MP's existing tasks collection.

A couple outstanding questions on my end:

What is the purpose of an additional_json field?
The included_objects and vasp_objects fields seem to duplicate info to some degree? Maybe I'm just missing the use case, but couldn't the enum field + list of entries field be combined into a single key, value field?

The fields in question, for quick reference:

    included_objects: Optional[List[VaspObject]] = Field(
        None, description="List of VASP objects included with this task document"
    )
    vasp_objects: Optional[Dict[VaspObject, Any]] = Field(
        None, description="Vasp objects associated with this task"
    )

utf commented 6 months ago

Hi @tsmathis. Great to hear this is being picked up.

The additional_json field can be used to store any additional json files that are present in the calculation directory. We already have specific fields for transformation and custodian json files, but this is a catch all field for anything else.
The reason for having included_object was to make querying easier. E.g., {"included_objects": "dos"} to query all tasks with density of states. I suppose this could also be achieved with {"vasp_objects.dos": {"$exists" : True}}. But my suggestion is to leave this in place, since it has been in production in atomate2 for a while and people may rely on it for their workflows/analysis scripts.

tsmathis commented 6 months ago

Okay, got it.

For 1., I am going to take a pretty hard stance against allowing users to inject arbitrary data into MP's base task doc. I would rather lean towards first pushing users towards creating a new derived document schema for their workflow. And if the workflow becomes pervasive enough to be a default, MP could then add the new fields explicitly to the base task doc schema (e.g., transformations, etc.). TLDR: I am open to adding fields, but I want them to be explicit.

For 2. I agree that ease of query-ability is important, so I am somewhat neutral. But @esoteric-ephemera I think has an idea of how he might want those fields populated/formatted to reduce duplication

utf commented 6 months ago

For 1, that's perfectly fine for MP but I believe that should be handled during ingestion of task docs, rather than as limitation to atomate2. Just to note that ALL vasp jobs have the same schema, even for different calculation types, so creating a new schema just for one specific VASP calculation wouldn't be a realistic solution. This is already a feature I use for some of my own workflows, so I'd rather it not be removed.

A general point is that these task documents have been in use in atomate2 for quite some time. I would strongly recommend that you don't make any breaking changes to avoid disrupting existing workflows.

JaGeo commented 6 months ago

I would also like to emphasize that the atomate2 use cases go much beyond MP data creation, even though this is indeed a very important use case. Thus, I would also recommend that major breaking chances should only be done very carefully

esoteric-ephemera commented 6 months ago

Hi @utf and @JaGeo:

For (1), since the TaskDoc schema in emmet allows extra fields, the user could add an additional_json top-level attr to TaskDoc even if it's not present in the base schema. From the MP side, I think it'd be preferable to move additional_json to atomate2.

No breaking changes then, but additional fields would be populated for atomate2 calcs that might later be stripped on ingestion by MP

For (2), I think it makes sense to have included_objects be a __post_init__ that is populated by the vasp_objects field. The queryability is nice but it doesn't necessarily make sense to input the former field by hand when the latter is populated by the drone

There's already similar logic implemented for atomate2's ForcefieldTaskDocument

utf commented 6 months ago

Agreed about point 2, that makes a lot of sense.

My issue with point 1 is that we have a schemas for several reasons: i) to have documentation of what the fields contain, ii) to have a way of looking up what fields are present, and iii) to validate the task document. This suggestion breaks all of those reasons.

Can I ask the motivation for removing it?

tsmathis commented 6 months ago

On MP's side, we are trying 1) to standardize our schema and 2) transition to using TaskDoc for everything. Including updating the schema for the existing tasks collection on MP ahead of our full re-compute. The TaskDoc schema has downstream effects on how MP is now parsing and storing the task documents as well as using them in builders (hence why we are digging into this now).

A lot of the legacy tasks in MP's collection use outdated schemas with missing fields - none of these have additional_json fields.

So, yes, we don't want to inhibit atomate2 users from extending the model for their own use cases, as was mentioned about the extra="allow" in the base TaskDoc model, and so removing the additional_json field won't break things on the atomate2 user side.

For your points (And my thoughts here are strictly from the emmet side, since the emmet base is TaskDoc): i. "to have documentation of what the fields contain" from the additional_json doc string:

  additional_json: Optional[Dict[str, Any]] = Field(
      None, description="Additional json loaded from the calculation directory"
  )

That doesn't actually explain what is contained in that field, since anything JSONable is allowed.

ii. "to have a way of looking up what fields are present". I don't understand what the value of knowing that a catch-all additional_json field is present provides. Are we supposed to manually check each document with and additional_json, see what it's in there, then proceed based on the contents?

iii. "to validate the task document." If anything is allowed in the additional_json field, how can it be used to validate the task document?

And to be clear, I do agree MP should be handling this during ingestion, which is why I would push for the additional_json field to be removed from emmet, since that is the MP side and we could then do what would amount to "populate MP's task doc with only the fields defined here, in emmet"

And correct me if I'm wrong, but the theme of this gh issue is to define a common code agnostic schema for atomate2 workflows? So then shouldn't that schema inherit from the emmet classes/base schemas?

utf commented 6 months ago

Just to confirm, is the main reason you want to remove it for cosmetic reasons?

I take your point about validation but I disagree with the other reasons. The description describes what the field should contain, and I often navigate the schemas on emmet to find where a particular field lives.

A broader issue I have is that we have entrusted a lot of power to emmet by putting the task documents there. Atomate2 is already constantly breaking because someone changed something in pymatgen or emmet without considering the downstream consequences (and for this reason, I'm very appreciative that you've posted your questions on this thread). I'd really like to avoid any unnecessary changes to such a core part of the atomate2 code base. Modifying things for cosmetic reasons is not a strong justification in my opinion.

esoteric-ephemera commented 6 months ago

Hi @utf, the main reasons for removing the additional_json field are:

This field will be empty for all tasks that exist in MP, and will be empty for all new tasks that get added to MP
Whereas other fields like forces have a fixed scheme that permits easy validation, the additional_json field has a very free-form scheme (anything JSON-compliant). That makes verifying the contents of additional_json as being relevant and beneficial for some aspect of the calculation challenging. This is very relevant for MP as we prepare to accept outside data (pending validation etc.)

Since the emmet BaseModel permits extra model fields, it seems mostly harmless to move the additional_json field to a (code-agnostic) atomate2 TaskDoc subclass of emmet's TaskDoc.

I'll also note there are identical parse_additional_json functions in atomate2 and emmet, so making a decision on this field also reduces code duplication.

materialsproject / atomate2

Design of a universal base DFT task document #741