theochem / iodata

Python library for reading, writing, and converting computational chemistry file formats and generating input files.
https://iodata.readthedocs.io/
GNU Lesser General Public License v3.0

Support for NOMAD JSON? #223

Open PaulWAyers opened 3 years ago

PaulWAyers commented 3 years ago

We should consider supporting the NOMAD database's JSON format. This format has several advantages, chiefly its native interoperability with the NOMAD database and the large number of parsers and Python utilities it makes available to us.

I had thought about making sure NOMAD supports QCSchema, and that would work. But we could also support the NOMAD format directly. I'm not sure which is better. NOMAD may be better for storing wavefunction data.

You can see more about the types of fields in the NOMAD schema here:
https://nomad-lab.eu/prod/rae/test/gui/metainfo
https://nomad-lab.eu/prod/rae/docs/metainfo.html
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-meta-info

Because JSON is big (and slow), NOMAD uses an internal binary format called MessagePack: https://msgpack.org/

Toon suggested that msgpack should also be considered for the QCSchema in IOData.
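To make the size argument concrete, here is a minimal sketch comparing a JSON encoding of floating-point data with a fixed-width binary encoding. It uses only the standard library: `struct` stands in for a binary format and is not MessagePack itself (MessagePack stores a float64 in a comparable fixed 9 bytes). The `coeffs` list is made-up data, not taken from any real calculation.

```python
import json
import struct

# Made-up payload: 1000 double-precision numbers, standing in for
# something like orbital coefficients.
coeffs = [(-1) ** i / (i + 3) for i in range(1000)]

# Text encoding: JSON writes every float as a decimal string.
json_bytes = json.dumps(coeffs).encode("utf-8")

# Binary encoding: a 4-byte length prefix followed by 8 bytes per float64.
# "<" selects little-endian byte order without alignment padding.
binary_bytes = struct.pack(f"<I{len(coeffs)}d", len(coeffs), *coeffs)

# Round trip: read the length prefix, then the floats.
(count,) = struct.unpack_from("<I", binary_bytes, 0)
decoded = list(struct.unpack_from(f"<{count}d", binary_bytes, 4))

print(f"JSON:   {len(json_bytes)} bytes")
print(f"binary: {len(binary_bytes)} bytes")
print("lossless round trip:", decoded == coeffs)
```

The binary form is both smaller and exact, since no decimal formatting or parsing is involved. Whether this matters in practice depends on how much array data (e.g. wavefunction coefficients) a file carries.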

I think @wilhadams may be well-positioned to assess pros vs. cons of QCSchema vs NomadSchema.

PaulWAyers commented 3 years ago

@wilhadams looked at this and came away impressed. We should definitely try to support reading/dumping NOMAD JSON. We might also look at helping NOMAD parse QCSchema, but it seems like NOMAD is far ahead of MolSSI on this one. It might be that the conversion (through IOData) of NOMAD to QCSchema is more helpful.
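As a rough illustration of what such a conversion involves, here is a sketch that maps a NOMAD-style system entry onto a QCSchema molecule dictionary. It does not use IOData or any NOMAD package. The function `nomad_system_to_qcschema` is hypothetical, and the input keys `atom_labels` and `atom_positions` (positions in meters) are assumptions about the NOMAD layout that should be checked against the metainfo. The output keys follow the QCSchema molecule layout, which expects a flat geometry in Bohr.

```python
BOHR_IN_METERS = 5.29177210903e-11  # CODATA 2018 Bohr radius


def nomad_system_to_qcschema(system, charge=0, multiplicity=1):
    """Convert a NOMAD-style system dict to a QCSchema molecule dict.

    Assumes `system` holds "atom_labels" (element symbols) and
    "atom_positions" (one [x, y, z] triple per atom, in meters).
    Charge and multiplicity are passed in explicitly because this
    sketch does not assume where NOMAD stores them.
    """
    symbols = list(system["atom_labels"])
    # QCSchema wants a flat list [x1, y1, z1, x2, ...] in Bohr.
    geometry = [
        coord / BOHR_IN_METERS
        for position in system["atom_positions"]
        for coord in position
    ]
    return {
        "schema_name": "qcschema_molecule",
        "schema_version": 2,
        "symbols": symbols,
        "geometry": geometry,
        "molecular_charge": charge,
        "molecular_multiplicity": multiplicity,
    }


# Made-up example: H2 with a bond length of 0.74 angstrom.
nomad_system = {
    "atom_labels": ["H", "H"],
    "atom_positions": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74e-10]],
}
molecule = nomad_system_to_qcschema(nomad_system)
print(molecule["symbols"], molecule["geometry"])
```

The unit conversion and the flattening of the coordinate array are the only non-trivial steps for a bare geometry. Wavefunction data would need a much more careful mapping.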

We should also investigate (perhaps the answer is obvious) whether a NOMAD-compatible JSON can be directly uploaded to/downloaded from the NOMAD database.

PaulWAyers commented 3 years ago

We should also probably think about supporting the Materials Project: https://materialsproject.org/ Their pymatgen utility supports Gaussian and VASP (among others), so it should not be too hard for us to use. I didn't figure out how they store data; it seems to be an object, but there is also a befuddling .json file in there somewhere. https://github.com/materialsproject/pymatgen They mention that they are in the middle of a major refactor, so maybe it will be better soon. Right now NOMAD seems a lot better structured to me.

EDIT: It seems that Materials Project data is a subset of NOMAD data. So NOMAD should suffice.

tovrstra commented 3 years ago

I could not easily determine whether pymatgen also has its own serialization format like QCSchema or NOMAD. It seems they mainly use existing formats and the REST API of the Materials Project.

NOMAD is indeed impressive. It seems extensive in principle, but I'd have to try it to see how it works. Many of the entries in MetaInfo are not used yet in the database, so it is a bit difficult to see exactly how it works. Anyway, it is certainly worth trying. It could be a good place to upload databases of QC results, which we use for benchmarking.