Parsing of .magres files

jkshenton commented 8 months ago

.magres files are output by both Quantum Espresso's GIPAW and CASTEP when NMR calculations are run. It would be great to have NOMAD be able to parse them.

A parser already exists in ASE, for reference: https://gitlab.com/ase/ase/-/blob/master/ase/io/magres.py

The specification of the file format exists here: https://www.ccpnc.ac.uk/docs/magres/magres-format.pdf

JosePizarro3 commented 8 months ago

Hi @jkshenton thanks a lot for getting in touch. We will be happy to implement this.

Just to speed up the development, do you have any input/output example files at hand that we can have a look at directly? I tried looking but couldn't find any examples. Furthermore, the more complete the folders / files you can share with us, the better, as we can prepare the parsing and cross-reference with ASE :-)

jkshenton commented 8 months ago

Thanks for the super fast reply and for taking up this implementation!

CCP-NC has a repository of many thousands of such .magres files: https://www.ccpnc.ac.uk/database/

I grabbed a few random examples from there and put them in the attached tarball. The ethanol.magres file is a particularly comprehensive example.

magres_examples.tar.gz

Let me know if you would like any more information to make the implementation easier.

JosePizarro3 commented 8 months ago

Perfect, thanks a lot for the share.

This is very interesting, we were not aware of such an initiative. I will take a look at the details of the project and possibly come back to you with some questions, if that's fine. We can further discuss whether you want to share the data in the database in NOMAD, and how we can help each other with computational or experimental data.

jkshenton commented 8 months ago

We have recently been thinking about ways to make a version 2 of our NMR database more FAIR, including some integration/sharing with databases such as the NOMAD one, so we would be very happy to discuss this!

JosePizarro3 commented 8 months ago

Hi @jkshenton ,

I am coming back to this issue to let you know we are starting to work on the magnetic properties support in NOMAD (you can check a recent issue opened in #174).

I think before starting to work on the magres parser, it is a good idea if we can meet on Zoom, let's say 30min - 1h, so that we can understand your goals, how to merge the #174 idea with yours, and how NOMAD can help. Furthermore, I would like to discuss the workflows typically done in magres calculations, how to integrate this, and how the data in the CCP-NC is structured (and how it compares with NOMAD).

What do you think? We can also talk by email (jose.pizarro@physik.hu-berlin.de) and organize the meeting by private email. Whatever feels more comfortable for you 🙂

jkshenton commented 8 months ago

Happy to meet and discuss our goals - I've just sent an email to arrange that.

A bit more context here to help with a discussion:

magres files are a structured text file format that contains primarily a) a crystal structure and b) NMR-related results

The NMR-related results can include site-based (e.g. magnetic shielding and electric field gradient) tensors, pair-wise (e.g. J-couplings) tensors or global (e.g. magnetic susceptibility) tensors. Each quantity is reported along with its units.

The file can also have a [calculation] block that contains some metadata about the DFT parameters (e.g. XC functional) used.
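
To make the block structure described above concrete, here is a minimal Python sketch that splits a magres-like text into its named blocks and collects the units records. The sample text is hand-written and heavily truncated for illustration, and the helper names (split_blocks, read_units) are hypothetical; the ASE reader and the format specification linked at the top of the thread remain the authoritative references.

```python
import re

# Hand-written, truncated sample for illustration only. Real files carry
# more records (lattice, symmetry, further tensors); see the specification.
SAMPLE = """\
#$magres-abinitio-v1.0
[calculation]
calc_code CASTEP
calc_xcfunctional PBE
[/calculation]
[atoms]
units lattice Angstrom
units atom Angstrom
atom H H 1 0.0 0.0 0.0
[/atoms]
[magres]
units ms ppm
ms H 1 30.0 0.0 0.0 0.0 30.0 0.0 0.0 0.0 30.0
[/magres]
"""

# A block is everything between [name] and the matching [/name].
BLOCK_RE = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)


def split_blocks(text):
    """Return {block_name: [non-empty, stripped lines]} for each block."""
    blocks = {}
    for name, body in BLOCK_RE.findall(text):
        blocks[name] = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return blocks


def read_units(lines):
    """Collect 'units <quantity> <unit>' records of one block into a dict."""
    units = {}
    for line in lines:
        parts = line.split()
        if parts[0] == "units":
            units[parts[1]] = parts[2]
    return units


blocks = split_blocks(SAMPLE)
print(sorted(blocks))
print(read_units(blocks["magres"]))
```

Keeping the units records next to the data they describe mirrors the point above that each quantity is reported along with its units.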

Although a magres file in isolation is very useful for post-processing and sharing the results of first-principles solid-state NMR calculations, we would ideally like to provide more context in our (/your) database in the future. The workflow would typically be something like:

geometry optimisation of a crystal structure
SCF calculation
magres calculation

In terms of our goals: we're currently in the planning stage of a major re-development of our database stack and we're looking at different options to improve the value the database provides to the solid-state NMR community. This includes better search/filtering functionality, better metadata capture (including workflow context) and some data visualisation options through integration of some of our other python and javascript tools.

JosePizarro3 commented 8 months ago

Very good. I think the workflow can be covered with the current NOMAD infrastructure, albeit some details we can discuss over Zoom (like which files should be included in the upload for these).

In terms of our goals: we're currently in the planning stage of a major re-development of our database stack and we're looking at different options to improve the value the database provides to the solid-state NMR community. This includes better search/filtering functionality, better metadata capture (including workflow context) and some data visualisation options through integration of some of our other python and javascript tools.

Then, FAIRmat can help on this. I am speaking internally with some engineers to see whether they can join the discussion. But for a first meeting, we can definitely sit and see what the best options are; maybe, @ladinesa are you available for joining the discussion? If so, I will send you the emails for the Zoom.

ladinesa commented 8 months ago

Thanks for including me in the discussion. Is the date already set? I will be on holiday next week, so it would be great if we schedule it this week.

ladinesa commented 8 months ago

https://github.com/dceresoli/qe-gipaw

JosePizarro3 commented 8 months ago

Brief summary of our meeting:

The main goal is to improve the website for CCP-NC NMR database: FAIR-compliant metadata, improved searchability (system information and NMR properties searches), and including visualizations.
There are several options, but without wanting to re-invent the wheel, we talked about using NOMAD as a platform for the FAIR metadata and searchability, and use the CCP-NC website for front-end. @ladinesa how do you envision this point? How should the CCP-NC web be used and be compatible with the central NOMAD?
Developing NOMAD in the following main steps: 1) initial magres parser for inputs (system, method) and outputs (calculation), 2) link between magres and QuantumESPRESSO/CASTEP if the files are present in the datapoint, 3) add searchability (which methodological strings or numerical quantities can be defined?), defining properties.magnetic in NOMAD, 4) add an app menu in the Explore tab in NOMAD for NMR data (what about experimentalists? Can we convince some groups to use NOMAD?), 5) add visualization, probably based on MagresView, 6) update the CCP-NC page with these changes (see previous point).
Give support to legacy data in their database. Add functionality for workflows QuantumESPRESSO/CASTEP -> magres.
Development could be done by Sanya / Kane, with help of Alvin / Jose.
After sharing more data and the report to the CCP tomorrow (01.11.2023), we will plan the next steps more in detail.

I think these bullet points summarize the meeting. Feel free to add or ask anything.

jkshenton commented 8 months ago

Thanks for sharing your summary! I think you captured the essential bits.

For the magresview visualiser, I would rather link to our custom "2.0" version which essentially completely replaces the previous JMOL-based version.

For the workflow / link between different DFT output files, I've attached a tarball with a very basic two step procedure that might be typical of the sort of ssNMR calculations with CASTEP that one might upload to NOMAD: 1. a geometry optimisation (seedname ethanol_geom) followed by 2. an NMR calculation (seedname ethanol_nmr). The latter produces a .magres file. castep_workflow_nmr.tar.gz

JosePizarro3 commented 8 months ago

Thanks a lot, this is indeed what is needed to fully develop the parser 👍🏻 If you have more examples, do not hesitate to share them with us; the more, the better, as this will help us prepare for other options.

Now, @jkshenton @ladinesa I was wondering about the workplan: I think we (either Alvin or myself) can develop the initial version of the parser. Then, in the long term and if you are convinced of using NOMAD, it is better if you (or Sathya) take over maintaining the parser. I was very recently discussing with other devs, and you could even think, in the longer term, about using the developed parser as an I/O wrapper for your applications (without the need of having scripts all over the place).

Let me know what you think. If agreed, I suggest you star this repository, and I will keep you informed of important changes that affect you.

P.S.: should we also tag Sathya's Github profile?

jkshenton commented 8 months ago

Your proposed workplan sounds good to me - thanks!

As we mentioned before, the broader context would be that we would like to be able to easily (=via dashboard/API) access NMR data from any of the DFT codes that compute it. These include (non-exhaustive list):

Parsing magres files is a very useful first step towards this, since they have been adopted by two of the major DFT NMR codes (CASTEP and QE) and the specification for the file format introduces the rationale behind the structure of key bits of NMR data. There's also an accompanying JSON schema, in case that is helpful.

In terms of using the nomad parser as an I/O wrapper - I am all for re-using code and well-built libraries, though I would note the ongoing development of a standalone CASTEP parsing library to play such a role: https://github.com/oerc0122/castep_outputs The idea behind that one is that it will be eventually integrated with the CASTEP code test suite/CI workflow and thereby (hopefully) maintained by the CASTEP developers.

Good idea to tag @Sathya-S3

JosePizarro3 commented 7 months ago

Very good. I will work on the schema and parser in mid December. Sorry, I am going on holiday for two weeks.

In terms of using the nomad parser as an I/O wrapper - I am all for re-using code and well-built libraries, though I would note the ongoing development of a standalone CASTEP parsing library to play such a role: https://github.com/oerc0122/castep_outputs The idea behind that one is that it will be eventually integrated with the CASTEP code test suite/CI workflow and thereby (hopefully) maintained by the CASTEP developers.

This is very interesting. We definitely have to join efforts here, as I don't see the point of maintaining several parsers for the same code and doubling the work 🙂 We will pay attention to when this is integrated in CASTEP, but in the meantime, @ladinesa do you mind checking the repo and seeing how it compares with our current CASTEP parser?

ladinesa commented 7 months ago

Will create an interface to it in #184.

Sathya-S3 commented 7 months ago

Hi @ladinesa, I'm in the process of preparing a technical stack review document for the CCP-NC main working group. The goal is to present the different development options for the CCP-NC database website. It'd be valuable to know your thoughts as well, on the below section from @JosePizarro3's meeting notes, when time permits. Thank you very much.

The main goal is to improve the website for CCP-NC NMR database: FAIR-compliant metadata, improved searchability (system information and NMR properties searches), and including visualizations.

There are several options, but without wanting to re-invent the wheel, we talked about using NOMAD as a platform for the FAIR metadata and searchability, and use the CCP-NC website for front-end. @ladinesa how do you envision this point? How should the CCP-NC web be used and be compatible with the central NOMAD?

For reference @jkshenton

ladinesa commented 7 months ago

I refer to the approach we took with the other databases supported in NOMAD, e.g. Materials Project, AFLOW, OQMD. We would host your data in NOMAD and develop an app for a customised search of NMR data in central NOMAD. Regarding the CCP-NC website, you would start with a NOMAD Oasis deployment where you can further customise schema, visualisation etc. This will also enable the syncing of data with central NOMAD. Depending on the long-term goals of the project, you can then migrate to an independent infrastructure similar to the databases I have mentioned, providing only a link to the corresponding entry in NOMAD.

JosePizarro3 commented 6 months ago

Hi @jkshenton @Sathya-S3

Just wanted to say that I am almost finished with the initial version of the parser for magres. Just had a couple of minor doubts:

I saw that efg and isc can be partitioned into different contributions with an extra tag that will appear as efg_{tag} and isc_{tag}. I wanted to know whether there are other options than "local" and "nonlocal" for the potential, and "fc", "orbital_p", "orbital_d" and "spin" for the spin couplings. I guess not, but I want to make 100% sure.
Are there any first principles NMR parameters or approximations you would like to cover in the schema? I understand this approximation is linear response, but maybe in the near future we can think about fully tracking the provenance of the settings. I just wanted to start the topic, so there is no real need to answer this now.
Just editing to add something I forgot: how do you parse the xcfunctional into magres? I have some potential settings for the XC functional in CASTEP and QE, but just want to know whether we are on the same page here. I can share these in detail.
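
The efg_{tag} / isc_{tag} naming raised in the first question can be sketched as a small lookup. This is only an illustration: the function name split_tag and the table layout are hypothetical, and the set of contribution tags is the one listed in the question, assumed to be complete.

```python
# Contribution suffixes as listed in the question above (assumed complete).
CONTRIBUTIONS = {
    "efg": {"local", "nonlocal"},
    "isc": {"fc", "orbital_p", "orbital_d", "spin"},
}


def split_tag(tag):
    """Split a record tag into (quantity, contribution).

    A tag without a suffix, such as 'ms' or 'efg', is returned with None
    as its contribution. An unrecognised suffix raises ValueError so that
    new tags are noticed rather than silently dropped.
    """
    base, sep, suffix = tag.partition("_")
    if not sep:
        return tag, None
    if suffix in CONTRIBUTIONS.get(base, set()):
        return base, suffix
    raise ValueError(f"unknown contribution tag: {tag!r}")


for tag in ("ms", "efg", "efg_local", "isc_orbital_p"):
    print(tag, "->", split_tag(tag))
```

Raising on unknown suffixes is a design choice for the sketch; a parser could equally log the tag and keep the raw record.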

Thanks!

jkshenton commented 6 months ago

Hi! Exciting - thanks for working on it!

Yes, those are all the tags that I can see in the CASTEP source at least.
While there are many parameters and approximation details that are definitely relevant to the NMR calculations (pseudopotentials, XC functional, k-point sampling, relativistic treatment etc.), I don't think there are particular ones that a user tinkers with. Maybe the key thing here is to mark these as coming from the GIPAW (Gauge Including Projector Augmented Waves) method. I have asked some experts in this field and will update this response depending on what they say.
So far we actually haven't actively parsed the calc_xcfunctional tag (we just store the lines as strings), but as I understand the CASTEP source, the first 'word' in the full xc_definition is what gets printed in the magres file. For QE, up until very recently, there was no XC functional information in their magres files. However, newer ones will have this information, following this commit. So their calc_xcfunctional will be the result of their get_dft_short() routine.

Hope that at least partially answers your questions (?).

JosePizarro3 commented 6 months ago

Great, thanks a lot. Let's then put the focus first on CASTEP, test it, and then extend the support for QE if you like it.

Yes, that would be interesting. I am very much interested in learning and developing the schema based on your opinions. I found some VASP documentation about NMR calculations that might shed some light on some extra parameters which might be interesting for NMR calculations (like DQ).
Ok, we are pretty much following libxc (I just saw that CASTEP is not included in the supporting codes) when it comes to parsing XC functional labels. But in the case of CASTEP, the NOMAD parser reads the .castep output and then does some mapping. So for CASTEP, the magres writer reads the input .param file and prints the first word to the [calculation] block?

Thanks once more! 🙂

JosePizarro3 commented 5 months ago

Hi @jkshenton @Sathya-S3 ,

I finished preparing a magres parser. I included the parsing of the quantities in your file format, and I managed to connect with the CASTEP i/o files if these are present in the upload.

I think it makes sense if you can check, with some examples, whether the parser works as you think it should. Then, we can set up another meeting to tackle more seriously how to integrate this parsing into your database. From my side, I think the best option would be for your CCP-NC database to be the front-end of whatever NMR data is stored in NOMAD, but I would be happy to hear your thoughts.

Sathya-S3 commented 4 months ago

@jryates Further to my email earlier today, I'm tagging you in this magres parser development thread to help move the conversation forward.

best wishes, Sathya.

jryates commented 4 months ago

Addressing a few comments further up the thread:

The J-coupling has a natural partitioning ("fc", "orbital_p", "orbital_d" and "spin") - so those exist in the magres-format. For other quantities there are other divisions one could propose - but these would depend on the details of the methodology used, and I think would be specific to a particular piece of analysis. It would seem impossible to account for all such possibilities. If I did code up some new partitioning scheme, I could get CASTEP to write out multiple magres files with different data - rather than invent new tags. This is a long-winded way of saying that I think you have all the correct tags.
The data in the magres is independent of the methodology. But there are different approaches to calculating NMR parameters in different codes - and that could be recorded in the header. I have a recent draft of an article which summarises the different approaches available if that is helpful (I can’t post - but could share it via email).
CASTEP has very few exposed parameters for NMR calculations. The dq parameter in VASP was mentioned. CASTEP has this internally, but we don’t expose it, as there is typically no reason to change it.
CASTEP can link to libxc and use its functionals - this is a fairly recent development. At the moment you have to specifically request it at compile time - and by default it will use CASTEP’s native xc routines. This may change in time. I guess we should check to see how the magres file handles a libxc defined functional.

jryates commented 4 months ago

From tests it seems that Nomad ignores the symmetry in the magres file - and recalculates it.
I realised that the magres file does not specify the relativistic approach used - and that is an oversight I should fix.

Sathya-S3 commented 4 months ago

Thank you for your work on including the magres parser in NOMAD. The magres parser looks good and works seamlessly. Parsing was quick, taking only a couple of seconds for each upload. We tested the magres parser with a few sample magres uploads (test uploads, not published) - one special inorganic material 'wadsleyite' and a well-known inorganic material 'coesite' (where we tested two variations of symmetry information in the magres file).

Comments and questions from testing

1. Adding to @jryates' comments above, we noticed from the coesite example that, even when we changed the symmetry information in the [atoms][/atoms] block, NOMAD ignored this change and still managed to work out the symmetry based on the crystal structure information.
2. We were happy that the metadata from the [calculation][/calculation] block and the data from the [magres][/magres] block were extracted and grouped correctly in the DATA section of the upload.
3. Magnetic shielding and electric field gradient data are extracted and displayed as expected. We noticed that both data are reachable starting from the results or from run on the Entry section. I have some questions about the data format and display:
3.1. I noticed that magnetic shielding values from magres files are scaled by 10^-6 because the unit is ppm, which is fine. For efg, the magres file values were scaled by ~0.01028 to convert from atomic units to V/Å^2 - is it the standard unit for representing similar parameters in the 'electronic' category in NOMAD? May I know, if we try to export the values back to a magres file, will these values be converted back to the same units represented in magres files (I don't think I can test this without publishing the dataset)?
3.2. The tensor representation in NOMAD of the ms and efg parameters is transposed. Could this be changed or is there a legacy reason why it is presented that way?
3.3. For the [nx3x3] tensors for ms and efg, the data labels in the 'value' section of DATA go from 0 to n-1. It is not easy to identify which atoms they correspond to without having the magres file open by the side. Is it possible to display the actual atomic labels as 'H1', 'O5' to indicate the first H atom, the fifth O atom, etc., for example?

Many thanks in advance.

EDIT: Attaching the magres files we used for the test, for your reference. magres_parser_check.zip

JosePizarro3 commented 4 months ago

@jryates @Sathya-S3

Thank you very much for testing the changes and giving feedback. Also, sorry for the long reply; I would like to comment on 3 main things which directly affect you.

New NOMAD plugins structure

NOMAD will become more modular, so that people can develop independent packages (or plugins) and use them in their own installations or in the central one after approval. This means that:

We are in the process of refactoring the current NOMAD data schema (the sections and quantities you checked on the DATA menu). This is mainly because it became quite cumbersome to modify and maintain certain steps. You can find the new schema being developed in its own Github repo, but please bear in mind that this is in a pre-alpha stage. We estimate that by April/May, there should be an initial version of the new data schema.
Related to this, parsers are soon going to be moved to their own independent repos. I will let you know once we move the magres parser to its own repo. I think this will be better overall when checking on the latest changes, and maybe you find it interesting to develop as well 🙂

Answering questions by @Sathya-S3 and @jryates

About the symmetry, NOMAD uses a package called MatID to classify and extract symmetry information. Am I understanding correctly that the symmetry was extracted properly by NOMAD, or due to missing these pieces of information, was it not?

3.1. I noticed that magnetic shielding values from magres files are scaled by 10^-6 because the unit is ppm, which is fine. For efg, the magres file values were scaled by ~0.01028 to convert from atomic units to V/Å^2 - is it the standard unit for representing similar parameters in the 'electronic' category in NOMAD? May I know, if we try to export the values back to a magres file, will these values be converted back to the same units represented in magres files (I don't think I can test this without publishing the dataset)?

So the units in NOMAD are defined based on the S.I., and we handle quantities with pint. Units can then be converted with pint, for example by calling .to(ureg.<desired_unit>) on the quantity.

You can test your uploads using NORTH in NOMAD. This allows you to launch a Jupyter notebook directly in a folder where you can find your uploaded data. Maybe @ladinesa can tell you the exact details on importing the MagresParser and using the parse() function in there.

3.2. The tensor representation in NOMAD of the ms and efg parameters is transposed. Could this be changed or is there a legacy reason why it is presented that way?

You are totally right, thanks for spotting this. It is clearly a mistake from my side, I will fix it asap 🙂
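
For readers following along, the kind of transposition reported in 3.2 can be illustrated with plain Python. The sketch assumes that the nine components of a tensor record are written row by row, which should be checked against the format specification; the helper names are hypothetical and this is not the NOMAD parser code.

```python
def to_3x3(values):
    """Arrange nine numbers, taken row by row, into a 3x3 nested list."""
    if len(values) != 9:
        raise ValueError("expected nine tensor components")
    return [list(values[i:i + 3]) for i in range(0, 9, 3)]


def transpose(tensor):
    """Swap rows and columns of a 3x3 nested list."""
    return [list(column) for column in zip(*tensor)]


# Deliberately non-symmetric values so that a transposition is visible.
flat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
row_by_row = to_3x3(flat)
swapped = transpose(row_by_row)

print(row_by_row[0][1], swapped[0][1])
```

A symmetric test tensor would hide this kind of mistake, so a regression test is better written with non-symmetric values like the ones above.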

3.3. For the [nx3x3] tensors for ms and efg, the data labels in the 'value' section of DATA go from 0 to n-1. It is not easy to identify which atoms they correspond to without having the magres file open by the side. Is it possible to display the actual atomic labels as 'H1', 'O5' to indicate the first H atom, the fifth O atom, etc., for example?

Very good point. However, as we work with pint.Quantity, it is not possible to define strings and floats at the same level. There are, though, a couple of alternatives we can explore:

Short term: we patch it by including a list magnetic_shielding.atom_labels (and the same for all the other NMR quantities) which contains the atoms to which the first index refers. I can also improve the description.
Your idea, which is better in the long term: magnetic_shielding and the others become a list of 3x3 tensors. For each element, we have magnetic_shielding[i].value, magnetic_shielding[i].isotropic_value, magnetic_shielding[i].atom_label (note the singular in "label"). This is more long term in the sense that we can even improve on the atom_label with the new data schema I mentioned above.
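
The two layouts above can be compared in a small, self-contained Python sketch. It uses plain lists and a dataclass rather than the NOMAD schema classes; the class name MagneticShieldingEntry and the sample numbers are made up for illustration, while the attribute names value, isotropic_value, atom_label and atom_labels are the ones proposed above. The isotropic value is taken as one third of the trace of the tensor.

```python
from dataclasses import dataclass
from typing import List

Tensor = List[List[float]]

# Option 1 (short term): one n x 3 x 3 array plus a parallel list of labels.
atom_labels = ["H1", "O5"]
values: List[Tensor] = [
    [[30.0, 0.0, 0.0], [0.0, 30.0, 0.0], [0.0, 0.0, 30.0]],
    [[250.0, 1.0, 0.0], [2.0, 260.0, 0.0], [0.0, 0.0, 270.0]],
]


# Option 2 (long term): one entry per atom, each carrying its own label.
@dataclass
class MagneticShieldingEntry:  # hypothetical name, not the NOMAD class
    atom_label: str
    value: Tensor

    @property
    def isotropic_value(self) -> float:
        """One third of the trace of the 3x3 tensor."""
        return (self.value[0][0] + self.value[1][1] + self.value[2][2]) / 3.0


# Converting from option 1 to option 2 is a simple zip over both lists.
magnetic_shielding = [
    MagneticShieldingEntry(atom_label=label, value=tensor)
    for label, tensor in zip(atom_labels, values)
]

for entry in magnetic_shielding:
    print(entry.atom_label, entry.isotropic_value)
```

In option 1 the label and the tensor are only linked by their position in two separate lists, whereas option 2 keeps them together, which is what makes it easier to read in the DATA view.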

You can let me know what you think. A screenshot or demo of option 2 might be better to fully get the idea 🙂

CCP-NC and NOMAD

We should maybe meet and talk about solutions that work for both databases. I have some feedback from other NOMAD devs, and I think we can discuss some very nice options.

Let me know if you want to meet, and when.

Sathya-S3 commented 3 months ago

Hi @JosePizarro3, thank you for the detailed responses to our questions and additional new information on NOMAD platform's development direction.

We are in the process of refactoring the current NOMAD data schema (the sections and quantities you checked on the DATA menu). This is mainly because it became quite cumbersome to modify and maintain certain steps. You can find the new schema being developed in its own Github repo, but please bear in mind that this is in a pre-alpha stage. We estimate that by April/May, there should be an initial version of the new data schema.

We'll keep watching the link for updates.

Related to this, parsers are soon going to be moved to their own independent repos. I will let you know once we move the magres parser to its own repo. I think this will be better overall when checking on the latest changes, and maybe you find it interesting to develop as well 🙂

Yes, definitely. We look forward to directly being involved in further parser development.

To the part about our initial questions,

Am I understanding correctly that the symmetry was extracted properly by NOMAD, or due to missing these pieces of information, was it not?

Yes, it was extracted correctly, even when we deliberately entered incomplete symmetry information in the magres header. I think @jryates' and my comment really was that the magres file's symmetry information is ignored by NOMAD. Down the line, it might be desirable to use the magres symmetry information as an extra validation check on the calculated symmetry?

You can test your uploads using NORTH in NOMAD. This allows you to launch a Jupyter notebook directly in a folder where you can find your uploaded data. Maybe @ladinesa can tell you the exact details on importing the MagresParser and using the parse() function in there.

I'm keen to test this further and will set some time aside for this. I'll wait first to see if @ladinesa has more information to add as you suggest.

Your ideas for the atom labels, both short and long term, sound good. Please let me know if I can be of help either with the development or by providing periodic feedback during development.

CCP-NC and NOMAD

I have a positive response from the CCP-NC working group to proceed with talks with NOMAD about our partnership. I'll prepare a list of technically focussed questions surrounding CCP-NC data in a NOMAD supported database. My colleagues from Physical Sciences Data Infrastructure (PSDI) also have data-centric and logistical questions of their own to add. We'll aim to get these questions to you within a week's time.

I'm aiming to arrange a sit-down between your team and us (me + PSDI) in the first instance, some time next week. If your team members are on holiday next week (around Easter time), we can aim to block a time slot for the week after. We can deal with the meeting specifics through email.

JosePizarro3 commented 3 months ago

Yes, it was extracted correctly, even when we deliberately entered incomplete symmetry information in the magres header. I think @jryates' and my comment really was that the magres file's symmetry information is ignored by NOMAD. Down the line, it might be desirable to use the magres symmetry information as an extra validation check on the calculated symmetry?

Ok, that sounds good. But I need to understand a bit better how to do this validation, and how the symmetry operations compare with those from MatID. Give me a few days, and perhaps I will even send you an email with more specific questions.

And perfect about the positive response 🥳 I am happy we can further collaborate and improve both NOMAD and the CCP-NC. I will let some colleagues know once you send me the questions and invitation for the Zoom, as it might require some other expertise that @ladinesa or I do not have 🙂

JosePizarro3 commented 3 months ago

Just a short follow-up:

I fixed the transpose and added a patch to include the atoms information (both label and index). This will take a bit of time to be in the Beta version of NOMAD, so give it one week before trying it out. This is how it looks:

Screenshot from 2024-03-27 15-06-12

I created a repo for magres to start moving the parser. I will let you know once there is an initial version there. This can be found in nomad-coe/nomad-parser-magres. Please note that it is currently just a plain fork without the magres parser in it.
