nomad-coe / workflow-parsers

Apache License 2.0

Linking LOBSTER calculations to the underlying DFT calculations #12

Open ondracka opened 1 year ago

ondracka commented 1 year ago

In pull request #11 it was suggested to do the following as a means of linking to the underlying DFT calculations:

```python
archive_qe = archive.m_context.resolve_archive(f'../upload/archive/mainfile/{path}')
archive.system = archive_qe.run[0].system[0]
```

I thought I would first try linking with VASP, where the mainfile is just vasprun.xml, which is automatically generated in the directory where VASP was executed (LOBSTER needs to run in the same directory afterwards).

So, for example, would this be correct?

```python
archive_vasp = archive.m_context.resolve_archive('../upload/archive/mainfile/vasprun.xml')
archive.system = archive_vasp.run[0].system[0]
```

Now, to test this I need a full NOMAD setup (like a working Oasis), correct? If I just run `nomad parse lobsterout` locally in a directory containing the lobsterout file and the corresponding vasprun.xml, I end up with `nomad.metainfo.metainfo.MetainfoReferenceError: cannot retrieve archive PkzmvrnrhINXqYtkbSekZCxGqpdX from http://nomad-lab.eu/prod/v1/api/v1`. So this can't be tested at the parser level only, or how should I do it?

BTW, regarding the linking to QE: QE does not write to a standardized location. In fact, the current QE DFT parser uses the QE stdout (which is usually redirected and saved somewhere, but that depends on the user). QE also writes XML output, but this is quite recent and parsing it is not currently supported by the electronic parsers. However, assuming the output of the QE run was indeed saved and is somewhere in the directory, should I just try `archive.m_context.resolve_archive` on every file in the directory to see if I can hit the jackpot, or how should I proceed? (A rough sketch of what I mean is below.)
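Something like this minimal sketch, assuming the parser can somehow enumerate the files next to the LOBSTER mainfile (the `find_dft_archive` helper and the `candidate_files` argument are made up for illustration, not existing NOMAD API):

```python
# Hypothetical sketch, not existing NOMAD API: brute-force scan over the
# files next to the LOBSTER mainfile, taking the first one that resolves
# to an archive containing a parsed run with a system.
def find_dft_archive(archive, candidate_files):
    for name in candidate_files:
        try:
            ref = archive.m_context.resolve_archive(
                f'../upload/archive/mainfile/{name}')
        except Exception:
            # Not a parsed mainfile (or not resolvable in this context).
            continue
        if ref is not None and ref.run and ref.run[0].system:
            return ref
    return None
```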

ondracka commented 1 year ago

BTW CC @ladinesa

JosePizarro3 commented 1 year ago

Hi @ondracka ,

This is very interesting. A couple of weeks ago I also started trying to link some codes with the underlying DFT calculation, with the "automatically resolved workflow" idea in mind.

As you pointed out, you cannot test this with `nomad parse`, but rather with a local NOMAD installation, dragging files in and printing things to the terminal (I don't think there is really any other way to do this, but we can ask @markus1978).

There are other things:

I am going to investigate this and open a merge request in electronic-parsers once I have something. We can keep in contact with each other if you want, so we don't duplicate work 🙂

ladinesa commented 1 year ago

  1. You can skip the test for the workflow.
  2. If this is not something standard, we should not implement automatic workflow generation. We run the risk of linking the incorrect calculation. It should be left to the user to generate the workflow.

JosePizarro3 commented 1 year ago

> If this is not something standard, we should not implement automatic workflow generation. We run the risk of linking the incorrect calculation. It should be left to the user to generate the workflow.

I think workflows like this are kind of standard; what is not standard is the placement of files in the upload. In any case, we could leave a try for these kinds of situations, guessing where the files are likely to be (like one folder up w.r.t. the next level), as in the sketch below.
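A minimal sketch of such a guarded try, assuming a small hard-coded list of likely relative locations (the `GUESSES` list and `try_link_dft` helper are hypothetical, not existing NOMAD API):

```python
# Hypothetical sketch: try a few likely relative locations for the DFT
# mainfile and stay silent if none resolves. The GUESSES list (same
# directory, one folder up) is an assumption, not NOMAD policy.
GUESSES = ['vasprun.xml', '../vasprun.xml']

def try_link_dft(archive):
    for guess in GUESSES:
        try:
            ref = archive.m_context.resolve_archive(
                f'../upload/archive/mainfile/{guess}')
            return ref.run[0].system[0]
        except Exception:
            # Wrong guess or unresolvable reference: better no link
            # than an incorrect one.
            continue
    return None
```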

ladinesa commented 1 year ago

But is the mainfile of the reference calculation specified in the mainfile? What if there are several of these files in different locations?

JosePizarro3 commented 1 year ago

> But is the mainfile of the reference calculation specified in the mainfile?

This is a challenge, indeed. Maybe (I am just starting to explore this) we can resolve it from the upload itself? Example: 1 DFT mainfile vasprun.xml and 4 GW mainfiles from Yambo at different k-grids, where the DFT mainfile is placed in the main directory while the 4 GW mainfiles sit in subfolders. There we could resolve it, I think (see the sketch below).
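A minimal sketch of this layout-based guess (the helper name and the `upload_file_names` listing are assumptions for illustration):

```python
import os

# Hypothetical sketch of the layout-based idea above: the GW mainfile
# sits in a subfolder, so guess that the DFT mainfile lives one
# directory up, and only trust the guess if it is unambiguous.
def guess_dft_mainfile(gw_mainfile_path, upload_file_names):
    parent = os.path.dirname(os.path.dirname(gw_mainfile_path))
    candidates = [
        f for f in upload_file_names
        if os.path.dirname(f) == parent and f.endswith('vasprun.xml')
    ]
    return candidates[0] if len(candidates) == 1 else None
```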

In some cases, codes state in their output which DFT code they come from (like this LOBSTER case, right?).

> What if there are several of these files in different locations?

Then the try raises its exception and we simply do not link anything, as we cannot predict people moving files around too much.

ladinesa commented 1 year ago

We generate the workflow only if we can find the correct number of reference files. My opinion is that we should be as conservative as possible when generating these workflows automatically: it is better not to have them than to have incorrect links (see the sketch below). Isn't this the case for XSpectra? Since it does not specify the starting point, we did not try to generate the workflow automatically.
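A minimal sketch of this conservative policy (the `link_if_unique` helper is hypothetical):

```python
# Hypothetical sketch of the conservative policy: link only when exactly
# one candidate reference mainfile exists; otherwise do nothing.
def link_if_unique(archive, candidates):
    if len(candidates) != 1:
        # Ambiguous or missing: better no workflow than an incorrect link.
        return None
    return archive.m_context.resolve_archive(
        f'../upload/archive/mainfile/{candidates[0]}')
```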

JosePizarro3 commented 1 year ago

Well, XSpectra is an easy case, as it always comes from QE. In that case we just need to check a couple of things from the output of the DFT entry, hence we can work out the automatic workflow:

1 QE file for the ground state, N QE files for the excited states, and 3N XSpectra files (there are 3 dipoles per core-hole).

In my opinion, if we know which files have the proper metainfo, we can resolve the automatic workflow. It is then a matter of properly parsing SinglePoints and scanning sections; a sketch of the counting check is below. Do you think this makes sense, or is it better in practice not to even try?
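A minimal sketch of the file-count consistency check described above (the helper name is made up):

```python
# Hypothetical sketch of the XSpectra consistency check: 1 ground-state
# QE mainfile, N excited-state QE mainfiles, and 3N XSpectra mainfiles
# (3 dipoles per core-hole).
def xspectra_counts_consistent(n_groundstate, n_excited, n_xspectra):
    return (
        n_groundstate == 1
        and n_excited >= 1
        and n_xspectra == 3 * n_excited
    )
```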

JosePizarro3 commented 1 year ago

A small note: this of course only works in the same upload, not across uploads.

ladinesa commented 1 year ago

We can implement automatic workflow generation but, again, we should only create the link if we can uniquely identify the reference calculations.

Yes, of course, only within the same upload. It is up to the user to link tasks across uploads.

JosePizarro3 commented 1 year ago

Indeed, you are totally right. Only in safe situations where we can double-check the metadata.