Wrap Vasp collect in try/except - Githubissues

pyiron / pyiron_atomistics

pyiron_atomistics - an integrated development environment (IDE) for atomistic simulation in computational materials science.

https://pyiron-atomistics.readthedocs.io

BSD 3-Clause "New" or "Revised" License

44 stars 15 forks source link

Wrap Vasp collect in try/except #1347

Closed pmrv closed 7 months ago

pmrv commented 8 months ago

Currently this breaks restarting any vasp job that did not timeout but aborted due to something else.

I still need to double check that this catches the correct error.

pmrv commented 8 months ago

@samwaseda It seems you made the original change to support recollecting when jobs timed out. I've added somewhat more general tools to contrib to do this for any job.

coveralls commented 8 months ago

Pull Request Test Coverage Report for Build 8523791503

Details

7 of 7 (100.0%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.05%) to 93.309%

Totals
Change from base Build 8507305606:	0.05%
Covered Lines:	14239
Relevant Lines:	15260

💛 - Coveralls

samwaseda commented 8 months ago

Shouldn't it at least say something when the collecting fails?

pmrv commented 8 months ago

I'm looking at it. But also this fix should only be done if job.status.collect, right? If it times out before that, likely the actual calculation will be aborted as well or am I missing something?

ligerzero-ai commented 8 months ago

LGTM,

there's some weirdness going on in the code below with silent toggling of the magnetic flags on a restart... just a note for myself in the future

jan-janssen commented 8 months ago

Once this is merged, we should also release a new pyiron_atomistics version.

pmrv commented 8 months ago

Add test cases with "broken" files. Apparently @ahmedabdelkawy will provide the files.

pmrv commented 7 months ago

@ahmedabdelkawy Do you have some outputs of broken VASP runs ready? If not, I would move this to an issue and merge this anyway.

ahmedabdelkawy commented 7 months ago

If I remember correctly, and please correct me if I am wrong @pmrv, this change by @samwaseda was based on the fact that when jobs are ABORTED (for some reason), it is not collected. When one tries to restart another job from this aborted job, this fix would automatically collect the job?

Now the scenarios that I can think of for these aborted jobs: are either a timeout error, some other error that is code-specific, or a segmentation error.
In my experience, because of how the parsers are written, if you manually run collect on an aborted job, sometimes the parser will be able to collect what it can (with maybe some missing info), or it will raise an error maybe because of unexpected digits/letters.
The reason I am asking this is that I need a scenario to generate files that needs to be tested, otherwise it is too generic and there are too many cases.
On a side note, I have not yet fixed the signal capturing from the queuing system upon runtime error.
I would suggest we merge it as it is until further discussed.

pmrv commented 7 months ago

Essentially there are two cases:

the job aborted because of a timeout on the queue, but at a stage where VASP was already finished;
it aborted for any other reason

(1) implies that it would be possible to parse the output. This was done in the fix by @samwaseda. But this fix broke restart in case (2). This PR fixes that in turn. So we would need two test cases, one that is possible to be parsed, so any finished calculation would do, likely even the ones that we already ship with the test suite. And another one that contains any aborted VASP run as long as our parser cannot parse it. This test case should then check that it is possible to call restart on such a job.

jan-janssen commented 7 months ago

Please never upload POTCAR files, those are licensed by VASP and we are not allowed to share those.

pmrv commented 7 months ago

I've rebased to rewrite the commit that introduced a POTCAR, this way it won't even be accessible in the git history.

pmrv commented 7 months ago

@samwaseda I've noticed that your original fix doesn't change the job status after the collect. Is that intended? I don't care either way, I just want to confirm.

pmrv commented 7 months ago

@jan-janssen Imo this is done now and we can move ahead with the next version.