pyiron / pyiron_atomistics

pyiron_atomistics - an integrated development environment (IDE) for atomistic simulation in computational materials science.
https://pyiron-atomistics.readthedocs.io
BSD 3-Clause "New" or "Revised" License
44 stars 15 forks source link

Wrap Vasp collect in try/except #1347

Closed pmrv closed 7 months ago

pmrv commented 8 months ago

Currently this breaks restarting any vasp job that did not timeout but aborted due to something else.

I still need to double check that this catches the correct error.

pmrv commented 8 months ago

@samwaseda It seems you made the original change to support recollecting when jobs timed out. I've added somewhat more general tools to contrib to do this for any job.

coveralls commented 8 months ago

Pull Request Test Coverage Report for Build 8523791503

Details


Totals Coverage Status
Change from base Build 8507305606: 0.05%
Covered Lines: 14239
Relevant Lines: 15260

💛 - Coveralls
samwaseda commented 8 months ago

Shouldn't it at least say something when the collecting fails?

pmrv commented 8 months ago

I'm looking at it. But also this fix should only be done if job.status.collect, right? If it times out before that, likely the actual calculation will be aborted as well or am I missing something?

ligerzero-ai commented 8 months ago

LGTM,

there's some weirdness going on in the code below with silent toggling of the magnetic flags on a restart... just a note for myself in the future

jan-janssen commented 8 months ago

Once this is merged, we should also release a new pyiron_atomistics version.

pmrv commented 8 months ago

Add test cases with "broken" files. Apparently @ahmedabdelkawy will provide the files.

pmrv commented 7 months ago

@ahmedabdelkawy Do you have some outputs of broken VASP runs ready? If not, I would move this to an issue and merge this anyway.

ahmedabdelkawy commented 7 months ago

If I remember correctly, and please correct me if I am wrong @pmrv, this change by @samwaseda was based on the fact that when jobs are ABORTED (for some reason), it is not collected. When one tries to restart another job from this aborted job, this fix would automatically collect the job?

pmrv commented 7 months ago

Essentially there are two cases:

  1. the job aborted because of a timeout on the queue, but at a stage where VASP was already finished;
  2. it aborted for any other reason

(1) implies that it would be possible to parse the output. This was done in the fix by @samwaseda. But this fix broke restart in case (2). This PR fixes that in turn. So we would need two test cases, one that is possible to be parsed, so any finished calculation would do, likely even the ones that we already ship with the test suite. And another one that contains any aborted VASP run as long as our parser cannot parse it. This test case should then check that it is possible to call restart on such a job.

jan-janssen commented 7 months ago

Please never upload POTCAR files, those are licensed by VASP and we are not allowed to share those.

pmrv commented 7 months ago

I've rebased to rewrite the commit that introduced a POTCAR, this way it won't even be accessible in the git history.

pmrv commented 7 months ago

@samwaseda I've noticed that your original fix doesn't change the job status after the collect. Is that intended? I don't care either way, I just want to confirm.

pmrv commented 7 months ago

@jan-janssen Imo this is done now and we can move ahead with the next version.