payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

MOM6 run errors on resubmit #238

Closed: aidanheerdegen closed this issue 4 years ago

aidanheerdegen commented 4 years ago

@navidcy MOM6 jobs error on resubmit. They look like this in mom6.err:

mpirun was unable to launch the specified application as it could not access
or execute an executable:

Executable: /scratch/v45/aph502/mom6/work/HenkTestRun/MOM6-25Jan2020
Node: gadi-cpu-clx-2720

while attempting to start process rank 288.

and in mom6.out:

952 total processes failed to start

The "zero" node is fine, but all the other execute nodes appear to have a stale version of the work directory, in which the link to the executable has been deleted.

aidanheerdegen commented 4 years ago

This has been solved:

It appears to be an issue with how the clients (i.e. compute nodes) cache the directories: the mv on one client doesn't cause directory entries on other clients to be invalidated. As a result, when you recreate a directory with the same name, you can't see the new version from another client that already has the old entry cached. You need to wait for that old cache entry to be evicted (which would usually happen through memory pressure on the compute nodes). The reason it appears randomly between compute jobs is that it depends on whether you get a node that already has that path cached (i.e. one that a previous job ran on).

We'll report this up to our Lustre support provider, but in the meantime I can see three possibilities:

- don't use the same name for different jobs, e.g. add a numeric suffix to the path that you increment for each run
- copy the directory into the archive folder and delete the source, rather than moving it
- wait for a period between jobs, so that existing cached entries have been evicted by the time the second job runs

Of those, I think the second is probably the easiest to implement. Indeed, it likely wouldn't even need to be a full copy of all the contents: just create a new directory in the archive folder, move the contents between the directories, and then delete the now-empty source.
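For illustration, the second option (create a new directory in the archive folder, move the contents across, then delete the empty source) might look roughly like the Python sketch below. This is not the fix that went into payu; the function name `archive_by_moving_contents` and the example paths are hypothetical, and it assumes the archive directory does not already exist.

```python
import os
import shutil
import tempfile


def archive_by_moving_contents(work_dir, archive_dir):
    """Hypothetical helper: archive work_dir without renaming the directory itself."""
    # Create a fresh directory in the archive instead of moving work_dir there,
    # so the old directory entry is never reused under a new location.
    os.makedirs(archive_dir, exist_ok=False)

    # Move each entry (files, symlinks, subdirectories) into the new directory.
    for name in os.listdir(work_dir):
        shutil.move(os.path.join(work_dir, name), os.path.join(archive_dir, name))

    # Remove the now-empty source directory.
    os.rmdir(work_dir)


if __name__ == "__main__":
    # Small self-contained demonstration in a temporary directory.
    with tempfile.TemporaryDirectory() as root:
        work = os.path.join(root, "work")
        os.makedirs(work)
        with open(os.path.join(work, "mom6.out"), "w") as f:
            f.write("example\n")

        archive = os.path.join(root, "archive", "output000")
        archive_by_moving_contents(work, archive)

        print(sorted(os.listdir(archive)))
        print(os.path.exists(work))
```

Whether this avoids the stale cache entries in practice would depend on the Lustre client behaviour described above, so it would need testing on the real system.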

navidcy commented 4 years ago

@aidanheerdegen that's great!

So if it's solved let's close the issue. Will the solution that was implemented impose any change(s) in my normal payu behavior and the directory structure of output?

aidanheerdegen commented 4 years ago

The problem is solved, but I haven't uploaded the code fix yet. I opened this issue so there would be an electronic trail of the problem.