Closed aidanheerdegen closed 4 years ago
This has been solved:
It appears to be an issue with how the clients (i.e. compute nodes) cache directory entries: an `mv` on one client doesn't invalidate the corresponding directory entries on other clients. As a result, when you recreate a directory with the same name, you can't see the new version from another client that already has the old entry cached; you have to wait for that old cache entry to be evicted (which usually happens through memory pressure on the compute nodes). The reason it appears randomly between compute jobs is that it depends on whether you land on a node that already has that path cached (i.e. one that a previous job ran on).

We'll report this up to our Lustre support provider, but in the meantime I can see three possibilities:

1. Don't use the same name for different jobs, e.g. add a numeric suffix to the path that you increment for each run.
2. Copy the directory into the archive folder and delete the source, rather than moving it.
3. Wait for a period between jobs, so that existing cached entries have been evicted by the time the second job runs.

Of those, I think the second is probably the easiest to implement. Indeed, it likely wouldn't even need to be a full copy of all the contents: just create a new directory in the archive folder, move the contents between the directories, and then delete the now-empty source.
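A minimal sketch of that second workaround, assuming nothing about payu's actual archiving code (the function name and arguments here are illustrative): rather than renaming the whole output directory, create a fresh directory in the archive, move the contents across, and remove the now-empty source, so no other client's cached entry for the old path is ever reused.

```python
import shutil
from pathlib import Path

def archive_outputs(src: Path, archive_root: Path) -> Path:
    """Archive job outputs without reusing a cached directory entry.

    Instead of `mv src archive/` (whose renamed directory entry may
    remain cached on other Lustre clients), create a brand-new
    directory under the archive, move the *contents* across, then
    remove the now-empty source. Hypothetical helper, not payu's API.
    """
    dest = archive_root / src.name
    dest.mkdir(parents=True)           # fresh directory entry in the archive
    for item in src.iterdir():
        shutil.move(str(item), str(dest / item.name))
    src.rmdir()                        # source is empty now
    return dest
```

Because the files themselves are moved (not copied), this stays cheap on the same filesystem while still avoiding the rename of the directory inode that triggers the stale-cache behaviour.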
@aidanheerdegen that's great!
So if it's solved, let's close the issue. Will the solution that was implemented impose any changes on my normal payu workflow or the directory structure of the output?
The problem is solved, but I've not uploaded the code fix yet. I opened this issue so there would be an electronic trail of the problem.
@navidcy MOM6 jobs error on resubmit. They look like this in `mom6.err`:

and in `mom6.out`:

The "zero" node is fine, but all the other execute nodes appear to have a stale version of the `work` directory, in which the link to the executable has been deleted.