payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0
18 stars 25 forks source link

Existing experiment is not detected if archive does not exist #420

Closed aidanheerdegen closed 2 months ago

aidanheerdegen commented 4 months ago

If an existing experiment does not have an archive path but a legacy archive path exists the current logic will assume it is a legacy experiment.

I discovered this when I made a new experiment, incorrectly configured it when I ran, so did a payu sweep --hard to remove the unwanted outputs and start again.

For the logic contained here:

https://github.com/payu-org/payu/blob/b6db50b8502ebc7cfbaa9bc12d8920517f332ac7/payu/metadata.py#L178-L196

is_new_experiment == False
archive_path.exists() == False
legacy_archive_path.exists() == True

So the branch at https://github.com/payu-org/payu/blob/b6db50b8502ebc7cfbaa9bc12d8920517f332ac7/payu/metadata.py#L181 is the one taken.

As soon as I created archive_path for my new experiment payu takes the branch above and it works.

jo-basevi commented 3 months ago

Yeah as you have explained, the existence of archives to help determine whether its an legacy experiment or to use an branch-UUID experiment name, or if a new UUID needs to be generated. If payu sweep --hard was run in an experiment without an existing legacy archive, it would then create a new UUID and archive on subsequent calls to payu setup/run.

Solutions:

  1. payu sweep --hard does not delete archive folder but deletes the contents. Con: This leaves behind empty directories, new UUIDs won't be generated on this local copy of the branch again.
  2. Add another option to payu sweep that deletes outputs but leaves behind archive folder and metadata.
  3. Move the keep_uuid logic (which leaves the UUID unchanged if it exists) to before checking for legacy archives. Then could use payu checkout --keep-uuid $BRANCH_NAME post running payu sweep --hard to re-setup the archive. The --keep-uuid/-k option could be added to payu setup. Con: It's not that intuitive for users to use --keep-uuid/-k post running payu sweep --hard.
  4. Add a check whether the UUID in the metadata matches the UUID in the legacy archive's metadata. If the UUIDs don't match then, don't use the legacy archive. Con: This check only really works if legacy archive contains a metadata file with a UUID (i.e. if the legacy experiment has been run with new payu changes).

The issue to solution 1 is that new UUIDs won't automatically be generated on that branch even if its "starting" again. For example, say an experiment output was synced to remote directory, then payu sweep --hard was run. I think subsequent calls to payu setup/run on that branch should generate a new UUID, unless it was continuing the same experiment (starting from the latest restart). (Note to start from restarts in this case, can use payu checkout -k -r $RESTART_PATH $BRANCH_NAME to set up the archive again).

aidanheerdegen commented 3 months ago

Add a check whether the UUID in the metadata matches the UUID in the legacy archive's metadata. If the UUIDs don't match then, don't use the legacy archive. Con: This check only really works if legacy archive contains a metadata file with a UUID (i.e. if the legacy experiment has been run with new payu changes).

If the legacy doesn't contain a metadata.yaml file with a UUID then isn't that the same as UUIDs not matching?

I think subsequent calls to payu setup/run on that branch should generate a new UUID, unless it was continuing the same experiment (starting from the latest restart)

I agree. We can't tell the difference between an empty/missing archive, and we can't know why it might be missing or empty. As you say it is perfectly possible that a whole experiment is sync'ed to a different filesystem and the files in /scratch have been purged.

Almost the worst thing we can do is continue to use the UUID if there is a possibility that it belongs to an existing experiment. So unless expressly told otherwise, we should just generate a new UUID/experiment.

jo-basevi commented 3 months ago

An case of this happening could be if an legacy experiment was run for the first time with payu run/setup. So it'll find no UUID in metadata -> generate a new UUID -> finds a legacy archive -> uses legacy archive for experiment name.

Hmm though I guess the above case could be caught by checking whether a new UUID has been updated or not at that point..

if legacy archive exists and 
    (new UUID generated or (legacy_metadata exists and legacy_UUID exists and legacy_UUID == UUID)):
        use legacy archive
jo-basevi commented 3 months ago

Ok I only just realised the above logic would break if a legacy experiment has metadata file with a UUID but it has not been copied over to archive yet (so it hasn't been run with the latest payu changes). There could be a few cases like this as experiment_uuid was a pre-existing metadata field prior to the latest payu changes. So I think can only really check for not (not new UUID generated and legacy_metadata exists and legacy_UUID exists and legacy_UUID != UUID) which is:

Like I mentioned earlier, this will only really catch the case when legacy has already been run with payu setup/run with the latest payu changes. I think this is doing a good job of highlighting how error-prone using branches alongside legacy experiments can be..

aidanheerdegen commented 3 months ago

I think this is doing a good job of highlighting how error-prone using branches alongside legacy experiments can be..

Doh!

So I think if you have a solid improvement that catches more cases, or is consistent with more cases, then we should implement it even if there are corner cases that will still trip it up.

Once it's implemented it would be good to have a "recommendations for use with legacy experiments" short explainer. That could go in the docs or the added to the payu tutorial on the forum.