payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0
19 stars 26 forks source link

Error when trying to delete restart folder that is symbolic link #205

Closed rmholmes closed 4 years ago

rmholmes commented 4 years ago

This is a pretty minor issue but I'm getting an error when payu tries to remove a restart directory that is actually a symbolic link. The offending line is

https://github.com/payu-org/payu/blob/89375e9960c5769ac54e1c8dd89334a895cff891/payu/experiment.py#L755-L757

os.path.isdir(res_path) still seems to return True for symbolic links, despite the intention of the if statement:

[rmh561@raijin3 1deg_jra55_ryf]$ ls -lh fake_link
lrwxrwxrwx 1 rmh561 e14 38 Oct 31 19:33 fake_link -> /g/data/e14/rmh561/fake_link_location/
[rmh561@raijin3 1deg_jra55_ryf]$ python3
Python 3.6.7 | packaged by conda-forge | (default, Jul  2 2019, 02:18:42)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.isdir('fake_link')
True

Hence I get a OSError: Cannot call rmtree on a symbolic link at the end of the run.

Suggested fix: Change the os.path.isdir(res_path) to (not os.path.islink(res_path))?

marshallward commented 4 years ago

I'm testing on my home desktop and it seems to report symlinks as directories.

$ mkdir a
$ ln -s a b
$ ls -lh b
lrwxrwxrwx 1 marshall users 1 Nov  3 10:03 b -> a
$ python
>>> import os; os.path.isdir('b')
True

This is very legacy code, and would be very unusual if it were simply symlink directories being misinterpreted as non-directories.

Can you add some more context on the problem here? It could be that res_path is a dead symlink, for example, for which isdir would report False. This could also be filesystem dependent, though even that seems unlikely to me.

rmholmes commented 4 years ago

Thanks Marshall,

Shouldn't it be reporting the sym-link as not a directory. I.e. you want False there. Thus, the if statement is will not be satisfied and it won't try to delete the symlink (according to Only delete real directories; ignore symbolic restart links)? The error for me comes on the shutil.rmtree line - it tries to remove the symlink and fails with OSError: Cannot call rmtree on a symbolic link.

[rmh561@raijin2 1deg_jra55_ryf]$ pwd
/short/e14/rmh561/access-om2/archive/1deg_jra55_ryf
[rmh561@raijin2 1deg_jra55_ryf]$ ls -lh restart059
lrwxrwxrwx 1 rmh561 e14 80 Nov  4 08:38 restart059 -> /short/e14/ql1694/access-om2/archive/1deg_jra55_ryf9091_qian_control/restart059/
[rmh561@raijin2 1deg_jra55_ryf]$ python3
Python 3.6.7 | packaged by conda-forge | (default, Jul  2 2019, 02:18:42) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.isdir('restart059')
True
>>> import shutil
>>> shutil.rmtree('restart059')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/shutil.py", line 490, in rmtree
    onerror(os.path.islink, path, sys.exc_info())
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07/lib/python3.6/shutil.py", line 488, in rmtree
    raise OSError("Cannot call rmtree on a symbolic link")
OSError: Cannot call rmtree on a symbolic link
>>>
rmholmes commented 4 years ago

To put it more simply, the intention of the if statement is to pass over symlinks and not try to delete them. This is exactly what would be achieved if os.path.isdir is replaced with not os.path.islink - my suggested change.

marshallward commented 4 years ago

Right, sorry. The comment contradicts the behaviour of the function, I did not catch that.

The change seems OK to me. But given that this is low level stuff, I think @aidanheerdegen should check that this change is correct.

The only thing I can suggest is perhaps do isdir() and not islink() though maybe even that is too defensive.

aidanheerdegen commented 4 years ago

I don't understand under what circumstances this occurs. It would be good to nail that down as it might be the result of a bug

rmholmes commented 4 years ago

The situation is this:

I am continuing a run from Qian and Andrew on the 1-degree RYF9091. To continue it I simply added links in my archive to their restart and output directories:

[rmh561@raijin2 1deg_jra55_ryf]$ pwd
/home/561/rmh561/access-om2/1deg_jra55_ryf
[rmh561@raijin2 1deg_jra55_ryf]$ ls -lh archive/*059*
lrwxrwxrwx 1 rmh561 e14 76 Oct 29 09:30 archive/output059 -> /g/data/hh5/tmp/cosima/access-om2/1deg_jra55_ryf9091_qian_control/output059/
lrwxrwxrwx 1 rmh561 e14 80 Nov  4 08:38 archive/restart059 -> /short/e14/ql1694/access-om2/archive/1deg_jra55_ryf9091_qian_control/restart059/

This works fine for the first four runs. However, after the fifth run payu tries to delete the restart059/ directory and fails (because its really a link, and shutil.rmtree doesn't work on links). The run is successful but since I've used payu run -n 20 and an error is thrown, the next job submission does not occur (meaning I have to start it manually again). If I delete the link manually of course it works fine.

It's quite possible that no one has run into this before because instead of adding links to the folders in archive/, they instead add those folders directly in config.yaml as input lines. This might be a better way to do it.

As I said, this is a minor issue. But nevertheless that if statement is clearly wrong and doesn't serve the purpose for which it was included.

rmholmes commented 4 years ago

@marshallward - I agree that isdir() and not islink() is better. I've changed it.

aidanheerdegen commented 4 years ago

This was an issue ages, that Paul noted, and this commit was supposed to explicitly fix (and I thought it had)

https://github.com/payu-org/payu/commit/4ff69ec7751b892fee5b51bab779b9f3fbbfe4af

I agree that for whatever reason it has popped up again.

@marshallward created the restart: /directory/path/to/restart option for config.yaml for this express purpose.

restart: /short/e14/ql1694/access-om2/archive/1deg_jra55_ryf9091_qian_control/restart059/

I see that there are manifests in here:

/home/561/ql1694/access-om2/1deg_jra55_ryf9091_qian_control

so another option is to specify a restart run, which will pick up the restarts from the restart manifest.

payu run -n 20 -r

which would re-run the 059 run. Or get Qian to do

payu setup
git commit -a -m 'Create new restart manifest for continuation run'

and git clone and then specify a restart run (-r), and it would then use the restarts from the 059 run.

Make sense?

aidanheerdegen commented 4 years ago

And yes, updating the docs is on my TO-DO list. Has been for a while!