glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Underlay/overlay support is not tested, but used during job execution #250

Open rynge opened 1 year ago

rynge commented 1 year ago

We are seeing job failures in the OSPool due to a site not having overlay/underlay configured correctly:

INFO  Discarding path '/none'. File does not exist
INFO  Discarding path '/ceph'. File does not exist
INFO  Discarding path '/hdfs'. File does not exist
INFO  Discarding path '/lizard'. File does not exist
INFO  Discarding path '/mnt/hdfs'. File does not exist
WARNING: No layer in use (overlay or underlay), check your configuration, Singularity can't create /hadoop destination automatically without overlay or underlay
FATAL:   container creation failed: mount /hadoop->/hadoop error: while mounting /hadoop: destination /hadoop doesn't exist in container

The main problem here is that GWMS relies on a Singularity feature (binding onto a destination that does not exist in the image, which requires overlay or underlay) that was not exercised during Singularity detection. A simple test like -B $PWD:/doesnotexist would probably have been enough to catch this.
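Something along these lines is what I have in mind. This is only a sketch: the function name test_bind_missing_dest is made up and is not part of GWMS, and it assumes /doesnotexist is absent from whatever image detection uses.

```shell
# Hypothetical helper, not part of GlideinWMS.
# Returns 0 when Singularity can bind-mount onto a destination that is
# missing from the image, which only works with overlay or underlay enabled.
# $1: path to the singularity binary, $2: image used for the detection test
test_bind_missing_dest() {
    # Assumption: /doesnotexist is not present in the image.
    "$1" exec -B "$PWD:/doesnotexist" "$2" /bin/true >/dev/null 2>&1
}
```

Detection could then treat a non-zero return as "no usable overlay/underlay on this host".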

A secondary problem is that GWMS execs Singularity, which means that $_CONDOR_WRAPPER_ERROR_FILE does not get updated. The job is thus marked as a user job failure instead of a wrapper failure (which would have restarted the job somewhere else). There might be a way to configure this behavior, but I have not found it yet.
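One possible way to keep the exec and still report a wrapper failure would be a dry run with the job's own image and bind options just before the exec. A rough sketch follows. The function name preflight_or_report and the message text are hypothetical, and it assumes /bin/true exists in the image.

```shell
# Hypothetical wrapper step, not part of GlideinWMS.
# Dry-runs the container with the job's own image and bind options and
# records a wrapper failure if the container cannot be created.
# $1: singularity binary, $2: image, remaining arguments: bind options
preflight_or_report() {
    singularity_bin=$1
    image=$2
    shift 2
    # Assumption: /bin/true is available inside the image.
    if ! "$singularity_bin" exec "$@" "$image" /bin/true >/dev/null 2>&1; then
        # Record the failure in the wrapper error file when HTCondor
        # provides one, so it is not counted as a user job failure.
        if [ -n "$_CONDOR_WRAPPER_ERROR_FILE" ]; then
            echo "Singularity pre-flight failed for image $image" >> "$_CONDOR_WRAPPER_ERROR_FILE"
        fi
        return 1
    fi
    return 0
}
```

The wrapper would call it as, for example, preflight_or_report singularity "$image" -B /hadoop, exit non-zero if it fails, and only exec Singularity if it succeeds. The cost is one extra container start per job.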

mambelli commented 1 year ago

This is a bit tricky to test since the VO/job can control the image and the bind mount options. All the mount points may be present in the test image (so no overlay/underlay is needed) and things may change in the job. I could always test for underlay/overlay, but that would exclude the possibility of running with images that already have all the mount points on hosts with no overlay/underlay enabled. Should we add a flag to allow a VO to require it?

And the problem with not exec-ing is that signal propagation would not work well, causing runaway processes when jobs are killed.
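For reference, a non-exec wrapper would need something like the following to forward signals by hand. This is a minimal sketch with a made-up function name (run_and_forward), not what GWMS does. It only forwards TERM and INT, and SIGKILL cannot be trapped at all, so runaway processes remain possible.

```shell
# Hypothetical helper, not part of GlideinWMS.
# Runs a command as a child (no exec), forwards TERM/INT to it, and
# returns the child's exit status.
run_and_forward() {
    child=
    # The trap body is evaluated when the signal arrives, so it sees the PID.
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT
    "$@" &
    child=$!
    wait "$child"
    status=$?
    # A trapped signal interrupts wait; if the child is still there,
    # wait again to collect its real exit status.
    if kill -0 "$child" 2>/dev/null; then
        wait "$child"
        status=$?
    fi
    trap - TERM INT
    return "$status"
}
```

With exec none of this is needed, because Singularity replaces the wrapper process and receives the signals directly.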