glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Hosted CE module loading / singularity detection weirdness #395

Closed mambelli closed 9 months ago

mambelli commented 9 months ago

Describe the bug: A hosted CE is failing to detect Apptainer. According to the detailed logs (job.3610066.0.err.txt, job.3610066.0.out.txt), Apptainer runs correctly (exit code 0), but its output fails validation. Possible causes are:

There could also be a problem with the image or with the Apptainer execution. This behavior requires further investigation.

To Reproduce: Apptainer is invoked in tmp/glide_Pb9H5L/main/singularity_setup.sh with the command:

"/cm/local/apps/apptainer/current/bin/singularity" -vvv -d  exec --home "/tmp/glide_Pb9H5L":/srv --pwd /srv  --ipc --contain  --pid --bind "/etc/hosts,/etc/localtime"  "/tmp/glide_Pb9H5L/images/htc__rocky__9.sif" cat /proc/self/uid_map

And the result is:

INFO  Singularity at '/cm/local/apps/apptainer/current/bin/apptainer' failed (ec:0) w/ unexpected output

Expected behavior: A run that returns exit code 0 should also produce the correct output. Getting exit code 0 together with unexpected output is not expected.
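The distinction between the exit code and the output check can be sketched as follows. This is only an illustration: `check_cmd_output` is a hypothetical helper, not a function from singularity_lib.sh, and the messages merely imitate the log line above.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from singularity_lib.sh.
# Runs a command and reports exit code and output validity separately.
# $1: extended regex that at least one output line must match
# remaining arguments: the command to run
check_cmd_output() {
    local pattern out ec
    pattern=$1
    shift
    # Assign separately from "local" so that $? is the command's exit code
    out=$("$@" 2>/dev/null)
    ec=$?
    if [ "$ec" -ne 0 ]; then
        echo "failed (ec:$ec)"
        return 1
    fi
    if ! printf '%s\n' "$out" | grep -Eq -- "$pattern"; then
        # The case reported here: the command succeeded, the output did not validate
        echo "failed (ec:0) w/ unexpected output"
        return 2
    fi
    echo "ok"
}
```

For example, `check_cmd_output '^hello$' echo bye` prints `failed (ec:0) w/ unexpected output` and returns 2, while a command that itself fails is reported with its own exit code.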


mambelli commented 9 months ago

To patch a 3.10.5 or 3.10.6 Factory, replace the content of /var/lib/gwms-factory/web-base/singularity_lib.sh with https://raw.githubusercontent.com/mambelli/glideinwms/v310/i395_apptainer_test/creation/web_base/singularity_lib.sh and run a Factory upgrade. The patch accepts apptainer/singularity as long as it returns exit code 0, and it provides further output for troubleshooting.

mambelli commented 9 months ago

I started PR #396 to work on this issue. I'd like to do some further troubleshooting before merging a final solution, but the current content (the patch above) will get you going.

mambelli commented 9 months ago

I updated PR #396 to fix the issue. The validation did not handle the case where the uid_map output had no leading blank. The link above points to the updated solution, which fixes the validation and fails the test when the output is wrong. You can use it for patching until the next release.
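A minimal sketch of a check that tolerates a missing leading blank is shown below. It is not the code from PR #396: `is_uid_map_output` is a hypothetical name, and the assumption that a valid line consists of three unsigned integers separated by blanks is mine, based on the usual layout of /proc/self/uid_map.

```shell
#!/bin/sh
# Hypothetical check, NOT the code from PR #396.
# Reads stdin and succeeds if at least one line looks like a uid_map entry:
# three unsigned integers separated by blanks. Leading blanks are optional,
# so "0 1000 1" validates just like "         0       1000          1".
is_uid_map_output() {
    grep -Eq '^[[:space:]]*[0-9]+[[:space:]]+[0-9]+[[:space:]]+[0-9]+[[:space:]]*$'
}
```

It could be used as `cat /proc/self/uid_map | is_uid_map_output && echo valid`. A pattern that required the leading blanks, such as one starting with `^[[:space:]]+`, would reject output that starts directly with a digit even though the command itself returned exit code 0.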

mmascher commented 9 months ago

Not re-opening this just yet, but a T0 operator reported an issue with "paused jobs":

image

These are the job log files: http://mmascher.web.cern.ch/mmascher/Job_1080.tar.bz2

I will investigate more tomorrow morning.

LinaresToine commented 9 months ago

Thank you @mmascher

mmascher commented 9 months ago

FYI: After some investigation, the issue with the T0 jobs does not seem to be related to this patch.