Simonassaf opened 3 months ago
Discussion on nf-core/sarek Slack about a seemingly very similar issue:
https://nfcore.slack.com/archives/CGFUX04HZ/p1718984599684549
Hey! Something seems to be wrong in a weird way. I see that at least 40 tasks finished successfully, so it's not generally wrong, but it borks in some cases. Are you able to retrieve which compute nodes the jobs ran on? Is there a particular node that has issues?
I have been working with @Simonassaf on investigating this. There were 2 runs done on the head node of an SGE cluster. Both of them failed exactly after 40 samples.
As a workaround suggested by @asp8200 on Slack, I have edited the pipeline and simply removed the call to extract the version from bwa, hardcoding it instead.
Weirdly enough, that seems to be working. It is really strange that only the calls to get the version fail and not the actual command itself.
We will do some test runs soon, both running everything on a single compute node and using the sge executor, to see if this failure pops up as well.
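For reference, a sketch of what the edit looks like (the original version-capture pattern is paraphrased from the nf-core bwa/mem module; the hardcoded numbers below come from the versions.yml shown later in this thread and should be pinned to whatever the container actually ships):
# Original pattern (paraphrased): the versions are captured by running
# the tools at the end of the script block, so a broken environment can
# leak loader errors into the file.
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    bwa: \$(echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//')
    samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //')
END_VERSIONS

# Workaround: hardcode the versions so no tool is invoked at all.
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    bwa: 0.7.18
    samtools: 1.20
END_VERSIONS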
It is indeed very weird. I don't think we ever got to the bottom of the problem, but I'll ask my colleagues.
When running the pipeline over multiple nodes using the sge executor, this issue does not occur.
So maybe it is triggered by running with singularity on a single node?
I got that problem again. This time with Sarek/3.4.4 (test_full_germline with singularity):
ERROR ~ mapping values are not allowed here
in 'reader', line 2, column 13:
bwa: bwa: error while loading shared lib ...
^
-- Check script '/ngc/tools/ngctools/sarek/3.4.4/assets/nf-core/sarek/./workflows/sarek/../../subworkflows/local/utils_nfcore_sarek_pipeline/../../nf-core/utils_nfcore_pipeline/main.nf' at line: 106 or see '.nextflow.log' file for more details
It used to be that CUSTOM_DUMPSOFTWAREVERSIONS would fail due to garbage (error messages) in some versions.yml-files, and now I believe something similar might be happening with the functionality that replaced CUSTOM_DUMPSOFTWAREVERSIONS.
Most of the versions.yml-files for the BWA-jobs look like this:
"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
bwa: 0.7.18-r1243-dirty
samtools: 1.20
which seems fine to me, but I found one that looks like this:
"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
bwa: bwa: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory
samtools: samtools: error while loading shared libraries: libhts.so.3: cannot open shared object file: No such file or directory
And that versions.yml-file is from a BWA-job which Nextflow claims finished successfully 😬
@maxulysse : I thought the new "versions"-functionality was immune to those kinds of malformed versions.yml-files?
@pontus and @adamrtalbot : I'm thinking this issue might also be of interest to you.
I'm slightly worried about the fact that the calls to bwa and samtools fail in the "version section" of the bwa-module while the corresponding bwa-job still completes successfully (at least according to Nextflow). The pipeline only fails because the versions.yml-file is malformed. Does that mean that bwa and samtools could have somehow failed "silently" in other ways (not affecting the "versions section" but perhaps the alignment itself 🫣) in other runs?
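For what it's worth, the parser failure itself is easy to reproduce outside the pipeline. A minimal sketch, assuming Python with PyYAML is available (Nextflow uses SnakeYAML, but it trips over the same construct):
# An unquoted value containing ": " turns the scalar into an illegal
# nested mapping, which is exactly the "mapping values are not allowed
# here" error at line 2, column 13.
cat > bad_versions.yml <<'EOF'
"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
    bwa: bwa: error while loading shared libraries: libz.so.1
EOF
python3 -c 'import yaml; yaml.safe_load(open("bad_versions.yml"))'
# yaml.scanner.ScannerError: mapping values are not allowed here
#   in "bad_versions.yml", line 2, column 13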
Can anyone trigger this at will? In that case it would be interesting to see. I recall some rather weird instances of this happening but generally, my first guess as to why would be some cluster where singularity is unable to mount the image directly but rather unpacks it onto some file system that behaves oddly.
The error pops up once in a while on our HPC. Today it wasn't BWA but VEP that triggered the error:
ERROR ~ mapping values are not allowed here
in 'reader', line 2, column 21:
ensemblvep: perl:
^
-- Check script '/ngc/tools/ngctools/sarek/3.4.4/assets/nf-core/sarek/./workflows/sarek/../../subworkflows/local/utils_nfcore_sarek_pipeline/../../nf-core/utils_nfcore_pipeline/main.nf' at line: 106 or see '.nextflow.log' file for more details
The versions.yml for VEP is malformed and looks like this:
"NFCORE_SAREK:SAREK:VCF_ANNOTATE_ALL:VCF_ANNOTATE_ENSEMBLVEP:ENSEMBLVEP_VEP":
ensemblvep: perl:
I'm running test_full_germline on sarek/3.4.4, jdk/21.0.1, NF 24.04.3, and singularity/4.0.2.
We thought it might have something to do with too many jobs being submitted at the same time to pbs/torque, but now it occurred with queueSize = 5. This is the executor config:
executor {
name = 'pbs'
queueSize = 5
pollInterval = '30 sec'
submitRateLimit = '3 sec'
exitReadTimeout = '600 sec'
}
The same job can produce a valid versions.yml when re-submitted manually with qsub .command.run 🤔
@pontus : The problem with error messages getting written into the versions.yml-files goes away if I (1) hardcode the version numbers in the versions.yml-generating code in the modules, or (2) move the versions.yml-generating code to the very top of the script-section of the modules.
I think this whole issue is due to instability on our HPC, but it would be nice if Sarek was somehow set up to handle such instability in a better way. If a malformed versions.yml is written, then the run of the pipeline does eventually fail, but it is the Nextflow job that crashes, with a message like
ERROR ~ mapping values are not allowed here
in 'reader', line 2, column 13:
bwa: bwa: error while loading shared lib ...
(I think the module-job itself should have crashed.)
Restarting the pipeline with -resume doesn't help; the pipeline will fail with the same message, because NF thinks the module was executed correctly and so it doesn't re-run the module-job.
The instability on our HPC also results in a lot of errors like the following, which is from the execution of the nf-modules in Sarek:
samtools: error while loading shared libraries: libhts.so.3: cannot open shared object file: No such file or directory
Those kinds of errors cause the execution of the module-job to fail, and so the module-job gets retried. I'd like to see the same behaviour from NF/Sarek if the error occurs during the execution of the versions.yml-generating code snippet of the modules.
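One way to approximate that behaviour today would be a guard at the end of each module's script section. A minimal sketch (the grep pattern and its placement are my own assumption, not something Sarek ships):
# Fail the task itself, and thereby trigger the normal retry logic, if a
# tool invocation leaked a loader error into versions.yml instead of a
# version number.
if grep -q 'error while loading shared' versions.yml; then
    echo 'versions.yml captured an error message; failing task for retry' >&2
    exit 1
fi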
It's likely the case that the version-handling code doesn't do the right thing, exactly vis-à-vis errors.
On the other hand, there's only so much that can be done to manage broken environments, and getting it fixed for all modules and making everything behave seems non-trivial.
Instead, my recommendation would be to reduce, as far as possible, how much you suffer when your cluster is unstable. I've already mentioned that this looks to me like a setup where the images are unpacked rather than mounted. Is that the case? If so, I'd recommend fixing that (apptainer has an installer for a non-privileged version that will do the right thing on most current systems).
How do I tell if "images are unpacked rather than mounted"?
From our custom.config:
singularity {
enabled = true
autoMounts = true
}
I'm guessing that means the images are mounted, no?
No, that's not it.
Since I guess this is on HPC, the first thing is to figure out what you run.
Once you know that, you can try e.g. shelling into an image or running something; it may be obvious from there, but if not you can add -d (as in e.g. apptainer -d shell docker://alpine).
Thanks for your reply, Pontus. I'm afraid I still don't understand what you mean by "images are unpacked rather than mounted". I did find the singularity exec command in the .command.run file for one of the failing jobs:
nxf_launch() {
set +u; env - PATH="$PATH" ${TMP:+SINGULARITYENV_TMP="$TMP"} ${TMPDIR:+SINGULARITYENV_TMPDIR="$TMPDIR"} ${NXF_TASK_WORKDIR:+SINGULARITYENV_NXF_TASK_WORKDIR="$NXF_TASK_WORKDIR"} SINGULARITYENV_NXF_DEBUG="${NXF_DEBUG:=0}" singularity exec --no-home --pid -B /ngc -B "$NXF_TASK_WORKDIR" /ngc/shared/nf_tools/containers/depot.galaxyproject.org-singularity-strelka-2.9.10--h9ee0642_1.img /bin/bash -c "cd $NXF_TASK_WORKDIR; eval $(nxf_container_env); /bin/bash /ngc/projects2/bio/people/andped/test/runs/19/work/61/15ed6016524b3407536d99be8821f5/.command.run nxf_trace"
}
Is it perhaps clear from that singularity-command whether images are unpacked or mounted?
Sorry, no, it's not visible from the command.
Did you have any beforeScript in place? Your system may have one version of singularity installed "directly" and another you get when you load a module.
If you replicate that, you can use the above but with apptainer substituted for singularity.
Yes, we're using a beforeScript which loads a module with singularity.
beforeScript = "module load ngctools jdk/21.0.1 singularity/4.0.2;"
On our login-node:
$ module load singularity/4.0.2
$ singularity -d shell /ngc/shared/nf_tools/containers/depot.galaxyproject.org-singularity-strelka-2.9.10--h9ee0642_1.img
DEBUG [U=912100779,P=17093] persistentPreRun() Singularity version: 4.0.2
DEBUG [U=912100779,P=17093] persistentPreRun() Parsing configuration file /ngc/tools/dtutools/singularity/4.0.2/etc/singularity/singularity.conf
...
DEBUG [U=912100779,P=19412] sylogBuiltin() Running action command shell
DEBUG [U=912100779,P=17093] PostStartProcess() Post start process
Singularity>
That command prints a lot of debug-statements to stdout. Not sure what I'm looking for, but it contains the string "mount" 359 times, while it doesn't contain the strings "pack", "compr" or "zip" 😁 (I guess I could attach the full log here - if need be.)
Ok, good. The question then is if you have a line like
DEBUG [U=1000,P=2992335] Mount() Executing FUSE mount command: "/usr/local/singularity/no-osuid/libexec/singularity/bin/squashfuse_ll -..."
or similar, the "Executing FUSE" bit is key there (also, it shouldn't fail).
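A quick check along those lines (the image path is illustrative, and the debug output goes to stderr):
singularity -d exec /path/to/image.img true 2>&1 | grep -c 'Executing FUSE mount'
# A count of 0 suggests the runtime is falling back to unpacking the image.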
Again, thanks for your prompt reply, Pontus. No occurrences of "Executing" or "FUSE". Is that good or bad? 😁
Ok, that very likely means it needs to do it the bad way (unpack the image); it might look something like this:
DEBUG [U=1000,P=1] StartProcess() Setting umask in container to 0022
DEBUG [U=1000,P=1] sylogBuiltin() Running action command exec
DEBUG [U=1000,P=3042218] PostStartProcess() Post start process
Parallel unsquashfs: Using 12 processors
444 inodes (152 blocks) to write
[===============================================================|] 596/596 100%!(MISSING)
created 104 files
created 100 directories
created 340 symlinks
created 0 devices
created 0 fifos
created 0 sockets
created 0 hardlinks
DEBUG [U=1000,P=3042218] Master() Child exited with exit status 0
DEBUG [U=1000,P=3042145] extract() *** END WRAPPED UNSQUASHFS OUTPUT ***
They may be difficult for other reasons, but here is something you can try: install it yourself. I haven't tested a pre-built user installation of singularity in a while, but apptainer definitely has a user-installable setup that does the right thing on most systems, as sketched below.
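A sketch of such a user install (the script lives in the apptainer repository; verify the invocation against the current apptainer admin docs, and the install prefix is illustrative):
# Unprivileged apptainer install into ~/apptainer.
curl -s https://raw.githubusercontent.com/apptainer/apptainer/main/tools/install-unprivileged.sh \
    | bash -s - ~/apptainer
# Then verify that images get FUSE-mounted rather than unpacked:
~/apptainer/bin/apptainer -d shell docker://alpine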
Parallel unsquashfs
Indeed, I got this:
DEBUG [U=912100779,P=1] StartProcess() Setting umask in container to 0022
DEBUG [U=912100779,P=1] sylogBuiltin() Running action command exec
DEBUG [U=912100779,P=17316] PostStartProcess() Post start process
Parallel unsquashfs: Using 40 processors
7217 inodes (8341 blocks) to write
[=============================================================\] 8341/8341 100%!(MISSING)
created 5654 files
created 346 directories
created 1189 symlinks
created 0 devices
created 0 fifos
DEBUG [U=912100779,P=17316] Master() Child exited with exit status 0
I'm afraid singularity v4.0.2 is the most recent version we have available. I'll try to have our admin install a newer version of singularity. Should I also try to have him install apptainer?
Either should be fine if installed properly (and for extra clarity, this is also a user-serviceable part :)
Description of the bug
Running the Sarek pipeline with Singularity occasionally results in the following error: bwa: error while loading shared libraries. This happens when bwa is invoked to get the version information. The error string is then included instead of the version number in the intermediate versions.yml file generated by the module, leading to a corrupt YAML file (missing quotes around a string containing :). Consequently, the Nextflow job that combines all these versions.yml files into one fails. Note that this error occurred after 40 samples.
Command used and terminal output
Relevant files
No response
System information
No response