nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

Error while loading shared library for BWA in Sarek pipeline with Singularity #1575

Open Simonassaf opened 3 months ago

Simonassaf commented 3 months ago

Description of the bug

Running the Sarek pipeline with Singularity occasionally fails with: bwa: error while loading shared libraries. The failure happens when bwa is invoked a second time just to extract its version information, so the error message ends up in place of the version number in the intermediate versions.yml file generated by the module. Because the message contains a colon and is not quoted, the resulting YAML is invalid, and the downstream Nextflow job that combines all the versions.yml files into one fails. Note that this error occurred after 40 samples.
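
For reference, nf-core modules generate this file at the end of their script section with a heredoc, roughly like this (a sketch of the usual pattern, shown as plain bash after Nextflow has interpolated the process name; the exact filters vary per module):

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
    bwa: $(echo $(bwa 2>&1) | sed 's/^.*Version: //; s/Contig.*//')
    samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS

The sed expressions are pass-through filters: they strip the surrounding text when the expected pattern is present, but leave their input untouched otherwise. So when bwa fails to start, its whole error message lands in the value, and the unquoted : inside it is what breaks the YAML parser.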

Command used and terminal output

./nextflow run nf-core/sarek -r 3.4.2 -profile singularity --input ./input-sarek.csv --outdir /home/sarek-nf-results --genome null --igenomes_ignore --fasta ./GCF_002263795.3_ARS-UCD2.0_genomic.fna  --skip_tools baserecalibrator --tools deepvariant

[b9/3b06bd] process > NFCORE_SAREK:SAREK:FASTQC (zr8425-99-lane_99)                                                              [ 98%] 259 of 263, cached: 259
[d2/7b9351] process > NFCORE_SAREK:SAREK:FASTP (zr8425-595-lane_595)                                                             [ 56%] 149 of 263, cached: 149
[1f/2990a2] process > NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM (zr8425-105-lane_105)              [  2%] 40 of 1787
[-        ] process > NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM2_MEM                                    -
[-        ] process > NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:DRAGMAP_ALIGN                                  -
[-        ] process > NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:SENTIEON_BWAMEM                                -
[-        ] process > NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES                                                 -
[-        ] process > NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:CRAM_QC_MOSDEPTH_SAMTOOLS:SAMTOOLS_STATS                             -
[-        ] process > NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:CRAM_QC_MOSDEPTH_SAMTOOLS:MOSDEPTH                                   -
[-        ] process > NFCORE_SAREK:SAREK:CRAM_TO_BAM                                                                             -
[-        ] process > NFCORE_SAREK:SAREK:CRAM_SAMPLEQC:BAM_NGSCHECKMATE:BCFTOOLS_MPILEUP                                         -
Plus 9 more processes waiting for tasks…
ERROR ~ mapping values are not allowed here
 in 'reader', line 2, column 13:
        bwa: bwa: error while loading shared lib ...
                ^

Relevant files

No response

System information

No response

asp8200 commented 3 months ago

Discussion on nf-core/sarek Slack about a seemingly very similar issue:

https://nfcore.slack.com/archives/CGFUX04HZ/p1718984599684549

FriederikeHanssen commented 3 months ago

Hey! Something seems to be wrong in a weird way. I see that at least 40 tasks finished successfully, so it's not failing everywhere, but it borks in some cases. Are you able to retrieve which compute nodes the jobs ran on? Is there a particular node that has issues?

fbdtemme commented 3 months ago

I have been working with @Simonassaf on investigating this. There were two runs, both done on the head node of an SGE cluster, and both failed after exactly 40 samples.

As a workaround suggested by @asp8200 on Slack, I edited the pipeline to remove the call that extracts the version from bwa and hardcoded it instead. Weirdly enough, that seems to work. It is really strange that only the calls to get the version fail and not the actual alignment command itself.
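
A sketch of what that edit amounts to (the version numbers here are illustrative; use the ones shipped in the container):

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
    bwa: 0.7.18
    samtools: 1.20
END_VERSIONS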

We will do some test runs soon, both running everything on a single compute node and using the SGE executor, to see if this failure pops up there as well.

asp8200 commented 3 months ago

It is indeed very weird. I don't think we ever got to the bottom of the problem, but I'll ask my colleagues.

fbdtemme commented 3 months ago

When running the pipeline across multiple nodes using the SGE executor, this issue does not occur. So maybe it is triggered by running with Singularity on a single node?

asp8200 commented 3 weeks ago

I got that problem again. This time with Sarek/3.4.4 (test_full_germline with singularity):

ERROR ~ mapping values are not allowed here
 in 'reader', line 2, column 13:
        bwa: bwa: error while loading shared lib ...
                ^
 -- Check script '/ngc/tools/ngctools/sarek/3.4.4/assets/nf-core/sarek/./workflows/sarek/../../subworkflows/local/utils_nfcore_sarek_pipeline/../../nf-core/utils_nfcore_pipeline/main.nf' at line: 106 or see '.nextflow.log' file for more details

It used to be that CUSTOM_DUMPSOFTWAREVERSIONS would fail due to garbage (error messages) in some versions.yml files, and now I believe something similar is happening with the functionality that replaced CUSTOM_DUMPSOFTWAREVERSIONS.

Most of the versions.yml files for the BWA jobs look like this:

"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
    bwa: 0.7.18-r1243-dirty
    samtools: 1.20

which seems fine to me, but I found one that looks like this:

"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
    bwa: bwa: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory
    samtools: samtools: error while loading shared libraries: libhts.so.3: cannot open shared object file: No such file or directory

And that versions.yml file is from a BWA job which Nextflow claims finished successfully 😬

@maxulysse: I thought the new versions functionality was immune to those kinds of malformed versions.yml files?

asp8200 commented 3 weeks ago

@pontus and @adamrtalbot: I'm thinking this issue might also be of interest to you.

I'm slightly worried by the fact that the calls to bwa and samtools fail in the versions section of the bwa module while the corresponding bwa job still completes successfully (at least according to Nextflow). The pipeline only fails because the versions.yml file is malformed - does that mean that bwa and samtools could have somehow failed "silently" in other ways (not affecting the versions section but perhaps the alignment itself 🫣) in other runs?
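
For anyone who wants to spot-check finished alignments, something like samtools quickcheck should at least catch truncated or unreadable output (the results path below is just an assumption based on Sarek's default layout):

find results/preprocessing -name '*.bam' -o -name '*.cram' \
    | xargs -r samtools quickcheck -v \
    && echo "all alignments pass quickcheck"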

pontus commented 2 weeks ago

Can anyone trigger this at will? In that case it would be interesting to see. I recall some rather weird instances of this happening, but generally my first guess as to the cause would be a cluster where Singularity is unable to mount the image directly and instead unpacks it onto some file system that behaves oddly.

asp8200 commented 2 weeks ago

The error pops up once in a while on our HPC. Today it wasn't BWA but VEP that triggered the error:

ERROR ~ mapping values are not allowed here
in 'reader', line 2, column 21:
       ensemblvep: perl:
                       ^

-- Check script '/ngc/tools/ngctools/sarek/3.4.4/assets/nf-core/sarek/./workflows/sarek/../../subworkflows/local/utils_nfcore_sarek_pipeline/../../nf-core/utils_nfcore_pipeline/main.nf' at line: 106 or see '.nextflow.log' file for more details 

The versions.yml for VEP is malformed and looks like this:

"NFCORE_SAREK:SAREK:VCF_ANNOTATE_ALL:VCF_ANNOTATE_ENSEMBLVEP:ENSEMBLVEP_VEP":
    ensemblvep: perl:

I'm running test_full_germline on sarek/3.4.4, jdk/21.0.1, NF 24.04.3, and singularity/4.0.2.

We thought it might have something to do with too many jobs being submitted to PBS/Torque at the same time, but now it has occurred with queueSize = 5. This is the executor config:

executor {
    name = 'pbs'
    queueSize = 5
    pollInterval = '30 sec'
    submitRateLimit = '3 sec'
    exitReadTimeout = '600 sec'
}

asp8200 commented 2 weeks ago

The same job can produce a valid versions.yml when resubmitted with qsub .command.run 🤔

asp8200 commented 1 week ago

@pontus: The problem with error messages getting written into the versions.yml files goes away if I (1) hardcode the version numbers in the versions.yml-generating code in the modules, or (2) move the versions.yml-generating code to the very top of the script section of the modules.
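
Option (2) looks roughly like this inside the module's script block (a sketch shown as plain bash; the alignment command line is a placeholder):

# Workaround (2): write versions.yml first, before the alignment command.
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
    bwa: $(echo $(bwa 2>&1) | sed 's/^.*Version: //; s/Contig.*//')
    samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS

bwa mem -t 8 genome.fasta reads_1.fastq.gz reads_2.fastq.gz | samtools sort -o sample.bam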

I think this whole issue is due to instability on our HPC, but it would be nice if Sarek were somehow set up to handle such instability in a better way. If a malformed versions.yml is written, then the pipeline run does eventually fail, but it is the Nextflow job that crashes, with a message like

ERROR ~ mapping values are not allowed here
 in 'reader', line 2, column 13:
        bwa: bwa: error while loading shared lib ...

(I think the module-job itself should have crashed.)

Restarting the pipeline with -resume doesn't help; the pipeline fails with the same message, because Nextflow thinks the module was executed correctly and so doesn't re-run the module job.
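
One way to unblock a -resume is presumably to invalidate the cache entries of the offending tasks by deleting their work directories (a sketch, assuming the default work/ location):

# Find cached tasks whose versions.yml contains an error message instead of a
# version, then delete those work directories so -resume re-executes them.
grep -rl --include=versions.yml 'error while loading shared' work/ \
    | xargs -r -n1 dirname \
    | xargs -r rm -rf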

The instability on our HPC also results in a lot of errors like the following, which comes from the execution of the nf-core modules in Sarek:

samtools: error while loading shared libraries: libhts.so.3: cannot open shared object file: No such file or directory

Those kinds of errors cause the module job to fail, and so the module job gets retried. I'd like to see the same behaviour from Nextflow/Sarek when the error occurs during the execution of the versions.yml-generating snippet of a module.
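
Something like the following defensive pattern would do it (just a sketch, not what the modules do today):

# Capture the version first and fail the task if it can't be determined, instead
# of silently embedding the error text in versions.yml. The strict sed -n/p form
# prints nothing (rather than the error message) when bwa fails to start.
BWA_VERSION=$(bwa 2>&1 | sed -n 's/^Version: \(.*\)/\1/p')
if [ -z "$BWA_VERSION" ]; then
    echo "could not determine bwa version (shared-library problem?)" >&2
    exit 1
fi
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM1_MEM":
    bwa: ${BWA_VERSION}
END_VERSIONS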

pontus commented 1 week ago

It's likely the case that the version-handling code doesn't do the right thing vis-à-vis errors.

On the other hand, there's only so much that can be done to manage broken environments, and getting this fixed for all modules so that everything behaves seems non-trivial.

Instead, my recommendation would be to, as far as possible, make yourself suffer less when your cluster is unstable. I've already mentioned that this looks to me like a setup where the images are unpacked rather than mounted - is that the case? If so, I'd recommend fixing that (apptainer has an installer for a non-privileged version that will do the right thing on most current systems).

asp8200 commented 1 week ago

How do I tell if "images are unpacked rather than mounted"?

From our custom.config:

singularity {
    enabled = true
    autoMounts = true
}

I'm guessing that means the images are mounted, no?

pontus commented 1 week ago

No, that's not it.

Since I guess this is on HPC, the first thing is to figure out what you run.

Once you know that, you can try e.g. shelling into an image or running something; it may be obvious from there, but if not you can add -d (as in e.g. apptainer -d shell docker://alpine).

asp8200 commented 1 week ago

Thanks for your reply, Pontus. I'm afraid I still don't understand what you mean by "images are unpacked rather than mounted". I did find the singularity exec command in the .command.run file for one of the failing jobs:

nxf_launch() {
    set +u; env - PATH="$PATH" ${TMP:+SINGULARITYENV_TMP="$TMP"} ${TMPDIR:+SINGULARITYENV_TMPDIR="$TMPDIR"} ${NXF_TASK_WORKDIR:+SINGULARITYENV_NXF_TASK_WORKDIR="$NXF_TASK_WORKDIR"} SINGULARITYENV_NXF_DEBUG="${NXF_DEBUG:=0}" singularity exec --no-home --pid -B /ngc -B "$NXF_TASK_WORKDIR" /ngc/shared/nf_tools/containers/depot.galaxyproject.org-singularity-strelka-2.9.10--h9ee0642_1.img /bin/bash -c "cd $NXF_TASK_WORKDIR; eval $(nxf_container_env); /bin/bash /ngc/projects2/bio/people/andped/test/runs/19/work/61/15ed6016524b3407536d99be8821f5/.command.run nxf_trace"
}

Is it perhaps clear from that singularity command whether the images are unpacked or mounted?

pontus commented 1 week ago

Sorry, no, it's not visible from the command.

Did you have any beforeScript in place? Your system may have one version of singularity installed "directly" and another you get when you load a module.

If you replicate that, you can use the above but with apptainer substituted for singularity.

asp8200 commented 1 week ago

Yes, we're using a beforeScript which loads a module with singularity.

beforeScript = "module load ngctools jdk/21.0.1 singularity/4.0.2;"

On our login-node:

$ module load singularity/4.0.2
$ singularity -d shell /ngc/shared/nf_tools/containers/depot.galaxyproject.org-singularity-strelka-2.9.10--h9ee0642_1.img
DEBUG   [U=912100779,P=17093]persistentPreRun()            Singularity version: 4.0.2
DEBUG   [U=912100779,P=17093]persistentPreRun()            Parsing configuration file /ngc/tools/dtutools/singularity/4.0.2/etc/singularity/singularity.conf
...
DEBUG   [U=912100779,P=19412]sylogBuiltin()                Running action command shell
DEBUG   [U=912100779,P=17093]PostStartProcess()            Post start process
Singularity>

That command prints a lot of debug statements to stdout. Not sure what I'm looking for, but the output contains the string "mount" 359 times, while it doesn't contain the strings "pack", "compr" or "zip" 😁 (I guess I could attach the full log here if need be.)

pontus commented 1 week ago

Ok, good. The question then is if you have a line like

DEBUG   [U=1000,P=2992335] Mount()                       Executing FUSE mount command: "/usr/local/singularity/no-osuid/libexec/singularity/bin/squashfuse_ll -..." ...

or similar, the "Executing FUSE" bit is key there (also, it shouldn't fail).
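
A quick way to check (the image path is a placeholder, and this assumes the debug output goes to stderr):

# FUSE mount = good; unsquashfs = the image is being unpacked.
singularity -d exec /path/to/image.img true 2>&1 | grep -Ei 'executing fuse|unsquashfs'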

asp8200 commented 1 week ago

Again, thanks for your prompt reply, Pontus. No occurrences of "Executing" or "FUSE". Is that good or bad? 😁

pontus commented 1 week ago

Ok, that very likely means it needs to do it the bad way (unpack the image); it might look something like:

DEBUG   [U=1000,P=1]       StartProcess()                Setting umask in container to 0022
DEBUG   [U=1000,P=1]       sylogBuiltin()                Running action command exec
DEBUG   [U=1000,P=3042218] PostStartProcess()            Post start process
Parallel unsquashfs: Using 12 processors
444 inodes (152 blocks) to write

[===============================================================|] 596/596 100%!
(MISSING)
created 104 files
created 100 directories
created 340 symlinks
created 0 devices
created 0 fifos
created 0 sockets
created 0 hardlinks
DEBUG   [U=1000,P=3042218] Master()                      Child exited with exit status 0
DEBUG   [U=1000,P=3042145] extract()                     *** END WRAPPED UNSQUASHFS OUTPUT ***

pontus commented 1 week ago

They may be difficult for other reasons, but one thing you can try: install it yourself. I haven't tested a pre-built user installation of singularity in a while, but apptainer definitely has a user-installable setup that does the right thing on most systems.
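
If memory serves, the helper is tools/install-unprivileged.sh in the apptainer repository, invoked roughly like this (URL and usage from memory - verify against the apptainer admin docs before running):

curl -sSL https://raw.githubusercontent.com/apptainer/apptainer/main/tools/install-unprivileged.sh \
    | bash -s - ~/apptainer
export PATH="$HOME/apptainer/bin:$PATH"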

asp8200 commented 1 week ago

Parallel unsquashfs

Indeed, I got this:

DEBUG   [U=912100779,P=1]  StartProcess()                Setting umask in container to 0022
DEBUG   [U=912100779,P=1]  sylogBuiltin()                Running action command exec
DEBUG   [U=912100779,P=17316]PostStartProcess()            Post start process
Parallel unsquashfs: Using 40 processors
7217 inodes (8341 blocks) to write

[=============================================================\] 8341/8341 100%!
(MISSING)
created 5654 files
created 346 directories
created 1189 symlinks
created 0 devices
created 0 fifos
DEBUG   [U=912100779,P=17316]Master()                      Child exited with exit status 0

I'm afraid singularity v4.0.2 is the most recent version we have available. I'll try to get our admin to install a newer version of singularity. Should I also ask him to install apptainer?

pontus commented 1 week ago

Either should be fine if installed properly (and for extra clarity, this is also a user-serviceable part :)