nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License
351 stars 387 forks source link

Sarek copies dbSNP for each invocation of HaplotypeCaller #1127

Open Schmytzi opened 1 year ago

Schmytzi commented 1 year ago

Description of the bug

I am running Sarek using a custom reference and dbSNP version on a SLURM cluster. Each invocation of HaplotypeCaller produces a copy of dbSNP in the process's working directory. This fills up my project space because each copy is 24 GB.

I explicitly set stageInMode to 'symlink' in my nextflow.config and the issue is still occuring. A look at the .command.run produced by Nextflow also shows that the file should be symlinked and not copied.

I am suspecting this happens because the dbSNP files match the *.vcf.gz and *.tbi globs defined in the output of GATK4_HAPLOTYPECALLER and therefore, Nextflow stages them for output by copying them to the working directory.

Command used and terminal output

nextflow run $NF_CORE_PIPELINES/sarek/3.1.1/workflow \
  -profile uppmax \
  -resume \
  -c nextflow.config \
  --project $PROJECT \
  --input samplesheet.csv \
  --fasta <long_path>/remapping_resources/reference/T2T-CHM13.fna \
  --dbsnp <long_path>/remapping_resources/dbSNP/chm13v2.0_dbSNPv155.vcf.gz \
  --step variant_calling \
  --tools snpeff,haplotypecaller \
  --save_reference \
  --igenomes_ignore \
  --joint_germline \
  --outdir out

Relevant files

custom config and .command.run from affected job: sarek_files.zip

System information

FriederikeHanssen commented 1 year ago

oh interesting observation, thanks for reporting. Will look into it, surprised since the file is not published and not passed over to the next processes, so nextflow does appear to correctly differentiate between dbsnp being provided as input and not add it to the output

FriederikeHanssen commented 11 months ago

Hey! Started looking into this for various processes and couldn't find anything suspicious. Can you send me the output of ls -lha of some of these work directories where you observed it?

FriederikeHanssen commented 11 months ago

More precisely here for Haplotypecaller:

 ls -lha work/e6/6233fc8de892bb48cc0a6bc96b2936/
total 60K
drwxr-xr-x 2 gitpod gitpod 4.0K Aug  4 22:12 .
drwxr-xr-x 3 gitpod gitpod   44 Aug  4 22:12 ..
lrwxrwxrwx 1 gitpod gitpod   73 Aug  4 22:12 chr22_2-15000.bed -> /workspace/sarek/work/4a/e713f5199bd3abd5a88219934a4d5f/chr22_2-15000.bed
-rw-r--r-- 1 gitpod gitpod    0 Aug  4 22:12 .command.begin
-rw-r--r-- 1 gitpod gitpod 5.2K Aug  4 22:12 .command.err
-rw-r--r-- 1 gitpod gitpod 5.2K Aug  4 22:12 .command.log
-rw-r--r-- 1 gitpod gitpod    0 Aug  4 22:12 .command.out
-rw-r--r-- 1 gitpod gitpod  12K Aug  4 22:12 .command.run
-rw-r--r-- 1 gitpod gitpod  541 Aug  4 22:12 .command.sh
-rw-r--r-- 1 gitpod root    214 Aug  4 22:12 .command.trace
lrwxrwxrwx 1 gitpod gitpod  120 Aug  4 22:12 dbsnp_146.hg38.vcf.gz -> /workspace/sarek/work/stage-83ded392-6c15-4dae-a7a9-26926ad39af0/75/2b5c5c28cec834e0067ab069b76dca/dbsnp_146.hg38.vcf.gz
lrwxrwxrwx 1 gitpod gitpod   81 Aug  4 22:12 dbsnp_146.hg38.vcf.gz.tbi -> /workspace/sarek/work/46/91252f6ac96ee77a6b0f7687140d88/dbsnp_146.hg38.vcf.gz.tbi
-rw-r--r-- 1 gitpod gitpod    1 Aug  4 22:12 .exitcode
lrwxrwxrwx 1 gitpod gitpod   67 Aug  4 22:12 genome.dict -> /workspace/sarek/work/3a/f373baf0e1d9e519dd448102435e3a/genome.dict
lrwxrwxrwx 1 gitpod gitpod  111 Aug  4 22:12 genome.fasta -> /workspace/sarek/work/stage-83ded392-6c15-4dae-a7a9-26926ad39af0/b4/6ac20aadb1586b8973c10197b3670d/genome.fasta
lrwxrwxrwx 1 gitpod gitpod   72 Aug  4 22:12 genome.fasta.fai -> /workspace/sarek/work/a0/3ea5f2bbbb2e0a05c362e387985584/genome.fasta.fai
lrwxrwxrwx 1 gitpod gitpod   75 Aug  4 22:12 test.converted.cram -> /workspace/sarek/work/de/859a2240e11944df82d582ceaec392/test.converted.cram
lrwxrwxrwx 1 gitpod gitpod   80 Aug  4 22:12 test.converted.cram.crai -> /workspace/sarek/work/de/859a2240e11944df82d582ceaec392/test.converted.cram.crai
-rw-r--r-- 1 gitpod root   4.3K Aug  4 22:12 test.haplotypecaller.chr22_2-15000.vcf.gz
-rw-r--r-- 1 gitpod root    105 Aug  4 22:12 test.haplotypecaller.chr22_2-15000.vcf.gz.tbi
-rw-r--r-- 1 gitpod root    132 Aug  4 22:12 versions.yml
Schmytzi commented 10 months ago

Here is an example:

❯ ls -lha
total 24G
drwxrwsr-x 2 danies sens2022031 4.0K Aug 22 07:05 .
drwxrwsr-x 3 danies sens2022031 4.0K Aug 22 05:28 ..
-rw-rw-r-- 1 danies sens2022031  24G Aug 22 07:05 chm13v2.0_dbSNPv155.vcf.gz
-rw-rw-r-- 1 danies sens2022031 2.8M Aug 22 07:05 chm13v2.0_dbSNPv155.vcf.gz.tbi
-rw-rw-r-- 1 danies sens2022031    0 Aug 22 05:36 .command.begin
-rw-rw-r-- 1 danies sens2022031  63K Aug 22 07:02 .command.err
-rw-rw-r-- 1 danies sens2022031  63K Aug 22 07:02 .command.log
-rw-rw-r-- 1 danies sens2022031    0 Aug 22 07:02 .command.out
-rw-rw-r-- 1 danies sens2022031  13K Aug 22 05:28 .command.run
-rw-rw-r-- 1 danies sens2022031  577 Aug 22 05:28 .command.sh
-rw-rw-r-- 1 danies sens2022031  241 Aug 22 07:02 .command.trace
-rw-rw-r-- 1 danies sens2022031    1 Aug 22 07:05 .exitcode
-rw-rw-r-- 1 danies sens2022031 290M Aug 22 07:05 SweGen0999.haplotypecaller.chr8_1-146259331.g.vcf.gz
-rw-rw-r-- 1 danies sens2022031 253K Aug 22 07:05 SweGen0999.haplotypecaller.chr8_1-146259331.g.vcf.gz.tbi
-rw-rw-r-- 1 danies sens2022031  132 Aug 22 07:05 versions.yml

This is in joint germline calling mode. I might have missed to specify that explicitly. This was produced with sarek 3.2.0.

Also, the institutional configuration sets the scratch directive for all processes to node-local storage, which might be the reason why the other input files are not visible here.