Open Schmytzi opened 1 year ago
oh interesting observation, thanks for reporting. Will look into it, surprised since the file is not published and not passed over to the next processes, so nextflow does appear to correctly differentiate between dbsnp being provided as input and not add it to the output
Hey! Started looking into this for various processes and couldn't find anything suspicious. Can you send me the output of ls -lha
of some of these work directories where you observed it?
More precisely here for Haplotypecaller:
ls -lha work/e6/6233fc8de892bb48cc0a6bc96b2936/
total 60K
drwxr-xr-x 2 gitpod gitpod 4.0K Aug 4 22:12 .
drwxr-xr-x 3 gitpod gitpod 44 Aug 4 22:12 ..
lrwxrwxrwx 1 gitpod gitpod 73 Aug 4 22:12 chr22_2-15000.bed -> /workspace/sarek/work/4a/e713f5199bd3abd5a88219934a4d5f/chr22_2-15000.bed
-rw-r--r-- 1 gitpod gitpod 0 Aug 4 22:12 .command.begin
-rw-r--r-- 1 gitpod gitpod 5.2K Aug 4 22:12 .command.err
-rw-r--r-- 1 gitpod gitpod 5.2K Aug 4 22:12 .command.log
-rw-r--r-- 1 gitpod gitpod 0 Aug 4 22:12 .command.out
-rw-r--r-- 1 gitpod gitpod 12K Aug 4 22:12 .command.run
-rw-r--r-- 1 gitpod gitpod 541 Aug 4 22:12 .command.sh
-rw-r--r-- 1 gitpod root 214 Aug 4 22:12 .command.trace
lrwxrwxrwx 1 gitpod gitpod 120 Aug 4 22:12 dbsnp_146.hg38.vcf.gz -> /workspace/sarek/work/stage-83ded392-6c15-4dae-a7a9-26926ad39af0/75/2b5c5c28cec834e0067ab069b76dca/dbsnp_146.hg38.vcf.gz
lrwxrwxrwx 1 gitpod gitpod 81 Aug 4 22:12 dbsnp_146.hg38.vcf.gz.tbi -> /workspace/sarek/work/46/91252f6ac96ee77a6b0f7687140d88/dbsnp_146.hg38.vcf.gz.tbi
-rw-r--r-- 1 gitpod gitpod 1 Aug 4 22:12 .exitcode
lrwxrwxrwx 1 gitpod gitpod 67 Aug 4 22:12 genome.dict -> /workspace/sarek/work/3a/f373baf0e1d9e519dd448102435e3a/genome.dict
lrwxrwxrwx 1 gitpod gitpod 111 Aug 4 22:12 genome.fasta -> /workspace/sarek/work/stage-83ded392-6c15-4dae-a7a9-26926ad39af0/b4/6ac20aadb1586b8973c10197b3670d/genome.fasta
lrwxrwxrwx 1 gitpod gitpod 72 Aug 4 22:12 genome.fasta.fai -> /workspace/sarek/work/a0/3ea5f2bbbb2e0a05c362e387985584/genome.fasta.fai
lrwxrwxrwx 1 gitpod gitpod 75 Aug 4 22:12 test.converted.cram -> /workspace/sarek/work/de/859a2240e11944df82d582ceaec392/test.converted.cram
lrwxrwxrwx 1 gitpod gitpod 80 Aug 4 22:12 test.converted.cram.crai -> /workspace/sarek/work/de/859a2240e11944df82d582ceaec392/test.converted.cram.crai
-rw-r--r-- 1 gitpod root 4.3K Aug 4 22:12 test.haplotypecaller.chr22_2-15000.vcf.gz
-rw-r--r-- 1 gitpod root 105 Aug 4 22:12 test.haplotypecaller.chr22_2-15000.vcf.gz.tbi
-rw-r--r-- 1 gitpod root 132 Aug 4 22:12 versions.yml
Here is an example:
❯ ls -lha
total 24G
drwxrwsr-x 2 danies sens2022031 4.0K Aug 22 07:05 .
drwxrwsr-x 3 danies sens2022031 4.0K Aug 22 05:28 ..
-rw-rw-r-- 1 danies sens2022031 24G Aug 22 07:05 chm13v2.0_dbSNPv155.vcf.gz
-rw-rw-r-- 1 danies sens2022031 2.8M Aug 22 07:05 chm13v2.0_dbSNPv155.vcf.gz.tbi
-rw-rw-r-- 1 danies sens2022031 0 Aug 22 05:36 .command.begin
-rw-rw-r-- 1 danies sens2022031 63K Aug 22 07:02 .command.err
-rw-rw-r-- 1 danies sens2022031 63K Aug 22 07:02 .command.log
-rw-rw-r-- 1 danies sens2022031 0 Aug 22 07:02 .command.out
-rw-rw-r-- 1 danies sens2022031 13K Aug 22 05:28 .command.run
-rw-rw-r-- 1 danies sens2022031 577 Aug 22 05:28 .command.sh
-rw-rw-r-- 1 danies sens2022031 241 Aug 22 07:02 .command.trace
-rw-rw-r-- 1 danies sens2022031 1 Aug 22 07:05 .exitcode
-rw-rw-r-- 1 danies sens2022031 290M Aug 22 07:05 SweGen0999.haplotypecaller.chr8_1-146259331.g.vcf.gz
-rw-rw-r-- 1 danies sens2022031 253K Aug 22 07:05 SweGen0999.haplotypecaller.chr8_1-146259331.g.vcf.gz.tbi
-rw-rw-r-- 1 danies sens2022031 132 Aug 22 07:05 versions.yml
This is in joint germline calling mode. I might have missed to specify that explicitly. This was produced with sarek 3.2.0.
Also, the institutional configuration sets the scratch
directive for all processes to node-local storage, which might be the reason why the other input files are not visible here.
Description of the bug
I am running Sarek using a custom reference and dbSNP version on a SLURM cluster. Each invocation of HaplotypeCaller produces a copy of dbSNP in the process's working directory. This fills up my project space because each copy is 24 GB.
I explicitly set stageInMode to 'symlink' in my nextflow.config and the issue is still occuring. A look at the .command.run produced by Nextflow also shows that the file should be symlinked and not copied.
I am suspecting this happens because the dbSNP files match the
*.vcf.gz
and*.tbi
globs defined in the output of GATK4_HAPLOTYPECALLER and therefore, Nextflow stages them for output by copying them to the working directory.Command used and terminal output
Relevant files
custom config and .command.run from affected job: sarek_files.zip
System information