nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

Check that GZI index provided if using a gzipped FASTA #1741

Open pontushojer opened 3 days ago

pontushojer commented 3 days ago

Description of feature

I started a sarek run (v3.4.2) with a custom reference supplied as a bgzipped FASTA. The run started normally from FASTQs and hit no errors until the MarkDuplicates step. I had forgotten to copy the index file (*.fasta.gz.gzi) into the same folder as the FASTA, which caused the step to fail just before finishing 🤦. See the error message below.

  [Thu Nov 28 20:40:47 GMT 2024] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 134.77 minutes.
  Runtime.totalMemory()=285212672
  [E::bgzf_index_load] Error opening GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz.gzi : No such file or directory
  [E::bgzf_open_ref] Unable to load .gzi index 'GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz.gzi'
  [E::refs_load_fai] Failed to open reference file 'GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz'
  [E::hts_open_format] Failed to open file "OPM2.md.cram" : Invalid argument
  samtools view: failed to open "OPM2.md.cram" for writing: Invalid argument

This is the relevant part of my parameter file

fasta: /proj/ngi2016004/private/strategic_proj/SR_23_02_Element_vs_Illumina/resources/GRCh38_GIABv3/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz
fasta_fai: /proj/ngi2016004/private/strategic_proj/SR_23_02_Element_vs_Illumina/resources/GRCh38_GIABv3/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz.fai
igenomes_ignore: true
bwa: /proj/ngi2016004/private/strategic_proj/SR_23_02_Element_vs_Illumina/resources/GRCh38_GIABv3/BWAIndex

It would be great to have a parameter check at startup: when a bgzipped FASTA (*.fasta.gz) is provided, the pipeline should verify that the corresponding index (*.fasta.gz.gzi) is present.
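To illustrate the kind of check meant here, a minimal sketch in Python (not sarek's actual validation code, which is written in Nextflow). The function name `missing_gzi_index` is hypothetical, and the sketch assumes a `.gz` suffix means the FASTA is bgzipped and that the index sits next to it as `<fasta>.gzi`, which is where the error message above shows it being looked for.

```python
from pathlib import Path


def missing_gzi_index(fasta):
    """Return the expected .gzi path if it is missing, otherwise None.

    Hypothetical helper: assumes a '.gz' suffix means the FASTA is bgzipped
    and that the index is expected at '<fasta>.gzi' in the same folder.
    """
    fasta = Path(fasta)
    if fasta.suffix != ".gz":
        # Uncompressed FASTA: no .gzi index is needed.
        return None
    gzi = Path(f"{fasta}.gzi")
    return None if gzi.exists() else gzi


if __name__ == "__main__":
    # Placeholder path for illustration only.
    missing = missing_gzi_index("reference.fasta.gz")
    if missing is not None:
        print(f"Missing index for bgzipped FASTA: {missing}")
```

Running a check like this before any process starts would surface the problem in seconds rather than after the two hours or so that MarkDuplicates ran in the log above.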

Some other considerations if this is hard to implement:

Edit: forgot to add info about sarek version
Edit2: gzip --> bgzip

pontushojer commented 3 days ago

An update on this: the .gzi is now in the folder with the bgzipped FASTA reference, but I still run into this error. It seems to be samtools specifically that requires the .gzi file when converting the output to CRAM; see the related issue: https://github.com/samtools/samtools/issues/804.

Looking at the relevant code (linked below), it seems that the .gzi index is not included in the work folder, which causes the issue.

https://github.com/nf-core/sarek/blob/22c7315e9c9ccccf7658e9f18e36f99cd67ebfb9/modules/nf-core/gatk4/markduplicates/main.nf#L10C1-L14C1