nf-core / rnavar

gatk4 RNA variant calling pipeline
https://nf-co.re/rnavar
MIT License

Genome files clashing #55

Open ojziff opened 2 years ago

ojziff commented 2 years ago

Description of the bug

GATK4_BEDTOINTERVALLIST is failing because Picard cannot find sequences named in the BED file in the sequence dictionary (dict):

 picard.PicardException: Sequence 'chr10_GL383545v1_alt' was not found in the sequence dictionary

This suggests the dict and BED files are inconsistent. Searching the rnavar Slack channel for chr10_GL383545v1_alt shows that 6 others have posted about this same issue. It stems from the usage docs not being clear about which reference files to use so that sequence names match between processes. For GRCh38, it would be very helpful if the docs suggested files (and where to access them) for each reference file the pipeline needs.

I understand from @maxulysse that the NCBI iGenomes files are advised, but I get the error above with them. Using iGenomes I also run into errors in STAR_ALIGN because the iGenomes STAR index was built with a different STAR version from the one the pipeline runs:

  EXITING because of FATAL ERROR: Genome version: 20201 is INCOMPATIBLE with running STAR version: 2.7.9a
  SOLUTION: please re-generate genome from scratch with running version of STAR, or with version: 2.7.4a

Should --igenomes_ignore true be made default?
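
For anyone hitting the same STAR error, a quick way to see which version an index was built with is to read it from the index directory. This is a minimal sketch, assuming the index contains a genomeParameters.txt file with a versionGenome line (the value STAR reports in the error above); the function name star_index_version is hypothetical.

```shell
#!/usr/bin/env bash
# Print the genome version recorded in a STAR index directory.
# Assumption: the index holds a genomeParameters.txt file with a
# whitespace-separated "versionGenome <value>" line.
star_index_version() {
    local index_dir="$1"
    awk '$1 == "versionGenome" { print $2 }' "$index_dir/genomeParameters.txt"
}
```

Calling star_index_version on the STARIndex directory and comparing the result with the output of `STAR --version` inside the pipeline container shows whether the index needs to be regenerated. The values do not have to be identical: the error message itself lists which index version the running STAR accepts.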

Command used and terminal output

nextflow run nf-core/rnavar \
--input samplesheet.csv \
--outdir rnavar \
-c rnavar.config \
--igenomes_ignore true \
--read_length 150 \
--email oliver.ziff@crick.ac.uk -profile crick -r dev

Output

executor >  slurm (6)
[28/8f289e] process > NFCORE_RNAVAR:RNAVAR:PREPARE_GENOME:GTF2BED (genes.gtf)                                [100%] 1 of 1, cached: 1 ✔
[1b/cf281c] process > NFCORE_RNAVAR:RNAVAR:PREPARE_GENOME:STAR_GENOMEGENERATE (genome.fa)                    [100%] 1 of 1, cached: 1 ✔
[90/5cf789] process > NFCORE_RNAVAR:RNAVAR:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)                   [100%] 1 of 1, cached: 1 ✔
[8a/a5098c] process > NFCORE_RNAVAR:RNAVAR:CAT_FASTQ (ctrl)                                                  [100%] 4 of 4, cached: 4 ✔
[63/bcf2f5] process > NFCORE_RNAVAR:RNAVAR:FASTQC (ctrl)                                                     [100%] 4 of 4, cached: 3 ✔
[aa/8266d8] process > NFCORE_RNAVAR:RNAVAR:GATK4_BEDTOINTERVALLIST (genome.bed)                              [100%] 1 of 1, failed: 1 ✘
[-        ] process > NFCORE_RNAVAR:RNAVAR:GATK4_INTERVALLISTTOOLS                                           -
[8f/8d506a] process > NFCORE_RNAVAR:RNAVAR:ALIGN_STAR:STAR_ALIGN (iso)                                       [  0%] 0 of 4
[-        ] process > NFCORE_RNAVAR:RNAVAR:ALIGN_STAR:BAM_SORT_SAMTOOLS:SAMTOOLS_SORT                        -
[-        ] process > NFCORE_RNAVAR:RNAVAR:ALIGN_STAR:BAM_SORT_SAMTOOLS:SAMTOOLS_INDEX                       -
[-        ] process > NFCORE_RNAVAR:RNAVAR:ALIGN_STAR:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:SAMTOOLS_STATS    -
[-        ] process > NFCORE_RNAVAR:RNAVAR:ALIGN_STAR:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:SAMTOOLS_FLAGSTAT -
[-        ] process > NFCORE_RNAVAR:RNAVAR:ALIGN_STAR:BAM_SORT_SAMTOOLS:BAM_STATS_SAMTOOLS:SAMTOOLS_IDXSTATS -
[-        ] process > NFCORE_RNAVAR:RNAVAR:MARKDUPLICATES:GATK4_MARKDUPLICATES                               -
[-        ] process > NFCORE_RNAVAR:RNAVAR:MARKDUPLICATES:SAMTOOLS_INDEX                                     -
[-        ] process > NFCORE_RNAVAR:RNAVAR:MARKDUPLICATES:BAM_STATS_SAMTOOLS:SAMTOOLS_STATS                  -
[-        ] process > NFCORE_RNAVAR:RNAVAR:MARKDUPLICATES:BAM_STATS_SAMTOOLS:SAMTOOLS_FLAGSTAT               -
[-        ] process > NFCORE_RNAVAR:RNAVAR:MARKDUPLICATES:BAM_STATS_SAMTOOLS:SAMTOOLS_IDXSTATS               -
[-        ] process > NFCORE_RNAVAR:RNAVAR:SPLITNCIGAR:GATK4_SPLITNCIGARREADS                                -
[-        ] process > NFCORE_RNAVAR:RNAVAR:SPLITNCIGAR:SAMTOOLS_MERGE                                        -
[-        ] process > NFCORE_RNAVAR:RNAVAR:SPLITNCIGAR:SAMTOOLS_INDEX                                        -
[-        ] process > NFCORE_RNAVAR:RNAVAR:GATK4_BASERECALIBRATOR                                            -
[-        ] process > NFCORE_RNAVAR:RNAVAR:RECALIBRATE:APPLYBQSR                                             -
[-        ] process > NFCORE_RNAVAR:RNAVAR:RECALIBRATE:SAMTOOLS_INDEX                                        -
[-        ] process > NFCORE_RNAVAR:RNAVAR:RECALIBRATE:SAMTOOLS_STATS                                        -
[-        ] process > NFCORE_RNAVAR:RNAVAR:GATK4_HAPLOTYPECALLER                                             -
[-        ] process > NFCORE_RNAVAR:RNAVAR:GATK4_MERGEVCFS                                                   -
[-        ] process > NFCORE_RNAVAR:RNAVAR:TABIX                                                             -
[-        ] process > NFCORE_RNAVAR:RNAVAR:GATK4_VARIANTFILTRATION                                           -
[-        ] process > NFCORE_RNAVAR:RNAVAR:CUSTOM_DUMPSOFTWAREVERSIONS                                       -
[-        ] process > NFCORE_RNAVAR:RNAVAR:MULTIQC                                                           -

Error executing process > 'NFCORE_RNAVAR:RNAVAR:GATK4_BEDTOINTERVALLIST (genome.bed)'

Caused by:
  Process `NFCORE_RNAVAR:RNAVAR:GATK4_BEDTOINTERVALLIST (genome.bed)` terminated with an error exit status (3)

Command executed:

  gatk --java-options "-Xmx36g" BedToIntervalList \
      --INPUT exome.bed \
      --OUTPUT genome.bed.interval_list \
      --SEQUENCE_DICTIONARY genome.dict \
      --TMP_DIR . \

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNAVAR:RNAVAR:GATK4_BEDTOINTERVALLIST":
      gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
  END_VERSIONS

Command exit status:
  3

Command output:
  (empty)

Command error:
  Using GATK jar /usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx36g -jar /usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar BedToIntervalList --INPUT exome.bed --OUTPUT genome.bed.interval_list --SEQUENCE_DICTIONARY genome.dict --TMP_DIR .
  07:56:36.815 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
  [Fri Jun 10 07:56:36 GMT 2022] BedToIntervalList --INPUT exome.bed --SEQUENCE_DICTIONARY genome.dict --OUTPUT genome.bed.interval_list --TMP_DIR . --SORT true --UNIQUE false --DROP_MISSING_CONTIGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
  [Fri Jun 10 07:56:37 GMT 2022] Executing as ziffo@ca010 on Linux 3.10.0-1160.62.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.9.1-internal+0-adhoc..src; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.2.6.1
  [Fri Jun 10 07:56:37 GMT 2022] picard.util.BedToIntervalList done. Elapsed time: 0.01 minutes.
  Runtime.totalMemory()=2147483648
  To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
  picard.PicardException: Sequence 'chr10_GL383545v1_alt' was not found in the sequence dictionary
        at picard.util.BedToIntervalList.doWork(BedToIntervalList.java:156)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
        at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

Work dir:
  /camp/project/proj-luscombn-patani/working/public-data/ipsc-mn-c9orf72-fus-catanese-2021/work/aa/8266d89bfd7f43fff0c0f9d1a1c763

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Relevant files

rnavar.config

params {
    bwa                   = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/'
    chr_dir               = "/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/Chromosomes"
    dbsnp                 = '/camp/lab/luscomben/home/users/ziffo/genomes/rnavar/Homo_sapiens_assembly38.dbsnp138.vcf'
    dbsnp_tbi             = '/camp/lab/luscomben/home/users/ziffo/genomes/rnavar/Homo_sapiens_assembly38.dbsnp138.vcf.idx'
    known_indels          = '/camp/lab/luscomben/home/users/ziffo/genomes/rnavar/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
    known_indels_tbi      = '/camp/lab/luscomben/home/users/ziffo/genomes/rnavar/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi'
    dict                  = "/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.dict"
    fasta                 = "/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
    fasta_fai             = "/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa.fai"
    snpeff_db             = 'GRCh38.99'
    snpeff_genome         = 'GRCh38'
    vep_cache_version     = '104'
    vep_genome            = 'GRCh38'
    vep_species           = 'homo_sapiens'
    bowtie2               = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/'
    star                  = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/'
    bismark               = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/'
    gtf                   = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf'
    bed12                 = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed'
}

System information

Nextflow version: 21.10.3
Hardware: HPC at Crick
Executor: slurm
Container: singularity
OS: Linux
Pipeline version: dev

ojziff commented 2 years ago

Adding --DROP_MISSING_CONTIGS TRUE to my config gets around the missing contigs error in GATK4_BEDTOINTERVALLIST:

process {
  withName:  'NFCORE_RNAVAR:RNAVAR:GATK4_BEDTOINTERVALLIST' {
    ext.args = '--DROP_MISSING_CONTIGS TRUE'
  }
}
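
Since --DROP_MISSING_CONTIGS silently discards intervals, it may be worth checking first which contigs would be dropped. The following is a minimal sketch, assuming a tab-separated BED file and a Picard-style dictionary with @SQ lines carrying SN: fields; the function name missing_contigs is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# List contig names used in a BED file that have no @SQ entry in a
# sequence dictionary. Each missing contig is printed once, in the
# order it first appears in the BED file.
missing_contigs() {
    local bed="$1" dict="$2"
    awk -F'\t' '
        # First file (the dictionary): record the SN: value of every @SQ line.
        FILENAME == ARGV[1] {
            if ($1 == "@SQ")
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^SN:/) known[substr($i, 4)] = 1
            next
        }
        # Second file (the BED): skip comment, track and browser lines.
        /^#/ || /^track/ || /^browser/ { next }
        # Report column 1 when it is not in the dictionary and not yet reported.
        !($1 in known) && !seen[$1]++ { print $1 }
    ' "$dict" "$bed"
}
```

For example, `missing_contigs genes.bed genome.dict` prints one contig name per line; an empty result means every BED contig is present in the dictionary.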