hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
189 stars 58 forks source link

SAGE error 'Cannot read file "TransStart' because 'x' is null on Ensembl homo_sapiens_core_111_38 #617

Closed ypradat closed 1 day ago

ypradat commented 2 days ago

Dear HMF team,

I am trying to run SAGE v3.3. on a pair of tumor-normal whole-exome sequenced. Below is my command

#!/bin/bash

# Ref fasta
fasta="../resources/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta"

# BED file with coordinates of known mutation hotspots
hotspots="../resources/hmf_dna_pipeline/v6_0/ref/38/variants/KnownHotspots.somatic.38.vcf.gz"

# BED file with coordinates of coding exons of actionable genes
actionable_coding="../resources/hmf_dna_pipeline/v6_0/ref/38/variants/ActionableCodingPanel.38.bed.gz"

# BED file with coordinates of GiAB high confidence regions for variant calling
high_confidence="../resources/hmf_dna_pipeline/v6_0/ref/38/variants/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed.gz"

# Ensembl cache
ensembl_cache="../resources/hmf_dna_pipeline/ensembl_cache/homo_sapiens_core_111_38"

# BAM files
ref="NORMAL"
tum="TUMOR"
ref_bam="../data/bam/deduplicated/${ref}.bam"
tum_bam="../data/bam/deduplicated/${tum}.bam"
out_vcf="${tum}_vs_${ref}.vcf.gz"
log="${tum}_vs_${ref}.log"

java -Xms4G -Xmx32G -cp sage_v3.3.jar com.hartwig.hmftools.sage.SageApplication \
    -threads 16 \
    -reference ${ref} -reference_bam ${ref_bam} \
    -tumor ${tum} -tumor_bam ${tum_bam} \
    -ref_genome_version 38 \
    -ref_genome ${fasta} \
    -hotspots ${hotspots} \
    -panel_bed ${actionable_coding} \
    -high_confidence_bed ${high_confidence} \
    -ensembl_data_dir ${ensembl_cache}/ \
    -out ${out_vcf} &> ${log}

and here is the log of the job

11:04:53.204 [INFO ] Sage version: 3.3
11:04:56.712 [INFO ] read 6195 panel entries from bed file: ../resources/hmf_dna_pipeline/v6_0/ref/38/variants/ActionableCodingPanel.38.bed.gz
11:04:57.130 [INFO ] read 9434 hotspots from vcf: ../resources/hmf_dna_pipeline/v6_0/ref/38/variants/KnownHotspots.somatic.38.vcf.gz
11:04:57.724 [INFO ] read 438100 high-confidence entries from bed file: ../resources/hmf_dna_pipeline/v6_0/ref/38/variants/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed.gz
11:04:57.907 [INFO ] writing to file: TUMOR_vs_NORMAL.vcf.gz
11:05:13.823 [INFO ] base quality recalibration cache generated
11:05:14.060 [INFO ] chromosome(chr1) executing 2490 regions
11:05:39.709 [INFO ] chromosome(chr1) analysis complete
11:05:39.871 [INFO ] chromosome(chr2) executing 2422 regions
11:05:59.872 [INFO ] chromosome(chr2) analysis complete
11:06:00.048 [INFO ] chromosome(chr3) executing 1983 regions
11:06:17.919 [INFO ] chromosome(chr3) analysis complete
11:06:18.126 [INFO ] chromosome(chr4) executing 1903 regions
11:06:31.947 [INFO ] chromosome(chr4) analysis complete
11:06:32.092 [INFO ] chromosome(chr5) executing 1816 regions
11:06:44.223 [INFO ] chromosome(chr5) analysis complete
11:06:44.380 [INFO ] chromosome(chr6) executing 1709 regions
11:07:00.434 [INFO ] chromosome(chr6) analysis complete
11:07:00.608 [INFO ] chromosome(chr7) executing 1594 regions
11:07:14.759 [INFO ] chromosome(chr7) analysis complete
11:07:14.903 [INFO ] chromosome(chr8) executing 1452 regions
11:07:25.951 [INFO ] chromosome(chr8) analysis complete
11:07:26.124 [INFO ] chromosome(chr9) executing 1384 regions
11:07:37.425 [INFO ] chromosome(chr9) analysis complete
11:07:37.581 [INFO ] chromosome(chr10) executing 1338 regions
11:07:49.636 [INFO ] chromosome(chr10) analysis complete
11:07:49.810 [INFO ] chromosome(chr11) executing 1351 regions
11:08:04.229 [INFO ] chromosome(chr11) analysis complete
11:08:04.420 [INFO ] chromosome(chr12) executing 1333 regions
11:08:17.085 [INFO ] chromosome(chr12) analysis complete
11:08:17.265 [INFO ] chromosome(chr13) executing 1144 regions
11:08:24.354 [INFO ] chromosome(chr13) analysis complete
11:08:24.545 [INFO ] chromosome(chr14) executing 1071 regions
11:08:24.706 [ERROR] thread execution error: java.lang.NullPointerException: Cannot read field "TransStart" because "x" is null
java.lang.NullPointerException: Cannot read field "TransStart" because "x" is null
    at com.hartwig.hmftools.sage.pipeline.RegionThread.lambda$createRegionTask$2(RegionThread.java:126)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
    at com.hartwig.hmftools.sage.pipeline.RegionThread.createRegionTask(RegionThread.java:126)
    at com.hartwig.hmftools.sage.pipeline.RegionThread.run(RegionThread.java:88)

Can you please advise on the cause of the error and, if possible, on a fix? I have already processed successfully the data from these two samples (not exactly the BAMs above but BAMs further processed to get BQSR via GATK) with Mutect2 and Strelka.

Best regards, Y. Pradat

ypradat commented 2 days ago

Running with SAGE v4.0 (beta) returned the exact same error as did running with the exact same BAM files (GATK-recalibrated) as I provided to Mutect2 and Strelka.

Best regards, Y. Pradat

ypradat commented 1 day ago

Going through the java code, I understood that the error comes from the loading of transcripts data in the Ensembl cache. As outlined in my first command, I used homo_sapiens_core_111_38. Resorting to homo_sapiens_core_104_38 as indicated in the README runs without error. I suppose there has been some recent change in the Ensembl transcripts data that creates a loading error for some particular gene on chromosome 14.

Best regards, Y. Pradat

charlesshale commented 1 day ago

That's correct - the gene AL121790.1 has no transcripts and has since been removed from our Ensemble data cache as used in the next pipeline release (v6.0). I have in the meantime changed Sage to skip this gene when loading transcript info. This fix will be out in the Sage 4.0, also part of pipeline v6.0