hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

TSO500 different reference #360

Closed davidvanzessen closed 1 year ago

davidvanzessen commented 1 year ago

Hello,

We're trying to run the panel tumor pipeline described here on our Illumina TSO500 data.

But after setting everything up, Amber immediately starts throwing this error when trying to load the sample BAM:

09:17:56.483 INFO  [main] - AMBER version: 3.9
09:17:56.485 INFO  [main] - Loading vcf file ./references//copy_number/GermlineHetPon.37.vcf.gz
09:17:59.351 INFO  [main] - loaded 1344545 baf loci
09:17:59.627 INFO  [main] - removed 0 blacklisted loci, 1344545 remaining
09:17:59.628 INFO  [main] - Processing 1344545 germline heterozygous loci in tumor bam ./sample_data/M22-3189T/M22-3189T.bam
09:17:59.628 INFO  [main] - Processing 0 germline homozygous loci in tumor bam ./sample_data/M22-3189T/M22-3189T.bam for contamination
09:18:00.869 INFO  [main] - 1344545 loci, 148821 genome regions, min gap = 4000
Exception in thread "worker-0" java.lang.IllegalArgumentException: Invalid reference index -1
    at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:538)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
    at com.hartwig.hmftools.amber.AsyncBamLociReader$BamReaderThread.run(AsyncBamLociReader.java:69)

The reference genome used in the official Illumina analysis pipeline differs from the one used in this pipeline: the contig names are different (which was easily fixed), but the length of the chrM/MT contig also differs.
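For anyone hitting the same error: the "Invalid reference index -1" comes from AMBER querying a contig name that is not present in the BAM header. A quick way to see the mismatch (a sketch using samtools and the paths from the log above) is to compare the contig names in the BAM header with those in the het PON:

samtools view -H sample_data/M22-3189T/M22-3189T.bam | grep '^@SQ'
zcat references/copy_number/GermlineHetPon.37.vcf.gz | grep -v '^#' | cut -f1 | uniq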

We thought of a couple of solutions: simply re-aligning to the pipeline's reference (slow), removing the mitochondrial reads and modifying the BAM header (could be great), or replacing the reference files in this pipeline (not sure whether the pipeline will accept this?).
We have spent some time trying to modify the BAM file, but so far tools keep tripping over the result, and it is something that is usually not recommended...
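Roughly, the second idea (drop the chrM reads and swap in an edited header) looks like the sketch below; edited_header.sam is a hypothetical header file with the chrM line removed/adjusted, and since tools kept tripping over the result this is shown only for context:

samtools index M22-3189T.bam
samtools view -b M22-3189T.bam $(seq -f 'chr%g' 1 22) chrX chrY > M22-3189T.noMT.bam
samtools reheader edited_header.sam M22-3189T.noMT.bam > M22-3189T.reheadered.bam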

Since the page specifically describes TSO500 as an example, we were wondering how this was done.
Platinum is mentioned on the page, so we assume the FASTQs were used as a starting point, which avoids the problems described above, but we wanted to double-check whether another solution already exists.

Thank you in advance, if you need any more info let us know.

p-priestley commented 1 year ago

We have run smoothly on TSO500 samples starting from FASTQ, after aligning to our own pipeline version of the genome, which is available from our resources page.

If you are starting from BAM, you should definitely use the same reference genome that the BAM was aligned to. I believe that should work, but I can't guarantee all configurations. If it does not work, I recommend just starting from FASTQ. Mixing reference genomes is a bad idea and can lead to many different problems.
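Roughly, starting from FASTQ means re-aligning against the HMF reference with bwa, along these lines (a sketch only; the thread count, read-group values and FASTQ file names are placeholders, and the usual duplicate marking etc. is omitted):

bwa index Homo_sapiens.GRCh37.GATK.illumina.fasta
bwa mem -t 8 -R '@RG\tID:M22-3189T\tSM:M22-3189T' Homo_sapiens.GRCh37.GATK.illumina.fasta M22-3189T_R1.fastq.gz M22-3189T_R2.fastq.gz | samtools sort -o M22-3189T.bam -
samtools index M22-3189T.bam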

davidvanzessen commented 1 year ago

Thank you for the quick response; we will replace the reference genome with ours and try the pipeline again.

davidvanzessen commented 1 year ago

Hello,

Thought it might be useful to let people know that the proposed solution worked, and to share what we did to get it to work.

Here is the Conda environment that this was tested in:

name: hmf_tso500
channels:
 - conda-forge
 - bioconda
dependencies:
 - openjdk=11.0.13
 - samtools
 - bwa=0.7.17
 - r-base=4.2.2
 - r-dplyr=1.0.10
 - bioconductor-copynumber=1.38.0
 - gsutil
 - picard=2.27.4
 - gatk4=4.3.0.0
 - gridss=2.13.2
 - circos

Probably not all of this is needed, but install it with:
conda/mamba env create -f environment.yml
conda activate hmf_tso500

Symlink the circos script in the conda env to the tools directory:
ln -s /path/to/conda/envs/hmf_tso500/bin/circos tools/circos

Circos was complaining about too many data points, so increase max_points_per_track in /path/to/conda/envs/hmf_tso500/etc/housekeeping.conf to a higher number; not sure if this is needed for every sample.
Maybe HMF can comment on this?
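For example, something like this bumps the limit (the new value here is arbitrary; pick whatever is high enough for your sample):

sed -i 's/^max_points_per_track.*/max_points_per_track = 75000/' /path/to/conda/envs/hmf_tso500/etc/housekeeping.conf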

Because the contig names in the TSO500 reference still differ from those the pipeline expects (chr1 instead of 1), just copy the fasta reference file from TSO500_RUO_LocalApp/resources/genomes/<genome>/genome.fa over the pipeline's reference fasta and rename the contigs.
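The copy itself is just the following (with <genome> left as the placeholder from the LocalApp resources directory):

cp TSO500_RUO_LocalApp/resources/genomes/<genome>/genome.fa references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta
# then strip the chr prefix so the names match the pipeline's 1/2/.../MT scheme: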
sed -i 's/>chr/>/g' references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta

From that fasta, create the rest of the reference files:

# dict
picard CreateSequenceDictionary -R references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta -O references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.dict
# fasta.dict
cp references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.dict references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta.dict

# fasta.fai
samtools faidx references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta

# fasta.img
# not sure this is needed if starting from BAM
gatk BwaMemIndexImageCreator -I references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta -O references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta.img

# fasta.ann
# fasta.amb
# fasta.bwt
# fasta.pac
# fasta.sa
bwa index references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta
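
As an optional sanity check (assuming the BAM header has already been renamed to the 1/2/.../MT scheme, i.e. the "easily fixed" step from the first post), the contig names and lengths of the new reference should now match the BAM header:

cut -f1,2 references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta.fai
samtools view -H sample_data/M22-3189T/M22-3189T.bam | grep '^@SQ'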

Hope this helps someone :)