luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
299 stars 37 forks source link

Weird contig issue #220

Closed salpie closed 2 years ago

salpie commented 2 years ago

Hi there,

I ran Octopus in the cancer only mode and got this error:

terminate called after throwing an instance of 'octopus::BadRegionCompare'
  what():  BadRegionCompare: comparing regions on different contigs: :9717512-9717561 & chr1:9717512-9717561

i'm not sure what's happening as i've aligned and called mutations with the same reference and both have a chr present ( in the header and reads in the bam file) and in the whole .fa file.

bam file header snippet:


@HD VN:1.6  SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr10    LN:133797422
@SQ SN:chr11    LN:135086622
@SQ SN:chr11_KI270721v1_random  LN:100316
@SQ SN:chr12    LN:133275309
@SQ SN:chr13    LN:114364328
@SQ SN:chr14    LN:107043718

the offending region:


A00176:462:H23HNDMXY:1:1474:30092:5353  129 chr1    9717512 60  49M =   9717272 -241    ACAGCCCTGCTTCCCCGGCCCCCAGCCTTGCTCTGTGTCCCTGTGGTCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:1  MD:Z:29C19  MC:Z:49M    AS:i:44 XS:i:20 RG:Z:F51_1038_36_BP2_T4_S36
A00176:462:H23HNDMXY:1:1102:18105:7044  129 chr1    9717512 60  49M =   9717486 -27 ACAGCCCTGCTTCCCCGGCCCCCAGCCTTCCTCTGTGTCCCTGTGGTCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:2A46   MC:Z:37M12S AS:i:46 XS:i:19 RG:Z:F51_1038_36_BP2_T4_S36
A00176:462:H23HNDMXY:1:1218:16396:8625  163 chr1    9717512 60  49M =   9717546 83  ACAGCCCTGCTTCCCCGGCCCCCAGCCTTGCTCTGTGTCCCTGTGGTCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:1  MD:Z:29C19  MC:Z:49M    AS:i:44 XS:i:20 RG:Z:F51_1038_36_BP2_T4_S36
A00176:462:H23HNDMXY:1:2183:31629:4413  99  chr1    9717512 60  49M =   9717558 95  ACAGCCCTGCTTCCCCGGCCCCCAGCCTTCCTCTGTGTCCCTGTGGTCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:49 MC:Z:49M    AS:i:49 XS:i:19 RG:Z:F51_1038_36_BP2_T4_S36
A00176:462:H23HNDMXY:1:1463:25165:13166 163 chr1    9717512 60  49M =   9717570 107 ACAGCCCTGCTTCCCCGGCCCCCAGCCTTCCTCTGTGTCCCTGTGGTCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:49 MC:Z:49M    AS:i:49 XS:i:19 RG:Z:F51_1038_36_BP2_T4_S36
A00176:462:H23HNDMXY:2:1147:22453:33207 163 chr1    9717512 60  49M =   9717607 144 ACAGCCCTGCTTCCCCGGCCCCCAGCCTTGCTCTGTGTCCCTGTGGTCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:1  MD:Z:29C19  MC:Z:49M    AS:i:44 XS:i:20 RG:Z:F51_1038_36_BP2_T4_S36
A00176:462:H23HNDMXY:1:1107:30418:17033 163 chr1    9717512 60  49M =   9717657 194 ACAGCCCTGCTTCCCCGGCCCCCAGCCTTGCTCTGTGTCCCTGTGGTCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:1  MD:Z:29C19  MC:Z:49M    AS:i:44 XS:i:20 RG:Z:F51_1038_36_BP2_T4_S36
A00176:462:H23HNDMXY:1:2201:19325:24565 99  chr1    9717513 60  49M =   9717541 77  CAGCCCTGCTTCCCCGGCCCCCAGCCTTCCTCTGTGTCCCTGTGGTCCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:49 MC:Z:49M    AS:i:49 XS:i:19 RG:Z:F51_1038_36_BP2_T4_S36

I'm using ## Version 0.7.4 and this was the command:

$octopus_0_7_4 -R ${refgenome} --threads 1 --annotations AF -I ${results_dir}/finalbams/${patient}.bam -C cancer -o ${results_dir}/octopus/${patient}.vcf

the reference used was GRCh38/ucsc/hg38.fa

dancooke commented 2 years ago

Thanks for the bug report. Can you try calling your data around this region (add -T chr1:9,713,165-9,721,858). If you get the same error, would you be able to provide a BAM subsetted at this region?

salpie commented 2 years ago

Thanks for the swift response. Sadly, same error. Here's an edited region (I've changed the bases and it's in sam not bam - hope that's okay) output.sam.zip )

dancooke commented 2 years ago

How was this data generated? The problem is caused by read pairs with duplicate names:

samtools view output.bam | grep "A00176:462:H23HNDMXY:2:2180:29324:1235"
A00176:462:H23HNDMXY:2:2180:29324:1235  97  chr1    9717513 60  49M chr12   107518128   0   AAGAAATGATTAAAAGGAAAAAAGAATTAATATGTGTAAATGTGGTAAA   FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFF   NM:i:0  MD:Z:49 MA:Z:49M    AS:i:49 XS:i:19 RG:Z:sample1
A00176:462:H23HNDMXY:2:2180:29324:1235  97  chr1    9717513 60  49M chr2    31183704    0   AAGAAATGATTAAAAGGAAAAAAGAATTAATATGTGTAAATGTGGTAAA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:49 MA:Z:49M    AS:i:49 XS:i:19 RG:Z:sample1

Read names in an Illumina library should be unique.

salpie commented 2 years ago

Ah yes, the duplicate read names was an issue we found in the fastqs to begin with (likely a genome centre bcl2fastq error) - and have been removing them, but it appears one escaped. Thank you for finding out what was wrong!