ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
151 stars 17 forks source link

Possible issue with mate CIGAR string #18

Open TDDB-limagrain opened 2 years ago

TDDB-limagrain commented 2 years ago

Hi @ksahlin ! sorry for bothering you again!

I am trying to run GATK haplotypecaller on reads mapped with Strobealign. The Haplotypecaller ends with an error: java.lang.IllegalArgumentException: Sequence [VC HC @ Chr02:1-49670989 Q. of type=SYMBOLIC alleles=[C*, <NON_REF>] attr={END=49670989} GT=[[CB10 C*/C* GQ 0 DP 0 PL 0,0,0 {MIN_DP=0}]] filters= added out of order currentReferenceIndex: 0, referenceIndex:6 Besides I aligned the same reads with bwa mem and successfully ran the Haplotycaller. Therefore, I wondered whether something could have happened with the alignment formats.

I ran the Picard tool ValidateSamFile, giving me:

Strobealign:

ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE        341946
ERROR:MISMATCH_MATE_CIGAR_STRING        68204774

Bwa mem:

ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE        249950

I still need to dive into the details, especially to look for this kind of error: MISMATCH_MATE_CIGAR_STRING. That might also be a GATK-related issue... so I will keep you in touch with additional information.

(I finally succeeded in running Strelka on the same reads, thanks for adjusting alignment formatting! compared to array-based genotyping data, the combination strobealign+strelka performed very similarly to bwa mem + strelka, but of course at much higher speed for the mapping!).

TDDB-limagrain commented 2 years ago

here is an example of a MISMATCH_MATE_CIGAR_STRING:

output given by Picard ValidateSamFile:

ERROR::MISMATCH_MATE_CIGAR_STRING:Record 46, Read name A00604:215:HNL2GDSXY:2:2272:26576:36996, Mate CIGAR string does not match CIGAR string of mate
A00604:215:HNL2GDSXY:2:2272:26576:36996 99      Chr1    6006    0       147M4S  =       6010    147     CCCTATAATAGCTAAATATGCAAAAGCCACTTTTGTACAAGTAGATACACTTTTTTTGAGTAGACATAAGTTCTGTTTTGGGACCTAGAGAACGAATTTTCACGAGATTTTTTGCAGGGATCCTCATAAAAATGGTACAACATTGCCAGAT    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:F    AS:i:294        MQ:i:0  MC:Z:143=       ms:i:5107       MD:Z:147        NM:i:0     RG:Z:CB10
A00604:215:HNL2GDSXY:2:2272:26576:36996 147     Chr1    6010    0       143M    =       6006    -147    ATAATAGCTAAATATGCAAAAGCCACTTTTGTACAAGTAGATACACTTTTTTTGAGTAGACATAAGTTCTGTTTTGGGACCTAGAGAACGAATTTTCACGAGATTTTTTGCAGGGATCCTCATAAAAATGGTACAACATTGCC    FF,FFF,FFF,FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    AS:i:286        MQ:i:0  MC:Z:147=4S     ms:i:5551       MD:Z:143        NM:i:0  RG:Z:CB10
ksahlin commented 2 years ago

Hi @TDDB-limagrain,

As for the MISMATCH_MATE_CIGAR_STRING, perhaps it is because the cigar reports 147M4S while the mate cigar string field reports MC:Z:147=4S (i.e., M in one and =/X in the other)? I think this is because you have converted strobealigns =/X cigars to M after it has added the field MC:Z:. Perhaps one can perform the translation of =/X after the cigar is modified, so that you would get MC:Z:147M4S? (if you need M cigars for downstream callers?)

It would be interesting to see what an example of ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE is.

As for the error in haplotype caller, I do not know. Do you think you could post an issue with the developers? (even though it is likely some misreporting in strobealign). Maybe they know what will cause this error. Looks like there is an issue that has to do with sorting of some kind.

ksahlin commented 2 years ago

Which program do you use to add on the fields MC:Z: ms:i: etc? Perhaps the haplotype caller error is caused by some invalid tag added by this program, much like the added mate cigar? I may add these fields within strobealign instead in the next version to remove such downstream step.

TDDB-limagrain commented 2 years ago

I am piping those following commands:

strobealign -t 4 $reffasta $forward $reverse | \
        samtools fixmate -u -m - - | \
        samtools addreplacerg -u -r "@RG\tID:${line}\tSM:${line}\tLB:Solution\tPL:illumina\tPU:none" - - |
        samtools sort  -u -T bam/$line - | \
        samtools markdup -@2 --reference $reffasta --write-index --output-fmt cram,version=3.0 - bam/$line.sorted.cram

The new fields probably came from the mark duplicate step http://www.htslib.org/doc/samtools-markdup.html.

These are the same commands I piped to bwa-mem and here are the same reads:

A00604:215:HNL2GDSXY:2:2272:26576:36996 163     Chr11   56372617        0       143M    =       56372617        147     GGCAATGTTGTACCATTTTTATGAGGATCCCTGCAAAAAATCTCGTGAAAATTCGTTCTCTAGGTCCCAAAACAGAACTTATGTCTACTCAAAAAAAGTGTATCTACTTGTACAAAAGTGGCTTTTGCATATTTAGCTATTAT      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFF,FFF,FFF,FF AS:i:143    XS:i:143 MQ:i:0  MC:Z:4S147M     ms:i:5551       MD:Z:143        NM:i:0  RG:Z:CB10
A00604:215:HNL2GDSXY:2:2272:26576:36996 83      Chr11   56372617        0       4S147M  =       56372617        -147    ATCTGGCAATGTTGTACCATTTTTATGAGGATCCCTGCAAAAAATCTCGTGAAAATTCGTTCTCTAGGTCCCAAAACAGAACTTATGTCTACTCAAAAAAAGTGTATCTACTTGTACAAAAGTGGCTTTTGCATATTTAGCTATTATAGGG      F:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF      AS:i:147        XS:i:147        MQ:i:0  MC:Z:143M       ms:i:5107       MD:Z:147        NM:i:0  RG:Z:CB10

That looks quite weird they are located on different chromosome! Bwa mem mapped it on Chr11 while strobealign mapped it on Chr1.

I had a look at this issue on the GATK github https://github.com/broadinstitute/gatk/issues/7199#issue-856655058, but no response from the developer so far.

ksahlin commented 2 years ago

So as for the MISMATCH_MATE_CIGAR_STRING I believe this is simply the =/X cigars conversion to M - so no problem there (unless some caller gets hung up on this warning). Maybe I'll implement to support output of M in addition to X/= cigars.

As for the map location: Both mates within the read pair are mapped with the same 'fit' (m1 perfect match and m2 perfect except 4 softclipps). They even have the same insert distance (template length 147). This is likely because of a perfect repeat where the location is assigned "at random", so I also don't see a potential issue here.

TDDB-limagrain commented 2 years ago

regarding the ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE, Picard ValidateSamFile returns for example:

ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 2259, Read name A00604:215:HNL2GDSXY:2:1373:14136:8938, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped

with the following details from the strobealign-related bam file:

A00604:215:HNL2GDSXY:2:1373:14136:8938  1097    Chr1    7107    0       151M    =       7107    0       CACCAACCATGTCAAGATCAATTCCTTTTGATCCCAAACTTCTCCCACATCACTTTGAAGGCATTTGTGAGGATTGAGTCCAGCCACTTTTCCAAAACCATGATTTTCGTTTTTCACCAAAATCACCACCAAGCCACAAAATCAAGCCAAA      FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFF:FF AS:i:302    MC:Z:*   ms:i:5587       MD:Z:151        NM:i:0  RG:Z:CB10
A00604:215:HNL2GDSXY:2:1373:14136:8938  133     Chr1    7107    0       *       =       7107    0       GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF MQ:i:0  MC:Z:151=    ms:i:5478       RG:Z:CB10
TDDB-limagrain commented 2 years ago

Ok thanks for the anwser!

ksahlin commented 2 years ago

Ah, I see. This is also an error caused by the program adding the attributes to the bam file. For some reason, picard does not like that the cigar of the mateMC:Z:151= is added as a field/attribute of the unmapped read. No idea why this is undesired.

TDDB-limagrain commented 2 years ago

ok, seems that samtools command I add are introducing some confusing fields! I will try to troubleshoot the error I had with the Haplotypecaller because it seems to more a GATK related issue. Thanks!

marcelm commented 1 year ago

Hi, I’m going through some open strobealign issues and was wondering whether this one can be closed.

Note that in version 0.10.0, strobealign has switched to outputting M CIGAR operations by default instead of =/X, which may help here.