Open TDDB-limagrain opened 2 years ago
here is an example of a MISMATCH_MATE_CIGAR_STRING:
output given by Picard ValidateSamFile:
ERROR::MISMATCH_MATE_CIGAR_STRING:Record 46, Read name A00604:215:HNL2GDSXY:2:2272:26576:36996, Mate CIGAR string does not match CIGAR string of mate
A00604:215:HNL2GDSXY:2:2272:26576:36996 99 Chr1 6006 0 147M4S = 6010 147 CCCTATAATAGCTAAATATGCAAAAGCCACTTTTGTACAAGTAGATACACTTTTTTTGAGTAGACATAAGTTCTGTTTTGGGACCTAGAGAACGAATTTTCACGAGATTTTTTGCAGGGATCCTCATAAAAATGGTACAACATTGCCAGAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:F AS:i:294 MQ:i:0 MC:Z:143= ms:i:5107 MD:Z:147 NM:i:0 RG:Z:CB10
A00604:215:HNL2GDSXY:2:2272:26576:36996 147 Chr1 6010 0 143M = 6006 -147 ATAATAGCTAAATATGCAAAAGCCACTTTTGTACAAGTAGATACACTTTTTTTGAGTAGACATAAGTTCTGTTTTGGGACCTAGAGAACGAATTTTCACGAGATTTTTTGCAGGGATCCTCATAAAAATGGTACAACATTGCC FF,FFF,FFF,FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:286 MQ:i:0 MC:Z:147=4S ms:i:5551 MD:Z:143 NM:i:0 RG:Z:CB10
Hi @TDDB-limagrain,
As for the MISMATCH_MATE_CIGAR_STRING
, perhaps it is because the cigar reports 147M4S
while the mate cigar string field reports MC:Z:147=4S
(i.e., M
in one and =/X
in the other)? I think this is because you have converted strobealigns =/X
cigars to M
after it has added the field MC:Z:
. Perhaps one can perform the translation of =/X
after the cigar is modified, so that you would get MC:Z:147M4S
? (if you need M cigars for downstream callers?)
It would be interesting to see what an example of ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE
is.
As for the error in haplotype caller, I do not know. Do you think you could post an issue with the developers? (even though it is likely some misreporting in strobealign). Maybe they know what will cause this error. Looks like there is an issue that has to do with sorting of some kind.
Which program do you use to add on the fields MC:Z:
ms:i:
etc? Perhaps the haplotype caller error is caused by some invalid tag added by this program, much like the added mate cigar? I may add these fields within strobealign instead in the next version to remove such downstream step.
I am piping those following commands:
strobealign -t 4 $reffasta $forward $reverse | \
samtools fixmate -u -m - - | \
samtools addreplacerg -u -r "@RG\tID:${line}\tSM:${line}\tLB:Solution\tPL:illumina\tPU:none" - - |
samtools sort -u -T bam/$line - | \
samtools markdup -@2 --reference $reffasta --write-index --output-fmt cram,version=3.0 - bam/$line.sorted.cram
The new fields probably came from the mark duplicate step http://www.htslib.org/doc/samtools-markdup.html.
These are the same commands I piped to bwa-mem and here are the same reads:
A00604:215:HNL2GDSXY:2:2272:26576:36996 163 Chr11 56372617 0 143M = 56372617 147 GGCAATGTTGTACCATTTTTATGAGGATCCCTGCAAAAAATCTCGTGAAAATTCGTTCTCTAGGTCCCAAAACAGAACTTATGTCTACTCAAAAAAAGTGTATCTACTTGTACAAAAGTGGCTTTTGCATATTTAGCTATTAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFF,FFF,FFF,FF AS:i:143 XS:i:143 MQ:i:0 MC:Z:4S147M ms:i:5551 MD:Z:143 NM:i:0 RG:Z:CB10
A00604:215:HNL2GDSXY:2:2272:26576:36996 83 Chr11 56372617 0 4S147M = 56372617 -147 ATCTGGCAATGTTGTACCATTTTTATGAGGATCCCTGCAAAAAATCTCGTGAAAATTCGTTCTCTAGGTCCCAAAACAGAACTTATGTCTACTCAAAAAAAGTGTATCTACTTGTACAAAAGTGGCTTTTGCATATTTAGCTATTATAGGG F:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:147 XS:i:147 MQ:i:0 MC:Z:143M ms:i:5107 MD:Z:147 NM:i:0 RG:Z:CB10
That looks quite weird they are located on different chromosome! Bwa mem mapped it on Chr11 while strobealign mapped it on Chr1.
I had a look at this issue on the GATK github https://github.com/broadinstitute/gatk/issues/7199#issue-856655058, but no response from the developer so far.
So as for the MISMATCH_MATE_CIGAR_STRING
I believe this is simply the =/X
cigars conversion to M
- so no problem there (unless some caller gets hung up on this warning). Maybe I'll implement to support output of M in addition to X/=
cigars.
As for the map location: Both mates within the read pair are mapped with the same 'fit' (m1 perfect match and m2 perfect except 4 softclipps). They even have the same insert distance (template length 147). This is likely because of a perfect repeat where the location is assigned "at random", so I also don't see a potential issue here.
regarding the ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE
, Picard ValidateSamFile returns for example:
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 2259, Read name A00604:215:HNL2GDSXY:2:1373:14136:8938, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
with the following details from the strobealign-related bam file:
A00604:215:HNL2GDSXY:2:1373:14136:8938 1097 Chr1 7107 0 151M = 7107 0 CACCAACCATGTCAAGATCAATTCCTTTTGATCCCAAACTTCTCCCACATCACTTTGAAGGCATTTGTGAGGATTGAGTCCAGCCACTTTTCCAAAACCATGATTTTCGTTTTTCACCAAAATCACCACCAAGCCACAAAATCAAGCCAAA FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFF:FF AS:i:302 MC:Z:* ms:i:5587 MD:Z:151 NM:i:0 RG:Z:CB10
A00604:215:HNL2GDSXY:2:1373:14136:8938 133 Chr1 7107 0 * = 7107 0 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF MQ:i:0 MC:Z:151= ms:i:5478 RG:Z:CB10
Ok thanks for the anwser!
Ah, I see. This is also an error caused by the program adding the attributes to the bam file. For some reason, picard does not like that the cigar of the mateMC:Z:151=
is added as a field/attribute of the unmapped read. No idea why this is undesired.
ok, seems that samtools command I add are introducing some confusing fields! I will try to troubleshoot the error I had with the Haplotypecaller because it seems to more a GATK related issue. Thanks!
Hi, I’m going through some open strobealign issues and was wondering whether this one can be closed.
Note that in version 0.10.0, strobealign has switched to outputting M
CIGAR operations by default instead of =
/X
, which may help here.
Hi @ksahlin ! sorry for bothering you again!
I am trying to run GATK haplotypecaller on reads mapped with Strobealign. The Haplotypecaller ends with an error:
java.lang.IllegalArgumentException: Sequence [VC HC @ Chr02:1-49670989 Q. of type=SYMBOLIC alleles=[C*, <NON_REF>] attr={END=49670989} GT=[[CB10 C*/C* GQ 0 DP 0 PL 0,0,0 {MIN_DP=0}]] filters= added out of order currentReferenceIndex: 0, referenceIndex:6
Besides I aligned the same reads with bwa mem and successfully ran the Haplotycaller. Therefore, I wondered whether something could have happened with the alignment formats.I ran the Picard tool ValidateSamFile, giving me:
Strobealign:
Bwa mem:
I still need to dive into the details, especially to look for this kind of error: MISMATCH_MATE_CIGAR_STRING. That might also be a GATK-related issue... so I will keep you in touch with additional information.
(I finally succeeded in running Strelka on the same reads, thanks for adjusting alignment formatting! compared to array-based genotyping data, the combination strobealign+strelka performed very similarly to bwa mem + strelka, but of course at much higher speed for the mapping!).