samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

reverseComplement of SAMRecord does not inverse CigarString #1375

Closed RobinVanSchendel closed 5 years ago

RobinVanSchendel commented 5 years ago

When I use the function reverseComplement function of SAMRecord, the CigarString is not reversed.

OS: Windows Java: 1.8 htsjdk: latest trunk version 24-05-2019

for example:

M02948:128:000000000-BDPMB:1:1107:17802:6872 145 pCAMBIA3301_GUIDE 8629 60 73S78M 1 27553378 0 ATGGCCCAATAGGCAGCTAGCAAAAAACGGTAAATCTGCAAAAGCAACCTTGTATACATATATTTTTCTCAAAGATATATTGTGGTGTAAACAAATTGACGCTTAGACAACTTAATAACACATTGCGGACGTTTTTAATGTACTGAATTAA 0//0/FD1B1B01BF1GB1F@/FCF??/12121D1FHGHHF1F00EB/GHHGB22@22FFBB0B1D//21GF22D2GFGACB/GBBFDF0A2F2GE/?EEBFDFFDDBBFD2FBHDBGFA100EE?FFB1A11BGGCGGFFFFBBFA1AAA SA:Z:1,27553704,-,73M78S,60,1; MC:Z:151M MD:Z:78 PG:Z:MarkDuplicates RG:Z:A NM:i:0 AS:i:78 XS:i:0

becomes

M02948:128:000000000-BDPMB:1:1107:17802:6872 145 pCAMBIA3301_GUIDE 8629 60 73S78M 1 27553378 0 TTAATTCAGTACATTAAAAACGTCCGCAATGTGTTATTAAGTTGTCTAAGCGTCAATTTGTTTACACCACAATATATCTTTGAGAAAAATATATGTATACAAGGTTGCTTTTGCAGATTTACCGTTTTTTGCTAGCTGCCTATTGGGCCAT AAA1AFBBFFFFGGCGGB11A1BFF?EE001AFGBDHBF2DFBBDDFFDFBEE?/EG2F2A0FDFBBG/BCAGFG2D22FG12//D1B0BBFF22@22BGHHG/BE00F1FHHGHF1D12121/??FCF/@F1BG1FB10B1B1DF/0//0 SA:Z:1,27553704,-,73M78S,60,1; MC:Z:151M MD:Z:78 PG:Z:MarkDuplicates RG:Z:A NM:i:0 AS:i:78 XS:i:0

is that expected behaviour?

yfarjoun commented 5 years ago

revering the CIGAR doesn't really make sense (to me) since the alignment of the record is not the same once the record has been reverse complemented.....the typical usage of reverse complementing a record is when you align it or unalign it and in that case you'll either be removing the CIGAR altogether, or putting in the correct CIGAR.

Can you perhaps explain what your usecase is?

RobinVanSchendel commented 5 years ago

I wrote a script that detects insertions of exogenous DNA in genomes. However I always want the insertion to be displayed from the exogenous DNA side. Therefore I take the reverse complement of those reads if needed. Then I annotate the event further by showing the length of the inserted exogenous DNA and the junction between the exogenous DNA and the reference DNA. Therefore I need the length of the matched part of the Cigar. I specifically have reads that are of the form MS, but also SM. From the last ones I take the reverse complement of the SAMRecord, but then erreonously assumed the cigar is also reversed. If that is not supposed to happen that is also fine and I will do it in my code.

I still think the cigar should match the current DNA and now they become out of sync. Therefore I was wondering if this is expected behaviour.

yfarjoun commented 5 years ago

yeah...you'll have to reverse the cigar on your own...

jmarshall commented 5 years ago

Considering that the function does reverse or reverse-complement well-known or user-specified tag values, I'm surprised it does not also reverse CIGAR and flags.