lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
482 stars 133 forks source link

Wrong results when original reads contains soft clipped nucleotides #81

Open chadisaad opened 7 years ago

chadisaad commented 7 years ago

When the original reads contains already soft clipped sequences, final results are wrong, sometimes we have a shift of positions, and some times final reads are shorter then expected

lindenb commented 7 years ago

thanks for the report. Please, provide a minimal SAM file and a BED file

chadisaad commented 7 years ago

Please found the 2 SAM files corresponding to the 2 bugs:

jvarkit-bugs.zip

lindenb commented 7 years ago

thanks for the report, I'll try to fix this in the next days....

lindenb commented 7 years ago

@chadisaad I've changed a few things. Can you test it please ?

lindenb commented 7 years ago

@chadisaad ( later ) forget about it, the current code is till buggy, i need to work on this

lindenb commented 7 years ago

Ok, I've rewrittern some things https://github.com/lindenb/jvarkit/commit/50c95549a1a68ee34a9c9d168723172febd90cd2 . Can you please test this ?

chadisaad commented 7 years ago

thanks for fast reply. But I have another bug now, it says
[E::sam_parse1] CIGAR and query sequence are of different length [W::sam_read1] parse error at line 319 [main_samview] truncated file.

I think is due to one of this 3 sequences (lines 318 -> 320 in the SAM file):

IT5KJ:03103:08393       0       chr1    14724   1       4M1D16M1I171M   *       0       0       CCTGCTGTGGCTGCTGCGGTCGCCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGTCTTTTCCCAGAGATGCCTGGAGGGAA
AAGGCTGAGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCCCCGGAGACTTAAATACAGGAAGAAAAAGG        80////8/5/5;:5/-882----)-29555:5:;>4;57159::;::;<<;:::3:;<4<188:48707;2999/664:253333'33-35:990;<6::<=<9984991444'455*5599998889
89198396<7?;;1::1::+..5:5:>@;*855555&3166:;;588/54765(-(-----&-0        XA:Z:map4-1     ZA:i:193        ZB:i:30 ZC:B:i,317,316,1        MD:Z:0T3^C17G68C100     ZF:i:28 PG:Z:tmap       RG:Z:IT5KJ.IonXpress_012
        ZG:i:317        NM:i:5  ZM:B:s,308,-56,266,-40,-36,250,-22,244,260,104,26,-8,168,442,28,-38,200,-32,506,34,480,240,228,308,292,-2,-12,0,414,12,30,36,264,-14,64,168,8,88,374,18,288,84,-12,264,40,-16,88,34,
250,54,400,-24,-18,220,-24,-12,234,238,-68,8,278,-2,28,50,186,62,-46,252,-32,-6,282,480,212,114,56,368,18,0,416,74,2,30,490,6,-2,262,4,214,26,168,-40,234,102,64,458,4,-12,208,72,686,28,282,-16,164,226,-52,-32,368
,300,234,-8,12,270,214,24,32,204,34,250,254,8,-10,-50,-6,274,238,24,204,94,20,14,276,26,26,304,602,-12,206,-24,34,710,424,186,20,222,394,92,10,272,392,742,62,-2,60,32,0,864,116,128,52,228,86,350,28,14,218,526,100
,-18,100,28,38,272,172,800,244,-22,16,490,54,180,250,30,20,772,36,-12,-32,746,230,294,-22,130,212,120,204,102,204,66,46,202,-66,-26,168,110,96,376,96,242,-30,422,222,50,64,536,692,28,428,2,58,218,32,124,12,256,12
2,-44,340,14,242,24,266,194,-28,-66,218,382,600,64,18,290,12,414,130,380,18,508,100,236,636,10,586,584,-20,182,12,316,24,38,494,450,108,1010,80,80,64,-12,132,354,154,988,-22,128,176,424,272,54,156,246,222,52,12,-
8,78,448,2,142,4,366,700,0,116,138,272,266,26,76,10,-38,-8,350,348,52,328,92,8,198,1082,18,76,338,102,34,138,-30,96,246,182,60,26,198,354,154,174,4,242     ZP:B:f,0.00514424,0.00442025,8.16032e-06        AS:i:165
        XS:i:161
IT5KJ:04801:03706       0       chr1    14724   1       193M    *       0       0       GCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCTGGAGGGAAAAGGCTGAGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCCCCGGAGACTTAAATACAGGAAGAAAAAGGC       /,,,/48788488;=><>>9>>9>=;>?=??;<<<7<;;8<<<;::<<==<=B?9===8=9>==8=?9==9B==5<<=<7<8877.7:6;=8<<7<<6<<<<====5::8==?7===5=4;=?<<@>99949>8@8=8==?5::3:93:=8=8====1<<<<<<1<9>====9==677<=>6;666566'4.5       XA:Z:map4-1     ZA:i:193        ZB:i:30 ZC:B:i,317,316,1        MD:Z:0T192      ZF:i:25 PG:Z:tmap       RG:Z:IT5KJ.IonXpress_012        ZG:i:317
        NM:i:1  ZM:B:s,252,6,226,-24,-28,268,-38,256,258,82,78,-22,230,528,32,-30,212,-46,540,76,484,216,216,266,276,136,4,24,242,8,92,-50,178,32,-8,344,-12,30,502,12,220,92,10,288,-28,-8,-20,-4,200,116,504,-8,2,296,52,-2,246,254,-28,-32,270,-4,-2,2,280,2,-20,258,22,-8,236,484,244,50,24,462,0,18,280,20,32,0,520,-2,28,244,44,208,56,228,-2,252,0,-2,506,-26,12,246,40,746,-14,258,12,146,170,8,-14,440,246,246,-62,0,210,186,34,-6,164,30,238,234,54,12,-26,58,238,212,18,232,8,-16,34,272,0,24,230,674,-14,240,-14,-12,658,436,262,6,212,472,54,4,210,478,708,70,16,12,0,-10,884,106,78,0,254,-8,294,-2,48,186,546,38,34,32,20,-34,214,232,822,242,58,76,394,32,212,464,66,-18,588,-4,0,-32,784,250,252,-4,64,184,104,190,32,222,58,28,198,16,-14,178,82,66,374,68,162,-26,476,256,38,38,618,752,-6,472,-2,46,198,2,118,-20,240,90,-10,298,8,234,-14,338,230,-26,6,234,330,708,34,22,242,-2,452,12,394,12,520,116,262,648,124,556,546,-2,140,12,228,24,4,524,440,86,1026,110,82,24,-28,146,322,226,1100,18,98,118,464,216,62,192,226,212,8,118,-8,-10,472,24,144,30,382,670,-18,102,138,234,246,6,100,-42,-10,-20,362,392,2,380,50,8,210,1104,0,140,332,104,0,104,-10,56,212,200,70,-6,190,322,182,78,-4,218,492,186,-12,6,-6,12      ZP:B:f,0.0027851,0.00455457,0.000250536 AS:i:189        XS:i:185
IT5KJ:00240:06364       0       chr1    14725   1       91M6I11M1I11M1D58M1I20M *       0       0       CTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCAGGCCCCTCACCAGCCCCAGGTCCGGGAGGTTTCCAGCAGACTGCCTGGAGGGAAAGGCTGAGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCCCCGGAGACTTAAAATACAGGAAGAAAAAGGC <:;;9<<==9>@====<?9==<=>9=<<<<7<<<6<<<8=888<<<<<;;;==8===6<5;<<5;;7<<6<<<2<<<<8<<>@@6==798)..)..)..(.36...955533-39397;)11(1+15<<=<;:99276/6/5.588/::1:>7==8=5:999+;88888+94777::/777-6889828455555&448 XA:Z:map4-1     ZA:i:199        ZB:i:30 ZC:B:i,317,316,1        MD:Z:67G28C0A0G14^A78   ZF:i:28 PG:Z:tmap       RG:Z:IT5KJ.IonXpress_012        ZG:i:317        NM:i:13 ZM:B:s,296,-12,242,42,-2,240,20,244,284,-4,0,-4,212,502,10,10,254,-10,504,-36,518,216,154,250,244,8,32,32,272,26,22,8,224,4,40,276,-12,-36,474,-16,266,12,-2,290,18,16,14,20,254,10,464,-54,44,238,22,-4,232,226,34,-26,278,0,0,20,244,18,12,220,6,-30,254,504,254,6,12,460,2,12,250,32,-30,-4,488,24,50,254,58,210,-36,202,-34,258,-22,-18,532,-8,-24,212,12,744,-38,264,-28,90,230,12,-2,420,238,236,12,56,160,198,94,20,194,32,266,210,18,-14,4,0,232,284,16,264,0,-24,22,306,0,6,238,660,-28,202,6,24,672,496,270,20,180,462,36,40,154,490,92,184,2,424,-8,0,872,116,72,50,220,20,250,2,42,208,514,28,4,86,68,-2,222,278,946,234,-2,34,412,2,218,534,558,268,-68,540,674,68,590,206,298,130,72,220,56,340,94,232,174,-4,246,96,8,170,90,74,368,58,184,-12,414,240,0,20,604,668,-2,438,-38,60,212,0,54,10,234,66,0,306,-14,216,6,360,196,4,4,234,370,706,22,4,234,8,414,40,390,-22,510,24,284,598,28,564,604,2,186,26,224,22,-2,516,464,94,992,110,90,36,-2,152,304,232,1126,-4,-4,40,466,240,24,172,162,260,-40,76,16,10,476,-14,92,38,396,798,-8,112,172,226,234,-12,122,8,38,4,362,426,60,366,52,44,204,1068,-34,74,424,24,0,108,10,98,242,214,98,18,150,284,220,138,8,288,530,294,-12,58,62,36   ZP:B:f,0.00478342,0.00426417,3.21881e-05        AS:i:137        XS:i:133
lindenb commented 7 years ago

can you please send those problematic lines in attachment, I cannot copy+paste them without an error. Thanks.

chadisaad commented 7 years ago

Please find it in attachement. bug.sam.zip

lindenb commented 7 years ago

thanks. Works here after https://github.com/lindenb/jvarkit/commit/0efc15763059f91d61141473affa79fd04de8aba

[$ java -jar dist/pcrclipreads.jar -B input.bed bug.sam  | samtools view -S -out.bam -
[INFO][PcrClipReads]reading bed File jeter.bed
[INFO][SAMSequenceDictionaryProgress]done: N=3

Can you try again please?