psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
55 stars 34 forks source link

fix ig-sw cigar query length mismatch #211

Closed psathyrella closed 6 years ago

psathyrella commented 8 years ago
>foop
TGCAGCCTCTGGTTTCTCTTTTATTTACGCCTGGATGCACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCGGCCGTATTAAAAGTAAAGCTGGTGGTGGGACAATAGACTACGCTGCACCCGTGGAAGACAGATTCACCATCTCAAGAGATGATTCAAAAAATACATCGCCGTGTCGTCGGCTCTCAAGTTGTCCATTTGCAGAAATAACATGTTCTTGGAATTGTCTCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGTGGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATCAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAGAGATGGGGGGGCGTATGATAGTAGTGGTTACTACTACTACTACTACGGTATGGACGTCTGGGGCCAAGGGACCAC

this sequence causes pysam to barf:

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 340

googling it yields a bunch of pysam bugfixes, so I'm leaning that direction, but still unsure.

dawahs commented 7 years ago

the same error obtained with the following sequence

>foopjr
TTGAATACGGGGGCTTCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGCCTCCATCAGTGGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGAGACTGGAGTGGATTGGGTATTTCTATTACACTGGGAGCACCGACTACAACCCCTCGCTCAAGAGTCGAGTCACCATATCGGTAGACACGTNCAAGAACCAGTTCTNCCTGAAGCTGAGNNCNNNNNNNNNGGNNNNNNNNNNNNNNNNNGANGNNNNNNNNNNNNNNNNNNNNNNNNNNNNACNCTGGGNNCACCGACTACAACCCCTCGCTCAAGAGTCGAGTCACCATATCGGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCNCTGTGACCGCNGCNGACACGGCNGTGTATTACTGTGCGAGTCAGTGGTTATTANAATTTGATGCTTTNGATATCTGGGGCCAAGGGACAATGGTCACCGTCTCTTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTGGCCGTTGGCTGCCTCGCACAGGACTTCNNNN
psathyrella commented 7 years ago

OK, time to fix this, it's turrning up a bunch in some new data. And I just realized I don't think it's a pysam problem: pysam complains the cigar and query sequence are different lengths. And it's right -- in the .sam output by ig-sw the cigar and and query are different lengths. OK, time to dig into ig-sw.

matsen commented 7 years ago

Huh. Is it worth trying it with ighutil?

On the bright side, there's a lot less code in ig-sw than in ighutil...

psathyrella commented 7 years ago

That might help narrow it down, but we want to stick with ig-sw in the long run, so we need to fix it there anyway. I started digging today but didn't really spend that long. When someone had this issue before, it was solved avoided by telling them to trim their hugely long (like 550 bases or something) reads, so it's probably/possibly some overflow of some sort.

matsen commented 7 years ago

Yeah, I was just thinking about narrowing down. I really think they are functionally equivalent though, but it could be worth checking.

psathyrella commented 7 years ago

lengths appear to be off by the time we get to here:

https://github.com/matsengrp/ig-sw/blob/master/src/ig_align/ig_align.c#L187

although they appear off in more cases than we actually crash. I think this is just because it's more likely to be off in crappy matches, which most of the time we don't use. Suspect problem is in actual alignment function:

https://github.com/matsengrp/ig-sw/blob/master/src/ig_align/ksw.c#L474

but hard to tell, as author seems to believe that single-character variable names save memory. ragesighragecomplain.

psathyrella commented 7 years ago

managed to set the alignment's cigar to null in cases where the cigar and query lengths don't match:

https://github.com/psathyrella/partis/commit/7486683dc1c9dd45cf5bbf4ebc68720d0259d09e#diff-4b469f759cf05065762d5ec03c727414R197

such that the matches will be skipped. This definitely doesn't solve the underlying problem, and it's worrisome how frequently ig-sw kicks out alignments with cigar and query sequences that aren't the same length. Nonetheless, this seems to work for now since it seems to only affect crappy matches.

relevant commits:

https://github.com/matsengrp/ig-sw/commit/20625155be082fb2840c0454b44703e9f6ef02d6 https://github.com/matsengrp/ig-sw/commit/9310c4a070ddf9cd25f8ded8156320229680251f https://github.com/matsengrp/ig-sw/commit/bf85797c7843a0c82f52317d10fd4d3c28f8cc75

psathyrella commented 7 years ago

oh *#@(!$ nvmd, worked fine in testing but in production this fix seg faults like a little *#@(!(!.

psathyrella commented 7 years ago

ok, this seems to be a functional short-term fix:

https://github.com/matsengrp/ig-sw/commit/eccfa2a13a019b4ab64f895326ee41e47379df26

psathyrella commented 7 years ago

maybe possibly actually semi-long-termedly avoided: https://github.com/psathyrella/partis/commit/53755f59c8167558ed928ec1f1979cc5583e5eae

still scary that it's kicking out cigars that aren't the same length as the read, but it seems to only happen with matches that are poor enough quality that we'll ignore them.

psathyrella commented 7 years ago

so the last round of changes apparently didn't fix this! damn. Example failure at end of comments for #225.

psathyrella commented 7 years ago
[ig_align] Error: match length mismatch for query 00000141uc score 640 target IGHV3-30*18: qe - qb = 441 - 4 = 437 != match_length - 1 = 423

    00000141uc   0    IGHV3-30*18   23   40   4S238M35I34M115I2M  TGGTGGGGAGACGTGTTCCAGCCTGGGAGATCCCTGATACTCTCCTGTAAAGCCTTTTAAATAAACTACATAGAATATTCTATACATTGGGTCCAACAAGATCCAGGCAATGGGCTATATTTGGGTACGGTTATATCATTAAATGGACAAAGTAAATACTACTCAGACCACGTGAAGGTACGATGCACCATATACGTCTACAATTCAGAGAACACACACTACCTGACATTCAACATAATTAGCGGAGTAACAGCGACTAGAACAGCCAAGGAAGAATGGAAGAGAAGGTGTAAGTTTAATACTATGATAGAAGAAATAAATAATACAGAAACTACGTGCAGAGCAGATTTACATTCTTCCTAGACAATTACATGAACACGATGGATGTGAAAATGAACAGCCTGAGATCCGGAGACACGTATGTGTATTAATGTGCGAGAGATCTGGGCAGCTCGTACGACTTTGACTAATGGGGCCAGGGAACACTGGTCACCGTCTCCTCAGCCTCCACC

another example from krdav