timoast / sinto

Tools for single-cell data processing
https://timoast.github.io/sinto/
MIT License

collapsing reads issue follow-up #56

Open rtyags opened 1 year ago

rtyags commented 1 year ago

Hi, please look at the following comment from a closed issue. Opening a new issue here since I haven't heard back from anyone (presumably because commenting on a closed issue doesn't automatically reopen it).

Thanks.

" As a follow up, looking at the code it seems to me that you use 20 as the threshold for this. i.e. if one end is the same, we allow the other end to be up to 20 bases away for it to still be considered a duplicate. Is that correct?

However, even in that case, I'm confused because I see multiple cases where the end is the same and the start is >20 bases away, but the fragments are still not counted separately (i.e., sinto considers them duplicates). For example, with the following 4 reads:

A00261:525:HK77VDSX3:1:1133:17969:2613 99 chrM 9947 60 150M = 10023 226 GGTTTGACTATTTCTGTATGTCTCCATCTATTGATGAGGGTCTTACTCTTTTAGTATAAATAGTACCGTTAACTTCCAATTAACTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATCAACACCCTCCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:150 AS:i:150 XS:i:34 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFFFFFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:TCGAATTG QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1

A00261:525:HK77VDSX3:1:1133:17969:2613 147 chrM 10023 60 150M = 9947 -226 CAATTAACTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATCAACACCCTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG :FFFFFFFFFFFFFFFF:FFFFFF:FFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:150 AS:i:150 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFFFFFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:TCGAATTG QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1

A00261:525:HK77VDSX3:1:1370:20518:3302 99 chrM 10092 60 81M = 10092 81 CTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:81 AS:i:81 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFF,FFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:CGAGTGAT QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1 TR:Z:CTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAGTGATATCTCGTATGCCGTCTTCTGCTTGAAA TQ:Z:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF

A00261:525:HK77VDSX3:1:1370:20518:3302 147 chrM 10092 60 81M = 10092 -81 CTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:81 AS:i:81 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFF,FFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:CGAGTGAT QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1 TR:Z:CTGTCTCTTATACACATCTGACGCTGCCGACGACAGACGCGACCCTCCTGAGCCTGTGTGTAGATCTCG TQ:Z:::FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

I would have expected the following two start,end pairs to be considered separate fragments:

9950 10167
10095 10167

but sinto actually only counts the second fragment here (i.e. 10095 10167), and ignores the first. What am I missing?
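To make the expectation concrete, here is a minimal Python sketch of how the two start,end pairs can be derived from the reads above, and how the rule as I understand it would treat them. This is not sinto's actual code. The function names `fragment_from_pair` and `duplicates_under_described_rule` are made up for illustration, the +4/-5 offsets on a 0-based start are an assumption that happens to reproduce the coordinates I quoted, and the 20-base tolerance is only my reading of the source. Both pairs carry the same cell barcode (CB:Z:AAAGCAAGTGGAAACG-1), so the barcode is left out of the comparison.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    start: int
    end: int


def fragment_from_pair(left_pos, right_pos, right_ref_len,
                       plus_shift=4, minus_shift=5):
    """Derive fragment coordinates from a read pair (hypothetical helper).

    left_pos / right_pos are the 1-based SAM POS values of the forward and
    reverse reads; right_ref_len is the reference length covered by the
    reverse read (150 for 150M, 81 for 81M). The shifts are assumptions.
    """
    start = (left_pos - 1) + plus_shift                    # 0-based start, shifted
    end = (right_pos - 1) + right_ref_len - minus_shift    # exclusive end, shifted
    return Fragment(start, end)


def duplicates_under_described_rule(a, b, max_distance=20):
    """Rule as I read it: one end identical, the other within max_distance."""
    same_end = a.end == b.end and abs(a.start - b.start) <= max_distance
    same_start = a.start == b.start and abs(a.end - b.end) <= max_distance
    return same_end or same_start


# First pair: POS 9947 (forward) and POS 10023, 150M (reverse)
first = fragment_from_pair(9947, 10023, 150)
# Second pair: POS 10092 (forward) and POS 10092, 81M (reverse)
second = fragment_from_pair(10092, 10092, 81)

print(first)                                   # Fragment(start=9950, end=10167)
print(second)                                  # Fragment(start=10095, end=10167)
print(abs(first.start - second.start))         # 145
print(duplicates_under_described_rule(first, second))  # False
```

Under these assumptions the two fragments share an end but their starts are 145 bases apart, so a 20-base tolerance alone would not collapse them, which is why I expected both to be kept.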

Thanks "

Originally posted by @rtyags in https://github.com/timoast/sinto/issues/48#issuecomment-1276728726