psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Read count tracking #221

Closed: Irrationone closed this issue 7 years ago

Irrationone commented 7 years ago

Hi Duncan,

When you get the time, could you look into whether read counts can be tracked for duplicate sequences, as per our previous discussion? I don't mean to rush you on this, but it is a significant issue for PCR-based TCR-seq experiments; the read counts would allow for some error correction.

Thanks!

psathyrella commented 7 years ago

oh, right, thanks for the reminder.

Irrationone commented 7 years ago

Is the collapse clones option in commit 8ad96085cc8acf51c10754a4a0f10af9686ac368 related to this?

Also, I suppose this may not be the case, but given that indels are no longer reversed in partis, could I just get counts by enumerating duplicate reads in the input FASTQ? Or does trimming at sequence ends/N-padding make this inaccurate?
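For reference, the naive counting asked about here can be sketched roughly as follows. This is only an illustration, not anything partis does: the function name `count_fastq_reads` is made up, the input is assumed to be a plain four-line-per-record FASTQ, and only exact string matches are counted. It does not account for any trimming or N-padding applied downstream, which is exactly the open question.

```python
from collections import Counter
from io import StringIO


def count_fastq_reads(handle):
    """Count exact-duplicate reads in a four-line-per-record FASTQ stream.

    Hypothetical helper for illustration; assumes sequences are not wrapped
    across multiple lines.
    """
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            counts[line.strip().upper()] += 1
    return counts


# Tiny in-memory FASTQ: two identical reads and one that differs by one base.
example = StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGA\n+\nIIII\n"
)
read_counts = count_fastq_reads(example)
```

Whether counts obtained this way line up with what partis considers identical depends on how it processes the sequences internally.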

psathyrella commented 7 years ago

uh, no: --dont-collapse-clones just refers to allele finding, where by default we collapse clones to get more independent mutations, and hence more accurate uncertainties.

This newer stuff is collapsing identical sequences purely for efficiency reasons. And indels are definitely still reversed internally. The change is that sequences that are identical after reversing indels are no longer treated as identical (because they're not biologically identical; they're only identical in that the sequence that goes through the HMM is identical).
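The distinction described above can be illustrated with a small sketch. It is not partis code: `collapse_identical`, the `(uid, raw_seq, hmm_seq)` record layout, and the output dictionary are all assumptions made for the example. The point it shows is that grouping is keyed on the raw input sequence, so two reads that only match after indel reversal stay separate.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (uid, raw_seq, hmm_seq) records by their raw input sequence.

    Hypothetical helper: hmm_seq stands in for the indel-reversed sequence
    and is deliberately ignored when deciding what counts as a duplicate.
    """
    groups = defaultdict(list)
    for uid, raw_seq, _hmm_seq in records:
        groups[raw_seq].append(uid)
    # The first uid seen for each sequence acts as the representative;
    # the remaining uids are recorded as its duplicates.
    return {
        uids[0]: {"seq": seq, "duplicates": uids[1:]}
        for seq, uids in groups.items()
    }


records = [
    ("a", "ACGTTT", "ACGTTT"),
    ("b", "ACGTTT", "ACGTTT"),   # exact copy of "a"
    ("c", "ACGGTTT", "ACGTTT"),  # matches "a" only after reversing an insertion
]
collapsed = collapse_identical(records)
```

Here "b" is folded into "a", while "c" keeps its own entry even though its indel-reversed sequence is the same. A per-sequence read count is then one plus the number of duplicates.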

Irrationone commented 7 years ago

I think what I'm looking for is actually the duplicates field in the partition output -- I didn't realize it was there.

psathyrella commented 7 years ago

hee hee, that's because I added it last week, and didn't tell anybody except the manual. Well, great, then.