Open jeremycfd opened 7 years ago
This gets a little tricky since our current working assumption is that the analyzed sequences are not complete (they never have all the V region necessary to build a full receptor, for example). Is the problem with this one that we can't even reconstruct a complete CDR3? I would have thought that would cause a parse error ("seq_too_short" or something) and lead to that sequence being discarded. What CDR3 is assigned in this case?
CATDKAGGLSDIQNP
tgtgctactgataaggctggaggactaagtgacatccagaaccca
Ahhh. I see, it's not truncated, it's that there's extra junky sequence there. Huh. Yeah, we could add some logic to catch cases like that.
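As a rough illustration of what such logic might look like, here is a minimal sketch that translates the assigned CDR3 and applies a few boundary checks. It assumes the CDR3 is reported junction-style, starting at the conserved Cys and ending at the conserved Phe or Trp of the J region. The function names and problem labels (`cdr3_problems`, `no_trailing_F_or_W`, and so on) are hypothetical and are not part of TCRdist. Some J genes have non-canonical residues, so a check like this would be a heuristic flag rather than a hard filter.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate complete codons of nt; unknown codons become 'X'."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def cdr3_problems(cdr3_nt):
    """Return (amino-acid CDR3, list of hypothetical problem labels)."""
    problems = []
    if len(cdr3_nt) % 3 != 0:
        problems.append("out_of_frame")
    aa = translate(cdr3_nt)
    if "*" in aa:
        problems.append("stop_codon")
    if not aa.startswith("C"):
        problems.append("no_leading_C")
    # Assumption: a complete CDR3 ends at the conserved J-region F or W.
    if not aa.endswith(("F", "W")):
        problems.append("no_trailing_F_or_W")
    return aa, problems


# The CDR3 nucleotide sequence assigned in this thread.
example = "tgtgctactgataaggctggaggactaagtgacatccagaaccca"
aa, problems = cdr3_problems(example)
print(aa, problems)
```

On the CDR3 from this issue, the sequence is in frame and has no stop codon, so the only thing this sketch would flag is the missing terminal F/W.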
It is currently possible for sequences that are missing required parts of certain gene segments to evade the parts of our pipeline that filter out-of-frame TCRs. For instance, the read

AAGGCCCTGCCCAGCTAATCTTAATACGTTCAAATGAGCGAGAGAAGCGCAGTGGAAGACTCAGAGCCACCCTTGACACTTCCAGCCAGAGCAGCTCCCTGTCCATCACTGGTACTCTAGCTACAGACACTGCTGTGTACTTCTGTGCTACTGATAAGGCTGGAGGACTAAGTGACATCCAGAACCCAGAACCTGCTGTGTACTGACACCCCAGATCGGAAGAGCGTCGTGTAGGGAAAGAG

produces what the pipeline considers a valid TCR, but it is clear that a portion of the J segment is missing. I do not currently believe that this is a widespread issue when the quality of the data is good. However, some bulk-sequencing workflows used to prepare data for TCRdist rely on assembly algorithms, and assembly can introduce errors that TCRdist should be able to flag as problematic during parsing. (This particular sequence was generated by MiXCR.)
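One way a parser could notice a missing J segment is to look for the conserved J-region motif (Phe or Trp, then Gly, any residue, Gly) in frame downstream of the CDR3's conserved Cys. The sketch below is only an assumption about how such a check might be written and does not describe how TCRdist parses reads. `j_motif_after_cdr3` is a hypothetical helper, and a J gene with a non-canonical motif would be falsely flagged.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumed conserved J-region motif: F or W, G, any residue, G.
J_MOTIF = re.compile(r"[FW]G.G")


def j_motif_after_cdr3(read_nt, cdr3_nt):
    """True if the J motif occurs in the CDR3 frame before the first stop codon."""
    read_nt, cdr3_nt = read_nt.upper(), cdr3_nt.upper()
    start = read_nt.find(cdr3_nt)
    if start < 0:
        return False  # the CDR3 is not present in this read at all
    protein = []
    # Translate from the conserved Cys, stopping at the first stop codon.
    for i in range(start, len(read_nt) - 2, 3):
        aa = CODON_TABLE.get(read_nt[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return J_MOTIF.search("".join(protein)) is not None


# The read and assigned CDR3 quoted in this issue.
read = (
    "AAGGCCCTGCCCAGCTAATCTTAATACGTTCAAATGAGCGAGAGAAGCGCAGTGGAAGACTCAGAGCCACCCTTGACACTTCCAGCCAGAGCAGCTCCCTGTCCATCACTGGTACTCTAGCTACAGACACTGCTGTGTACTTCTGTGCTACTGATAAGGCTGGAGGACTAAGTGACATCCAGAACCCAGAACCTGCTGTGTACTGACACCCCAGATCGGAAGAGCGTCGTGTAGGGAAAGAG"
)
cdr3 = "tgtgctactgataaggctggaggactaagtgacatccagaaccca"
has_j_motif = j_motif_after_cdr3(read, cdr3)
print(has_j_motif)
```

For the read in this issue, the in-frame translation from the Cys hits a stop codon shortly after the assigned CDR3 without ever encountering the motif, so this sketch would flag it.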