Big difference in length found in same partition

krdav commented 8 years ago

As you know I have been using partis for both annotation and partitioning (using vsearch heuristics) on big datasets. I incidentally found that some of the sequences found in the same partition vary quite a bit in length, see example below taken from a large (40,000 sequences) cluster:

>133467L2P0H0
CAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAGTGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>13346L1P0H0
CAGATGCAGCTGGTGCAGTCAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACGTACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAGTAACCATCAACCCAGCCACATCCAAGAACCAGTTCTCCCTGCAACTGAACTCTGTGACTCCCGAGGACACGGCTGTTTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>133470L1P0H0
CAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCACCAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>133472L3P0H0
CAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCACCAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCGTCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>133473L1P0H0
CAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCATCAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAACTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCACCCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTACAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>133474L1P0H0
CAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCATCAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>133475L1P0H0
CAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCGCCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAACTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAACTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>133476L1P0H0
CAGGTCCAGGACTGGTGAAGTCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>13347L2P0H0
CAGATGCAGCTGGTGCAGTCAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACGTACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAGTAACCATCACCCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAGGGCACCCTGGTCACCGTCTC
>13348L1P0H0
CAGATGCAGCTGGTGCAGTCAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACGTACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAGTAACCATCAGCCCAGACACATCCAAGAACCAGTTCTCCCTGCAACTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGGCTCGATAGGGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTC
>133493L1P0H0
CAGGTGCAGCTACAGCAGTCAAGCCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGCACTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTTGGAATGACATACTACAGGTCCAAGTGGTATAATGATTATGCACTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGCTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACGCGGCTGTGTATTACTGTGCAAGGCTCGATAGTGGGAGCCGGGACGGTATGGACGTCTGGGGCCAAGGGACAATGGTCACCGTCTC

I thought that after the germline recombination event this naive sequence could only undergo somatic hypermutations changing single positions and not creating indels. Am I missing out something here? Is there a step where a full codon is cut out during the repair?

So the question is if there is a biological reason why partis group these sequences together? or is it rather a technical detail and/or related to the vsearch method?

And also I should add that using partis is a real pleasure. I am consistently getting good results and the output is just what I need.

psathyrella commented 8 years ago

Awesome! Thanks for the kind words, and glad to hear it's being useful.

So a couple of things about insertions/deletions during somatic hypermutation ("shm indels"). The first thing partis runs, for all clustering and annotation, is smith waterman. This sw step has some reasonable heuristic thresholds (configurable with the gap-opening penalty option) that flag likely shm indels. Since handling shm indels in the hmm would be computationally infeasible, what the sw then does is "reverse" the indel by excising insertions/replacing deletions with the hypothesized germline sequence. The resulting "reversed" sequence is then passed to the clustering/annotation steps.

Here is where vsearch difffers. As part of its heuristics it (apparently -- I haven't read the docs) it allows indels within clusters. This is why in the shm indel validation plots in the paper:

cdr3-indels

vsearch clustering does really well when there's cdr3 indels. This of course isn't for "free", since in realistic data sets shm indels are very rare, and the fact that vsearch doesn't know about vdj rearrangement is a huge disadvantage.

Anyway, there's of course no way to know for "sure" if those are real clusters, but a good strategy would be to run full partis clustering on one or a few clusters that vsearch spits out to see if it splits them further. Or run seed partis, seeding with a sequence from the cluster but running on the whole data set (this does the full likelihood stuff, but it's vastly faster than full partitioning since it's not all quadratic-ey.

krdav commented 8 years ago

Interesting. So actually vsearch is performing better on the simulated dataset than the full partis method when it comes to shm indels.

psathyrella commented 8 years ago

Well, on this particular data set, yes, but not in any exploitable way. In real bcr data sets, indels within the cdr3 are many orders of magnitude more likely to be the result of non-templated insertions than shm, and this fact is build into the hmm. Since vsearch doesn't know this, it performs quite a bit worse on actual data sets. Here, in order to test the effectiveness of our indel handling algorithm, we create a data set with shm indels in half the sequences, all of them in the cdr3.

jotwin commented 8 years ago

I also did some clustering with --naive-vsearch and I found a lot of variation in sequence length and cdr3 length. But I don't see enough shm indels (from the indelfos column in the annotation output) to explain this.

psathyrella commented 8 years ago

Yeah, sorry, I should have made that clear -- there's no information transfer on indels between partis and vsearch (we just pass in the sequence). So vsearch's determination of indels is unrelated to the "indelfo" annotation column, except that presumably there's some correlation. In fact the only way we know that vsearch must be allowing indels is by that performance plot above. Well, presumably the documentation would tell us as well.

scharch commented 8 years ago

As a matter of biology, indels due to SHM are just as common as point mutations; however they are much more likely to get selected against in the GC due to frameshifts or destabilizing structural effects. As a matter of computation, I would qualify this statement just a little bit:

In real bcr data sets, indels within the cdr3 are many orders of magnitude more likely to be the result of non-templated insertions than shm, and this fact is build into the hmm.

Namely, it ought to be possible to restrict this assumption to outside of the assigned D segment (which I assume you already do) and to vary the strength of the assumption based on the overall level of SHM.

psathyrella commented 8 years ago

As a matter of biology, indels due to SHM are just as common as point mutations

That's super interesting, I didn't know that. In fact I think I've typically heard it claimed otherwise, but then maybe they were always referring to after selection.

But, I guess, as long as pre-selection sequences from the GC are experimentally inaccessible, in practice it'd be indistinguishable from the case where shm indels are rare.

It ought to be, yeah, but I'm not sure if we can in practice. The hmm checks every non-templated insertion length probabiistically, which is what I think you'd be suggesting we do for shm indels as well. The problem is that perhaps the biggest reason this hmm takes a few hundredths to a tenth of a second per sequence instead of, like 30 seconds as previous hmms, is that we "factorized" the v, d, and j parts (such that the v, d, and j probabilities [for each sub length] are independent of each other, which drastically reduces the number of states you have to iterate over. If we allow shm indels at arbitrary points within the germlines, we'd be back to straight quadratic dependence on the number of states in this regard.

Because of that, we treat shm indels entirely with the smith-waterman pre-alignment step. The downside of this is that we can't do anything nice and probabilistic to decide what's an indel and what isn't: there's just the gap-opening penalty (which is a command line arg). It would be possible to adjust this penalty with shm levels, as we do for the match-mismatch parameters (well, we kind of do, we really just rerun with different match:mismatch if we don't like the alignments, but this would be easy to change).

Somewhat orthogonally to that, I'm somewhat sceptical that even in principle we could do very well at finding shm indels within the D -- there's just so little D left after deletion in so many sequences. If there were a long shm indel within the D, it would a lot of the time be separated by only a few germline D bases from a VD or DJ non-templated insertion. I think the only handle we'd have for this is shared mutational information that's subsequent to shm indels, but this has always seemed to me like a very small (and complicated) handle that may not be too much help.

scharch commented 8 years ago

It's a relatively recent finding, based on newer techniques that make it possible to observe the unselected repertoire in reasonable detail. See Figure 3 in this paper, for instance... I agree, though, that it doesn't make any difference for partis.

I guess I was imagining a 2-step/iterative process, wherein you first assign the v/n/d/n/j boundaries as per currently and then can assess the location of the detected indel and make a determination that way. Probably too complicated, though, especially for a relatively marginal case.

psathyrella commented 8 years ago

ok, good to know, though. Sounds like it'd be a good thing to implement at some point/when someone has data in which they expect super high levels of shm indels.

matsen commented 8 years ago

@scharch Thanks for all the comments!

I'm not sure if I understand

I guess I was imagining a 2-step/iterative process, wherein you first assign the v/n/d/n/j boundaries as per currently and then can assess the location of the detected indel and make a determination that way.

and wondered if you could say a little more about how the junction annotation would help us with indel annotation.

psathyrella / partis

Big difference in length found in same partition #194