matsengrp / cft

Clonal family tree

Improve regular pruning strategy #172

Closed metasoarous closed 5 years ago

metasoarous commented 7 years ago

Our present tree pruning strategy only looks at the distance between the seed lineage and the sequences selected. Nothing guarantees that this gives us a nuanced view of the diversity along the lineage.

Imagine a tree that looks like this:

[Image: selection_031]

I'd really rather have nodes a, b & c and perhaps one or two of those in X. But the way we do things presently, we might only pick sequences from X.

It seems like we need to do something roughly along the lines of taking each of the seed lineage's internal nodes and proportionally sampling its descendants in order of proximity to the seed lineage. A nice feature of this is that it would also generalize naturally to the few situations where we see two seeds falling in the same cluster:

[Image: selection_033]
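
To make the proportional-sampling idea a bit more concrete, here is a minimal sketch for the single-seed case. It is only one possible reading of "proportional": every lineage node with off-lineage descendants gets one pick, and any remaining budget is shared in proportion to how many more leaves each node has. The toy tree, the `child -> (parent, branch_length)` representation and all function names are assumptions made up for illustration; none of this is the existing CFT prune code.

```python
from collections import defaultdict

# Hypothetical toy tree, child -> (parent, branch_length). "root" has no entry.
# a, b and c hang off successive lineage nodes; x1..x4 form a tight clade "X".
TREE = {
    "n1": ("root", 1.0), "n2": ("n1", 1.0), "n3": ("n2", 1.0),
    "seed": ("n3", 1.0),
    "a": ("root", 2.0), "b": ("n1", 2.0), "c": ("n2", 2.0),
    "xi": ("n3", 0.1),
    "x1": ("xi", 0.1), "x2": ("xi", 0.1), "x3": ("xi", 0.1), "x4": ("xi", 0.1),
}


def lineage(tree, seed):
    """Ancestors of the seed, ordered from the root down to the seed's parent."""
    path, node = [], seed
    while node in tree:
        node = tree[node][0]
        path.append(node)
    return path[::-1]


def group_by_attachment(tree, seed):
    """Map each lineage node to its off-lineage leaves, sorted by distance to it."""
    on_lineage = set(lineage(tree, seed)) | {seed}
    parents = {parent for parent, _ in tree.values()}
    groups = defaultdict(list)
    for leaf in tree:
        if leaf in parents or leaf == seed:
            continue  # skip internal nodes and the seed itself
        node, dist = leaf, 0.0
        while node not in on_lineage:  # walk up until we hit the lineage
            parent, length = tree[node]
            dist += length
            node = parent
        groups[node].append((dist, leaf))
    return {node: sorted(leaves) for node, leaves in groups.items()}


def proportional_pick(tree, seed, k):
    """Pick k leaves: one per lineage node, the rest proportional to group size."""
    groups = group_by_attachment(tree, seed)
    order = [n for n in lineage(tree, seed) if n in groups]
    if k < len(order):
        raise ValueError("k is smaller than the number of lineage nodes with leaves")
    quota = {n: 1 for n in order}
    capacity = {n: len(groups[n]) - 1 for n in order}
    total = sum(capacity.values())
    spare = min(k - len(order), total)
    if total:
        # Largest-remainder split of the spare picks across lineage nodes.
        shares = {n: spare * capacity[n] / total for n in order}
        for n in order:
            quota[n] += int(shares[n])
        leftover = spare - sum(int(s) for s in shares.values())
        by_remainder = sorted(order, key=lambda n: shares[n] - int(shares[n]),
                              reverse=True)
        for n in by_remainder[:leftover]:
            quota[n] += 1
    # Within each group, take the leaves closest to the lineage first.
    return [leaf for n in order for _, leaf in groups[n][:quota[n]]]


print(proportional_pick(TREE, "seed", 5))  # ['a', 'b', 'c', 'x1', 'x2']
```

On this toy tree the four leaves closest to the seed are all in X, which is the failure mode described above, whereas the sketch returns a, b, c plus two sequences from X.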

I could imagine a more formal and general treatment of the problem in terms of the work of moving mass through the tree towards the lineage, or some such. And perhaps there's some interesting analysis that could be done in this vein as well, e.g. testing whether the distribution of mass and diversity along a lineage correlates with the corresponding seed's binding affinity?

matsen commented 7 years ago

I like the idea that we would select one sequence from every subtree that branches off the seed lineage, and that we should pick the close ones. We should also pick a diversity thereof so that even if we don't have a close sequence the MRCA of the selected sequences would be close to the seed lineage. I'm less clear on how this would be achieved by proportional sampling by distance.
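
As a rough sketch of the first part of that (one close sequence per subtree branching off the seed lineage), something like the following could work. The `child -> (parent, branch_length)` dictionary, the toy tree and the function name are hypothetical and chosen only for illustration; the diversity/MRCA criterion is not covered here.

```python
# Hypothetical toy tree, child -> (parent, branch_length). "root" has no entry.
TREE = {
    "n1": ("root", 1.0), "seed": ("n1", 1.0),
    "a": ("root", 2.0),                      # single-leaf subtree off root
    "u": ("n1", 0.5),                        # subtree off n1 with two leaves
    "u1": ("u", 0.3), "u2": ("u", 0.1),
    "b": ("n1", 1.0),                        # another subtree off n1
}


def closest_leaf_per_subtree(tree, seed):
    """For each subtree branching off the seed lineage, return its closest leaf.

    A subtree is identified by its root, i.e. an off-lineage node whose parent
    is on the lineage. Distance is measured from the leaf to the lineage.
    """
    on_lineage = {seed}
    node = seed
    while node in tree:  # collect the seed and all of its ancestors
        node = tree[node][0]
        on_lineage.add(node)

    parents = {parent for parent, _ in tree.values()}
    best = {}  # subtree root -> (distance to lineage, leaf name)
    for leaf in tree:
        if leaf in parents or leaf in on_lineage:
            continue  # only consider off-lineage leaves
        node, dist = leaf, 0.0
        while tree[node][0] not in on_lineage:  # climb to the subtree root
            dist += tree[node][1]
            node = tree[node][0]
        dist += tree[node][1]  # branch joining the subtree to the lineage
        candidate = (dist, leaf)
        if node not in best or candidate < best[node]:
            best[node] = candidate
    return {root: leaf for root, (_, leaf) in best.items()}


print(closest_leaf_per_subtree(TREE, "seed"))  # {'a': 'a', 'u': 'u2', 'b': 'b'}
```

Ties are broken alphabetically by leaf name here, which is an arbitrary choice.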

metasoarous commented 7 years ago

Sure; proportional sampling by distance was just one idea about how to implement something simply/easily for comparison (EDIT: and that in some sense can be framed as an adaptation of what we're already doing...). What you're suggesting about looking for diversity in these clades sounds smart to me. This seems rather interesting actually... has anyone thought about this?

metasoarous commented 7 years ago

Now that #12 has been reopened (and closed) and the related prune code modified to avoid the issues discussed above, this doesn't seem to be particularly high priority as far as CFT is concerned, so I'm going to put it on Ice. Feel free to take it off if it's something we want to keep thinking about more, as part of ecgtheow or some such. (Or if pruning to multiple seeds is a thing we want to do; I'm not really sure it actually is atm).

matsen commented 5 years ago

It seems to me that pruning is going to be a lot less important moving forward with UMIs. @lauradoepker let me know if you disagree.

Closing, but feel free to re-open.