matsengrp / cft

Clonal family tree
5 stars 3 forks source link

min_adcl tree trimming? #125

Closed metasoarous closed 7 years ago

metasoarous commented 7 years ago

It's my understanding that @lauranoges et al. are sometimes interested in sequences that are highly evolved from the naive sequences (that's why we have #102, #121 and #124 ). However, our current tree trimming strategy is (for larger clusters) selecting against including such sequences, in favor of sequences that are close to the lineage between naive and seed. The current behavior was born out of an assumption that the most "interesting" or "useful" sequences are going to be those that are close to the naive-seed lineage. And indeed, that often is what we want to see, as it gives us a better picture of the evolution from naive to seed. But again, that's not every case, or we wouldn't have #102, etc.

Some time back, @matsen came up a method for selecting diverse representatives called veronoi. This or something like it might be a nice way to solve this problem. Because we know we're also interested in seed lineage, we may also want to do some sort of combined strategy, making sure we have enough sequences close to the seed lineage that we have a strong ancestral state reconstruction, but using veronoi to make sure we also sample the full diversity of the tree.

My sense is the course here really comes down to what @lauranoges and crew would find most useful/helpful. Thoughts?

lauradoepker commented 7 years ago

Oh! I hadn't reflected on the fact that within the clusters (aka partitions), we were trimming for sequences that were closer to naive even within the "clonally related" cluster of sequences. Perhaps we shouldn't be trimming? I agree this defeats our goals of finding highly mutated clonal sequences. I don't know what Veronoi is, but I look forward to learning about it. Thanks @metasoarous

matsen commented 7 years ago

Here's the citation:

Matsen, F. A., Gallagher, A., & McCoy, C. O. (2013). Minimizing the Average Distance to a Closest Leaf in a Phylogenetic Tree. Systematic Biology, 62(6), 824–836. https://doi.org/10.1093/sysbio/syt044

And here is a blog post: http://matsen.fredhutch.org/general/2012/05/31/adcl-paper.html

It's still going to exclude the most mutated sequences, with the idea that they might be sequencing errors, but it could prove useful. Perhaps take a look, @lauranoges ?

metasoarous commented 7 years ago

@lauranoges We definitely need to be trimming one way or the other (though perhaps we could up our threshold a bit). Our ancestral state reconstructions would take too long if we didn't, and we'd also get some really big trees which would make the UI harder to work with effectively. So really the question is how we trim.

Just to clarify on @matsen's comment about the most mutated sequences: Basically, what the algorithm is doing is clustering the tips of the (potentially) very big tree, and for each cluster selecting the medoid of that cluster. The medoid of a cluster is the member of the cluster which is on average closest to all the other members of that cluster. Intuitively, you can think of it as being in "the center" of the cluster. So as Erick points out, this selection criteria will tend to exclude the most mutated sequences in the cluster. But it will still do a good job of representing branches of the tree that exhibit above average mutation from naive (while, as Erick mentions, being less likely to select for sequencing errors).

metasoarous commented 7 years ago

@matsen and I had a chat about this and some concerns/issues came up I'm jotting down to make sure we keep pushing this thread forward.

The main issue is that if we try to represent the full phylogenetic diversity of the tree in our downsampling, we might be missing the detail we'd like around the seed lineage for arriving at a good ancestral state reconstruction. One idea is that we could "mix" a selection of nodes close to the seed lineage with a broad selection using rppr min_adcl_tree (voronoi). The problem with this (as we discussed) is that we'd end up with a somewhat unbalanced representation of the complete phylogenetic diversity/distribution. This felt a bit offsetting to both of us, and @matsen said he'd think about it a bit further.

Seems worth mentioning that one way we could go here to is to separately run both downsampling methods (togglable via a build flag), and load both into the cftweb interface as separate datasets (#123). If you want good seed lineage ancestral reconstruction, you can select one method, and if you're interested in broad diversity you can select the other. I say this with some reluctance as it means a bit more work on my end keeping data builds up to date for two separate methods, but it does seem to solve the problem nicely without imbalancing the trees. A related approach would be to have the build always compute both (as separate nests), but separate the results at the end into two datasets. This introduces some complexity in how we name and identity separate builds in the UI, but could be a smoother way forward if we think we'll always want to have access to both approaches. Something along these lines could also nicely generalize to our present forking of dnaml vs dnapars.

Thoughts @matsen?

metasoarous commented 7 years ago

Oh... the other thing we talked about in relation to this is #10. Once we're running on multiple time points, we may need to decide how/whether our downsampling strategy considers them (e.g. we wouldn't want our sampling criteria somehow being biased against one of the timepoints, or representing the timepoints unevently).

metasoarous commented 7 years ago

@matsen I'd rather like to just give this a go (either with another flag or via the dataset_param nest al a #155); Laura's data in particular has more big clusters than we saw in the kate-qrs dataset, so if they look for broad diversity there, they ain't gonna find it.

I think for a first pass, it's worth just downsampling using rppr, without doing anything fancy wrt either timepoints or lineage selection. My guess is that after looking at the results, we'll have a better sense of whether we really need to worry about these other issues or not. In particular, I've noticed that on a lot of the trees I've looked at so far, most of the diversity we've selected for is close to the naive, and then there's a branch that stick out with a few sequences aroung the seed. So we may really not be missing a ton by just looking for rppr clusters. But that's what I'd like to see.

Oh... and I just had an idea for timepoints. We wouldn't necessarily need to consider timepoints in our downsampling strategy. As long as we can map back from representative sequences to proportions from each time point, we can represent this information in little colored pie charts on the tips of the tree to get a sense of the timepoint distribution. This may present challenges with the auspice model ( #99 ), but we can probably find a way around this.

Permission to queue this up @matsen?

metasoarous commented 7 years ago

Case in point of a sparse lineage - http://stoat:5555/cluster/c6662947499378005173 (just one example, I know, but also not cherry picked).

metasoarous commented 7 years ago

Right... one more thought: With something like #155 in mind, it would be easy enough to just always build both of the downsampling strategies so that when you want to look at the lineage, you look at the lineage selection dataset/bulid. When you want to look at the broader diversity, you look at that. Maybe This let's you always see what you need without conflating strategies or producing "weirdly" balanced trees.

metasoarous commented 7 years ago

@matsen I've been taking a crack at this, but I've been unable to figure out how to get rppr min_adcl_tree to do what I want it to. For starters, the docs have the following options:

  --leaves          The maximum number of leaves to keep in the tree.
  --always-include  If specified, the leaf names read from the provided file will not be trimmed.

First off, I would expect that the output would be a list of sequences kept, not discarded. However, if I run with --leaves 20 on a tree with 82 leaves, I get an output of length 62. So that's weird. But also, when I try specifying a file with line separated leave names for --always-include, some of those seqs end up in the output, and some don't, making me think that option isn't actually doing anything. Do you know how this thing is supposed to work @matsen, or should I submit some issues for pplacer?

matsen commented 7 years ago

Docs: min_adcl_tree finds a good collection of sequences to cut from a tree

Not sure about --always-include... I don't recall.

metasoarous commented 7 years ago

I see; those docs hit when you rppr --help, but not rppr min_adcl_tree --help. Seems like that description would be helpful in both places.

Well... I can hack around --always-include by replacing the closest matches to the seed and naive seqs with those sequences, respectively. But I'd rather be doing the right thing if we can figure it out. Isn't there a Google Group or something?