matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Keep all seeds when pruning and minadcl thinning #180

Closed lauradoepker closed 6 years ago

lauradoepker commented 7 years ago

When we prune or thin our trees, can we specifically keep all seed sequences in the cluster? I think they’ve gotten pruned out and thinned out of both regularly-pruned and minadcl trees. @csmall (not sure who else to tag here)

Example: 043 is clonally-related to other seeds such as 048, 049, 022 but they're not on this tree http://stoat:5052/cluster/c486e4c4-127e-3ec9-93c4-fcea8d5692b2

metasoarous commented 7 years ago

Yes! This is a great idea. Ideally, we would figure out which seeds are clonaly related before hand, and do one computation for each such set of seeds, instead of running the pipeline for each seed separately.

@psathyrella I'm sure I can find a way to do this in cft, but I wonder if it doesn't make more sense for this to happen in datascripts? After all, it seems like it would be better if partis was only getting run once for a cluster with multiple seeds, right? Should there be a pre-clustering step for finding seed sequences that are likely to belong to the same clonal family?

lauradoepker commented 7 years ago

My understand is that partis is run agnostic to which seeds are related or not and it's a very good thing when partis "finds" and confirms to us that certain seeds are in fact clonally related.

As far as not pruning out seed sequences, can't we just search for any sequence that starts with QB**** or whatnot? Or has different formatting than all the other leaves?

psathyrella commented 7 years ago

Yeah we can't say beforehand which seeds are clonally related (at least if we want to get the same answer as we would get by running separately and seeing what happens), because the likelihood of them being clonal depends on what other sequences are in the sample.

Visually, if the sample has lots of sequences that are "between" two seed sequences, then they're more likely to be clonal (i.e. they're on opposite "sides" of the family).

metasoarous commented 7 years ago

That makes sense. Have we looked yet at how often we tend to pick up the exact same set of seed sequences though? I agree that if we don't always get the same sets, it's good to run separately on each, so that we get multiple angles of verification that the seeds are clonally related. But if we literally never see this, it would be more efficient in computation and data payload (as well as conceptual overhead in navigating the data) to think about just a single cluster for each such set.

lauradoepker commented 7 years ago

@metasoarous Back in the older trees, the clonal seeds were indeed selected out with one another. Sometimes the tree was missing one or another, but our expected seeds were mostly present in each of the trees.

Now they're all missing, presumably because they've been pruned out (?). Not sure how this pruning process changed between before and now.

psathyrella commented 7 years ago

Yeah I don't think we want to do that much tuning to the particulars of individual sample/seed combinations. That would basically require that we run once to see what the clustering results are like, and then run again based on those results.

What I think would be sensible would be to change our seed sequence definitions by hand beforehand, e.g. if BF520.x and BF520.y are very similar, we could replace them with a new seed that's their joint naive sequence. But I don't think we can automate this, because as above ^ it's answering a different question than are they clonal in the context of a given sample. But it just adds a sanitization step to what's already the by-hand process of adding the seed sequences.

For instance, running plain partitioning on the new MG505 seed sequences, the question arises @lauranoges, would you be comfortable collapsing any of these groups of seeds into one sequence?

p

This wouldn't help in deciding whether cft should support multiple sequences-of-interest (i.e., keep track of seed sequences that aren't the seed at the moment), but perhaps it gives enough of what we want that we can kick that can down the road.

lauradoepker commented 7 years ago

@psathyrella Those clusters are exactly the families I gave you, yes. I checked them all by hand. I don't want to collapse them because I want to see where they fall on the new trees. I maintain that the best solution is to specifically keep the seed sequences during the pruning steps.

metasoarous commented 7 years ago

OK; Sounds good. Just wanted to make sure we weren't doing extra work unnecessarily.

I should be able to sort this out. I don't want to just grep for things that look like QB* though, because I hate baking assumptions like this into the code as much as I can help it (reduces generality and leaves bugs for folks down the line). It's a little more work, but I think I can properly figure out which seed sequences fall in each unpruned cluster, and then prune/filter taking those seed sets into account.

@lauranoges Since @WSDeWitt is going to be doing some work on the lineage selection (pruning) strategy before we rerun, should we try to sneak this in as well before rebuilding? Also @WSDeWitt: Do you think you'd be able to adapt your work so that if multiple comma separated seed ids are passed through the command line invocation, it keeps all of them, but specifically prunes to the lineage of the first? Or maybe we should have two cli args: --seed and --other-seeds?

lauradoepker commented 7 years ago

Yes, sneaking this in is actually a must-do @metasoarous because it's critical for rational analysis of the trees on our end. Let's do it!

psathyrella commented 7 years ago

@metasoarous I'm perhaps misunderstanding something, but there shouldn't be any reason to be thinking of greping and whatnot -- all the seed info, and filtering that info by subject/sample/whatever is all automatically in datascripts.

psathyrella commented 7 years ago

e.g. when I added this functionality to the partis fasta reader, datascripts passes all the seeds as an extra argument like --queries-of-interest, and then the fasta reader pulls those out before deciding which queries to drop.

metasoarous commented 7 years ago

@psathyrella I was responding to Laura's comment from a couple days ago. I think I have all the information I need to do things properly. I was just stating for the record.

metasoarous commented 7 years ago

Closed as of d72057d

lauradoepker commented 7 years ago

In MG505 septmpts dnaml from the 2017.08.02 deployment on stoat:5052:

We have cluster 0 with logprob -774 (i.e. best) with a tree of only clonal seed seqs (4 of them). http://stoat:5052/cluster/5c5d90c3-9100-3016-9711-cfb0d5e04572

Then we have cluster 1 with logprob -126402 (i.e. worse) with a tree with hundreds of sequences but not all of the other seed sequences in the family anymore (missing one or more of the 3 other clonal seeds). These sequence hits in this lower logprob cluster look really good, I think. They are quite similar to the seeds. http://stoat:5052/cluster/ead289e8-e004-31e3-85d5-ccaaddcdbf4f

@psathyrella @metasoarous @WSDeWitt this appears to be a pruning problem again, perhaps?

metasoarous commented 7 years ago

@lauranoges @psathyrella My sense from discussion on Slack is that @psathyrella's presently rebuilding with a fix to the subsampling on his end which should re-resolve this issue. Going to assign and move into In Progress.

metasoarous commented 6 years ago

@psathyrella @lauranoges Does the v14/v3 data close this, or does this depend on v15/v4?

metasoarous commented 6 years ago

This should be fixed now to my understanding according to the way we are retaining seeds in both our pruning and our early process_partis downsampling. If there's anything upstream of my data that needs to be updated, feel free to reopen or open a new issue.