matsengrp / cft

Clonal family tree

Reconsider subsampling/pruning procedure #295

eharkins closed this issue 4 years ago

eharkins commented 4 years ago

After discussion with @lauradoepker and @mmshipley, the --max-sequences option to bin/process_partis.py should no longer be on by default in CFT and should instead be exposed as an outward-facing option to CFT.

Given this, we need to define the mechanism by which we identify these large clusters and decide whether to proceed with --max-sequences (despite such clusters being unlikely with UMIs):

Ideally, the Overbaugh lab person looking at the partis output for a cluster of interest will see that it has > 10k sequences and then make this decision themselves. But not every cluster going from partis -> CFT is getting examined at the moment, so do we need to crash with a helpful message if someone runs CFT on a > 10k-sequence cluster that needs alignment? muscle will crash anyway in that case, so the only alternative seems to be setting --max-sequences automatically and warning that we have done so.

Note: this becomes less about crashing and more about things just taking a long time to run once aligning is no longer happening in CFT (#274), since that is the only part that will crash with > 10k sequences in my experience.

Laura also suggests that --max-sequences, when used, should work like this:

psathyrella commented 4 years ago

That all sounds sensible, and I don't think I really have anything useful to contribute to the fix that should happen right now, but a potential long-term fix might incorporate the fact that almost all clusters larger than 10k are light chain superclusters incorporating many actual families. So, assuming that the trees for these superfamilies usually consist of many widely separated sublineages, it might make sense to downsample by only running on the sublineage that includes the seed.
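A minimal sketch of what "running only on the sublineage that includes the seed" could look like, using ete3 and assuming we already have a newick string for the cluster; this is an illustration only, not what cft's bin/prune.py currently does:

    # Illustrative sketch only, not cft's actual pruning code.
    from ete3 import Tree

    def seed_sublineage_ids(newick, seed_id, max_leaves=10000):
        """Return leaf names of the largest clade containing seed_id that
        still has at most max_leaves leaves."""
        tree = Tree(newick, format=1)
        node = tree.search_nodes(name=seed_id)[0]
        # Walk rootward while the enclosing clade stays under the size cap
        # (len(node) in ete3 is the number of leaves under that node).
        while node.up is not None and len(node.up) <= max_leaves:
            node = node.up
        return node.get_leaf_names()

Everything outside that clade (the widely separated sublineages mentioned above) would simply be dropped before any alignment or ML tree building.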

eharkins commented 4 years ago

Noted, thanks @psathyrella! Is there information about which sequences belong to these sublineages recorded somewhere in partis output to make this possible post-partis, or would we need to re-partition such a "supercluster"?

psathyrella commented 4 years ago

Well, sublineages would be a tree property, so in principle they would come from cft running trees. But you can always have partis make these cluster path/FastTree mashup trees I keep mentioning.

eharkins commented 4 years ago

This makes sense in the absence of the need to align (whether because we use indel-reversed sequences, there are no indels, or things have already been aligned in partis), since I was under the impression that we need same-length sequences to make a FastTree. Then we could do what you are talking about by removing lineages distant from the seed. In fact, by pruning to the seed lineage in https://github.com/matsengrp/cft/blob/master/bin/prune.py, do we not already do this?

Also I'm not sure what

these cluster path/FastTree mashup trees I keep mentioning

refers to.

psathyrella commented 4 years ago

I talked about them in the last bcr meeting. They're the approximate trees partis makes if you run --get-tree-metrics, constructed starting from the tree implicit in the hierarchical agglomerative clustering and then refined with FastTree.

eharkins commented 4 years ago

Ah I see, makes sense! Sorry for not remembering.

Sounds like it could be useful in subsampling large clusters, though does the issue I'm describing make sense re: aligning?

It sounded like the original motivation for --max-sequences was that large clusters over a certain threshold were overwhelming both memory use and compute time for:

  1. alignment
  2. un-pruned cluster tree building
  3. un-pruned cluster tree pruning

However, the --max-sequences implementation isn't as smart as we want it to be for subsampling large clusters (in cases of many seqs with equal multiplicity), so we'd really like to subsample in some smarter way, like a pruning strategy.

This requires a tree, which we can build any way we want so long as we don't need to align in CFT. But currently in CFT, in cases where we need to align, we need to subsample pre-alignment, which means pre-tree-building; thus my (edited) question:

do we need to crash with a helpful message if someone runs CFT on a > 10k-sequence cluster that needs alignment? muscle will crash anyway in that case, so the only alternative seems to be setting --max-sequences automatically and warning that we have done so.

which maybe explains your statement:

I don't think I really have anything useful to contribute to the fix that should happen right now

psathyrella commented 4 years ago

I'm not sure that it relates to whether cft aligns or not -- when partis runs FastTree it's on indel-reversed sequences.

Well, if all you need is a tree to inform subsampling strategies, it seems highly useful to have an approximate tree from partis clustering. You actually don't need to run partis, and even running FastTree to "refine" the hierarchical agglomeration tree is probably a waste of time for you, since it's only refining large multifurcations, which'll only occur among very similar sequences. So you just call this method of the clusterpath class. When --get-tree-metrics is set in a partis run, it calls that with get_fasttrees=True, so you'd probably just leave it False. In order to get the cluster path trees, you also need to tell the clustering run to save the entire cluster path, which I think it doesn't do by default (since it's large), but I may have turned it on, I dunno.

eharkins commented 4 years ago

Great! This seems like it would work as a method of informing subsampling strategies when we have these > 10k-sequence clusters with many sequences having the same multiplicity.

I will implement this and talk with @lauradoepker about what the ideal behavior should be in terms of --max-sequences applying this downsampling by default vs. optionally, as well as think about how this fits in with downsampling by multiplicity.

eharkins commented 4 years ago

So long as we are still aligning sequences in CFT in some cases, we still need to think about how we should subsample in order to be able to align (since we are limited by memory to aligning some max # of seqs). Options I can think of for if we have too many sequences to align:

  • subsample somehow before aligning sequences
  • crash

Assuming we don't want to crash in this case, this is important to know:

Right now, the subsampling setup in CFT is this:

  1. read in partis output which has some original number of sequences in the clonal family
  2. if there are more than 10k sequences, take the top 10k ranked by multiplicity, with no respect to ties
  3. align if we have given --preserve-indels, otherwise use the indel-reversed sequences
  4. build a fasttree on aligned sequences from step 3
  5. subsample (prune) the fasttree to max 100 seqs, then proceed to building ML trees, etc.

So we already

subsample somehow before aligning sequences

in step 2, but we are not satisfied with how that's done, since it treats all sequences with equal multiplicity the same, when in fact some might be more worth keeping when considering the tree.
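For concreteness, the step-2 behavior amounts to something like the following (hypothetical variable names, not the actual process_partis.py code):

    # Hypothetical illustration of the current --max-sequences behavior.
    # Python's sort is stable, so among sequences with equal multiplicity
    # (e.g. thousands of singletons) the survivors are simply whichever
    # came first in the input, an essentially arbitrary choice.
    max_sequences = 10000
    kept = sorted(seq_ids, key=lambda s: multiplicities[s], reverse=True)[:max_sequences]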

Assuming we want to rework how we do step 2, one idea is: take a tree (partis or fasttree) created from indel-reversed sequences (so we don't need to align) and subsample from that tree; after subsampling we can align the non-indel-reversed sequences corresponding to the subsampled set and proceed to step 4.

However, it might make sense to try to consolidate the subsampling in step 2 with the subsampling (pruning) in step 5. This would look something like the following (see the sketch after this list):

  1. read in partis output which has some original number of sequences in the clonal family
  2. build a fasttree (or use the partis tree) from indel-reversed sequences (since we haven't aligned yet)
  3. subsample (prune) tree from step 2 to max 100 seqs
  4. align (if we have given --preserve-indels and there are indels) the non-indel-reversed sequences corresponding to the pruned set of sequences
  5. build ML trees using the aligned 100 seqs from step 4
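To make the ordering concrete, here is a rough sketch of that consolidated flow; every helper here (fasttree, prune_tree, align, build_ml_trees) is a placeholder passed in as a callable, not an existing cft function:

    # Placeholder sketch of the consolidated flow; the callables stand in
    # for whatever tools end up doing each step (FastTree, bin/prune.py,
    # muscle, ML tree building, ...).
    def process_cluster(cluster, seed_id, fasttree, prune_tree, align,
                        build_ml_trees, preserve_indels=False, max_leaves=100):
        # step 2: quick tree from indel-reversed seqs, so no alignment is needed yet
        tree = fasttree(cluster["indel_reversed_seqs"])
        # step 3: prune that tree down to at most max_leaves sequences, keeping the seed
        kept_ids = prune_tree(tree, seed_id, max_leaves)
        # step 4: only now align, and only the small pruned set, if we're preserving indels
        seqs = {sid: cluster["input_seqs"][sid] for sid in kept_ids}
        if preserve_indels and cluster["has_indels"]:
            seqs = align(seqs)
        # step 5: ML trees on the pruned (and possibly aligned) set
        return build_ml_trees(seqs)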

The only downside is that the tree we subsample from does not take indels into account.

Thoughts on this or alternatives @matsen @lauradoepker @mmshipley @psathyrella ?

matsen commented 4 years ago

I like this idea as long as it's not too cumbersome to code.

The indels, unfortunately, don't contribute much to phylogenetic analysis. Certainly a no-indel tree is sufficient for an initial prune.

But couldn't we prune down to a moderate set, say 500 sequences, align, and then do another round of pruning?
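In the same placeholder terms as the sketch above, the two-round idea might look like this; again, nothing here is existing cft code:

    # Placeholder sketch of two-round pruning: a coarse, indel-blind prune
    # on an unaligned (indel-reversed) tree, then a finer prune once the
    # moderate set has been aligned.
    def two_round_prune(cluster, seed_id, fasttree, prune_tree, align, build_ml_trees):
        rough_tree = fasttree(cluster["indel_reversed_seqs"])
        moderate_ids = prune_tree(rough_tree, seed_id, 500)   # coarse prune, ignores indels
        aligned = align({sid: cluster["input_seqs"][sid] for sid in moderate_ids})
        refined_tree = fasttree(aligned)
        final_ids = prune_tree(refined_tree, seed_id, 100)    # final prune sees the indels
        return build_ml_trees({sid: aligned[sid] for sid in final_ids})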


eharkins commented 4 years ago

@matsen yes we could do that. Would the idea be that we still want to consider indels in our final step of pruning?

Separately, would we want to try to respect multiplicity at all as --max-sequences has done in bin/process_partis.py or do we want to just focus on tree-based subsampling? I created this issue to figure out how to rework the --max-sequences option, since I think we either need to re-implement how it subsamples there (since it ends up being arbitrary in many cases), or just remove that option and require that subsampling be done separately.

psathyrella commented 4 years ago

Sounds good, and just to clarify/remind, the idea I think is that most clusters over 10k sequences are spurious light chain super-clusters, so "pruning"/subsampling could presumably take the form of very simple-mindedly removing distantly related branches.

matsen commented 4 years ago

Would the idea be that we still want to consider indels in our final step of pruning?

Yep!

would we want to try to respect multiplicity at all

Yes, I think that's a good idea. I don't know how this currently works, and/or how it would work in a tree-based setting. Chat Monday?

eharkins commented 4 years ago

@matsen and I discussed this today and we both feel that if we are considering re-working the subsampling/pruning methods used in CFT in any way, we should spend the time designing something that will be applicable to both current and future datasets.

I'm recording here some of what we talked about, to be followed up with @mmshipley @lauradoepker @psathyrella during our meeting next Wednesday.

subsampling/pruning methods

Currently, we have a subsampling/pruning procedure outlined in my above comment. If we are considering changing this at all, we could use (but are not limited to) some combination of the following methods:

data considerations

eharkins commented 4 years ago

As @matsen suggested today, for now we will just stop execution of CFT if we encounter a cluster above our threshold for size. We don't predict encountering such clusters in future datasets using UMIs, and if we do we would probably like to decide how to proceed on a case-specific basis.

I will just add an exception based on cluster size, and remove --max-sequences from our call to bin/process_partis.py.

Should we still consider implementing some solution for --max-sequences in the case of many 'singlets' (sequences with multiplicity = 1) since it will arbitrarily downsample in that case? Even though that option will no longer be used in CFT for now, we might consider resolving this case if we are going to leave the option in the script.
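For reference, the check described above is just a size guard along these lines (the threshold, names, and message are illustrative, not the actual cft code):

    # Illustrative only, not the actual cft code.
    MAX_CLUSTER_SIZE = 10000  # assumed threshold

    def check_cluster_size(cluster_seqs, max_size=MAX_CLUSTER_SIZE):
        if len(cluster_seqs) > max_size:
            raise Exception(
                "cluster has %d sequences (> %d); CFT no longer subsamples these "
                "automatically, so decide how to handle this cluster (e.g. "
                "process_partis.py --max-sequences) and rerun"
                % (len(cluster_seqs), max_size))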

matsen commented 4 years ago

Why wouldn't we leave in --max-sequences and have it set the count at which the exception is thrown?

Not sure why the "singlet" case is different here.

eharkins commented 4 years ago

That works too; I was assuming that the exception was something specific to the pipeline and considering that --max-sequences might still get used to downsample by multiplicity for uses of bin/process_partis.py beyond getting called from our default pipeline.

matsen commented 4 years ago

You know better than me! This was just my reaction.

eharkins commented 4 years ago

In the above PR, I went with:

add an exception based on cluster size, and remove --max-sequences from our call to bin/process_partis.py.

I'm also adding a warning to --max-sequences, given that

the exception was something specific to the pipeline and considering that --max-sequences might still get used to downsample by multiplicity for uses of bin/process_partis.py beyond getting called from our default pipeline.