matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Only perform clustering steps if fewer than ____ sequences in "clustering step 0" #183

Closed lauradoepker closed 7 years ago

lauradoepker commented 7 years ago

@psathyrella and I realized that forcing extra clustering steps on all the trees may be wasteful and unnecessary because we only need this function when the "clustering step 0" (=most likely partition) doesn't have enough sequences. "Enough" sequences is very subjection, however.

Perhaps we should only perform additional clustering steps if we have fewer than 50 sequences?

Thoughts @psathyrella ?

metasoarous commented 7 years ago

This sounds like a great idea to me! I'd love to reduce the compute and data overhead on cft and cftweb, and this would be very helpful towards that end.

psathyrella commented 7 years ago

Adding the option to specify the minimum final/desired cluster size, rather than the final number of clusters (as we did before) is definitely something that makes sense to implement, and looking through to remind myself how I did that, it looks feasible. It'll involve enough testing and digging through the guts of things that I think the earliest it'd happen is several weeks out, though.

metasoarous commented 7 years ago

I've now implemented this in the SConstruct. It's potentially a little flawed in that the 50 cutoff doesn't account for sequences removed because of stop codons or frameshifts. However, this should only ever put us off by a sequence or two, so I'm not terribly worried about it, and it's a good thing for increasing the performance of the cft pipeline.

I'm going to close this here for now, but with your permission @psathyrella, I'd like to open a corresponding issue for partis. Seems like it's still a valuable option for partis regardless of my workaround. Does that sound okay?

psathyrella commented 7 years ago

yeah, go for it, and please link back to this to remind me of the history

metasoarous commented 7 years ago

Great; see https://github.com/psathyrella/partis/issues/239