Closed lauradoepker closed 7 years ago
This sounds like a great idea to me! I'd love to reduce the compute and data overhead on cft and cftweb, and this would be very helpful towards that end.
Adding the option to specify the minimum final/desired cluster size, rather than the final number of clusters (as we did before) is definitely something that makes sense to implement, and looking through to remind myself how I did that, it looks feasible. It'll involve enough testing and digging through the guts of things that I think the earliest it'd happen is several weeks out, though.
I've now implemented this in the SConstruct. It's potentially a little flawed in that the 50 cutoff doesn't account for sequences removed because of stop codons or frameshifts. However, this should only ever put us off by a sequence or two, so I'm not terribly worried about it, and it's a good thing for increasing the performance of the cft pipeline.
I'm going to close this here for now, but with your permission @psathyrella, I'd like to open a corresponding issue for partis. Seems like it's still a valuable option for partis regardless of my workaround. Does that sound okay?
yeah, go for it, and please link back to this to remind me of the history
@psathyrella and I realized that forcing extra clustering steps on all the trees may be wasteful and unnecessary because we only need this function when the "clustering step 0" (=most likely partition) doesn't have enough sequences. "Enough" sequences is very subjection, however.
Perhaps we should only perform additional clustering steps if we have fewer than 50 sequences?
Thoughts @psathyrella ?