Closed by eharkins 4 years ago
Ways that CFT downsamples in bin/process_partis.py:
Updates:
With --preserve-indels (which forces using non-indel-reversed sequences and therefore aligns them), muscle crashed because it exceeded a memory limit on a large cluster node. Once we no longer need to align in CFT, this shouldn't be an issue, although steps like building the FastTree tree and pruning will still take a very long time for such clusters. Thoughts on a long-term plan for downsampling clusters vs. not in CFT, @lauradoepker @matsen @psathyrella? This seems important given that the CFT-pruned fastas are the current standard for linearham input and are affected by how we downsample the input sequences to our pruning step.
Closing, given that the goal of "getting an unadulterated fasta of the partis cluster sequences" is achieved via partis/bin/extract-fasta.py, should we need to get a clonal family fasta with the exact set of sequences in the clonal family from partis.
Further discussion of downsampling strategies should take place in #295.
CFT downsamples partis clusters to 10k sequences according to multiplicity. This is probably to make FastTree building etc. happen in reasonable time, so we may want to keep doing things this way for the CFT pipeline specifically. However, @lauradoepker is relying (for other analyses like #186, in both Linearham and CFT contexts) on getting an unadulterated fasta of the partis cluster sequences from CFT, so we need to add this to the set of files output by bin/process_partis.py.
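To make the request concrete, here is a minimal sketch of the two outputs side by side: a multiplicity-ranked downsample and an untouched fasta of the full cluster. It is not the code in bin/process_partis.py. The record layout (`id`, `seq`, `multiplicity` keys), the function names, and the reading of "according to multiplicity" as "keep the highest-multiplicity sequences" are all assumptions made for illustration.

```python
def downsample_by_multiplicity(seqs, max_seqs=10000):
    """Keep the max_seqs records with the highest multiplicity.

    Hypothetical helper: assumes each record is a dict with an integer
    "multiplicity" key. Python's sort is stable, so records with equal
    multiplicity keep their original relative order.
    """
    ranked = sorted(seqs, key=lambda rec: rec["multiplicity"], reverse=True)
    return ranked[:max_seqs]


def to_fasta(seqs):
    """Render records as fasta text, one header line and one sequence line each."""
    return "".join(">{}\n{}\n".format(rec["id"], rec["seq"]) for rec in seqs)


# Toy cluster standing in for a partis cluster (made-up sequences).
cluster = [
    {"id": "a", "seq": "ACGT", "multiplicity": 1},
    {"id": "b", "seq": "ACGA", "multiplicity": 5},
    {"id": "c", "seq": "ACGG", "multiplicity": 3},
]

# Downsampled set for tree building; the input list is left unmodified.
downsampled_fasta = to_fasta(downsample_by_multiplicity(cluster, max_seqs=2))

# Unadulterated fasta: every sequence in the cluster, in input order.
full_fasta = to_fasta(cluster)
```

Because the downsample works on a sorted copy, the full cluster stays available for the unadulterated output regardless of the cap applied for tree building.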