matsengrp / cft

Clonal family tree

Output full partis cluster fasta #293

Closed eharkins closed 4 years ago

eharkins commented 4 years ago

CFT downsamples partis clusters to 10k sequences according to multiplicity. This is probably so that FastTree building etc. finishes in reasonable time, so we may want to keep doing things this way for the CFT pipeline specifically. However, @lauradoepker is relying (for other analyses like #186, in both Linearham and CFT contexts) on getting an unadulterated fasta of the partis cluster sequences from CFT, so we need to add this to the set of files output by bin/process_partis.py.
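To make the request concrete, here is a rough sketch (not the actual bin/process_partis.py code) of writing the full cluster fasta before any downsampling happens. The record keys `uid`, `seq` and `multiplicity`, both function names, and the "keep the highest-multiplicity sequences" rule are all assumptions made for illustration.

```python
import io


def write_fasta(records, handle):
    """Write every record to an open text handle in FASTA format."""
    for rec in records:
        handle.write(">{}\n{}\n".format(rec["uid"], rec["seq"]))


def downsample_by_multiplicity(records, max_sequences):
    """Keep the max_sequences records with the highest multiplicity.

    Assumed rule, for illustration only. sorted() is stable, so ties
    keep their input order.
    """
    ranked = sorted(records, key=lambda rec: -rec["multiplicity"])
    return ranked[:max_sequences]


# Hypothetical toy cluster; real clusters come from partis output.
cluster = [
    {"uid": "a", "seq": "ACGT", "multiplicity": 1},
    {"uid": "b", "seq": "ACGA", "multiplicity": 5},
    {"uid": "c", "seq": "ACGG", "multiplicity": 3},
]

# Write the unadulterated cluster first, then downsample for tree building.
full_fasta = io.StringIO()
write_fasta(cluster, full_fasta)
kept = downsample_by_multiplicity(cluster, max_sequences=2)
```

The point is only the ordering: the full fasta is emitted from the untouched cluster, and the downsampled set feeds the rest of the pipeline.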

eharkins commented 4 years ago

Ways that CFT downsamples in bin/process_partis.py:

  1. if we specify --max-sequences (this is always set in CFT currently and accounts for the 10k limit)
  2. if we specify any of --remove-stops --remove-frameshifts --remove-mutated-invariants (all three are always set in CFT)
  3. if we specify --largest-cluster-across-partitions, it will deduplicate, since this option often results in duplicate IDs ending up in the cluster
  4. if we specify --match-indels-in-uid it will limit to those seqs with the indel of interest
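As a rough illustration of items 2 and 3 above (again, not the real bin/process_partis.py logic), the sketch below drops sequences by quality flags and deduplicates by ID. The record keys `has_stop`, `in_frame` and `mutated_invariants` and the helper names are hypothetical; only the three flag names come from the list above.

```python
def passes_filters(rec, remove_stops=True, remove_frameshifts=True,
                   remove_mutated_invariants=True):
    """Return True if a record survives the enabled filters."""
    if remove_stops and rec["has_stop"]:
        return False
    if remove_frameshifts and not rec["in_frame"]:
        return False
    if remove_mutated_invariants and rec["mutated_invariants"]:
        return False
    return True


def dedup_by_uid(records):
    """Keep the first record seen for each uid, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        if rec["uid"] not in seen:
            seen.add(rec["uid"])
            unique.append(rec)
    return unique


# Hypothetical records: "b" appears twice, "c" has a stop codon,
# "d" is out of frame.
records = [
    {"uid": "a", "has_stop": False, "in_frame": True, "mutated_invariants": False},
    {"uid": "b", "has_stop": False, "in_frame": True, "mutated_invariants": False},
    {"uid": "b", "has_stop": False, "in_frame": True, "mutated_invariants": False},
    {"uid": "c", "has_stop": True, "in_frame": True, "mutated_invariants": False},
    {"uid": "d", "has_stop": False, "in_frame": False, "mutated_invariants": False},
]

filtered = [rec for rec in dedup_by_uid(records) if passes_filters(rec)]
```

Each step shrinks the cluster independently of --max-sequences, which is why the resulting fasta can differ from the partis cluster even for clusters well under 10k.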

Updates:

  1. partis/bin/extract-fasta.py now exists, should we need to get a clonal family fasta with the exact set of sequences in the clonal family from partis
  2. I tried running a cluster of more than 10k sequences with CFT and had success when using indel-reversed sequences (without aligning). When I tried using --preserve-indels (which forces using non-indel-reversed sequences and therefore aligns them), muscle crashed because it exceeded a memory limit on a large cluster node. Once we no longer need to align in CFT, this shouldn't be an issue, although steps like building the FastTree and pruning will still take a very long time for such clusters.

Thoughts on a long-term plan for downsampling clusters vs. not in CFT, @lauradoepker @matsen @psathyrella? This seems important given that the CFT-pruned fastas are the current standard for linearham input, and they are affected by how we downsample the input sequences to our pruning step.

eharkins commented 4 years ago

Closing given that the goal of "getting an unadulterated fasta of the partis cluster sequences" is achieved via "partis/bin/extract-fasta.py should we need to get a clonal family fasta with the exact set of sequences in the clonal family from partis".

Further discussion of downsampling strategies should take place in #295