Publish prune.py for BF520.1 paper

lauradoepker commented 6 years ago

We'd like to publish the prune.py script with our upcoming manuscript. Unfortunately, we have two different prune.py scripts at work for the BF520.1 paper because we have old data from partis & CFT v9 and new/better data from partis & CFT v15.

@WSDeWitt and @metasoarous, can you summarize the functional differences between these scripts? I recall that you caught a bug and reworked the script a few months ago, but unfortunately this paper used both the old and new versions. Your summary will help me (and @matsen?) decide whether I need to publish both versions of this script or just the more recent one.

I have compared the annotated fasttrees that depict which sequences are chosen by the script so you can have a visual handle on how different the pruned lists are between "Ecgtheow" and "ML/Pars" datasets. Key: pruned = red, naive and seed seqs = blue, rest of the cluster = black.

Ecgtheow (partis v15) = https://github.com/matsengrp/cft/blob/master/bin/prune.py
ML/Pars (partis v9) = https://github.com/matsengrp/cft/blob/lauras-first-almost-immortal-trees/bin/prune.py

Ecgtheow FastTree: bf520 1-igh family_0 healthy seedpruned 100 ids tre
ML/Pars FastTree: pruned_ids txt tre

Command used to manually generate this ML/Pars FastTree after the fact:
xvfb-run -a ./python/annotate_fasttree_tree.py /fh/fast/matsen_e/grp/matsengrp/working/csmall/cft/output/archived/laura-mb-v9-dnaml/BF520.1-igh/BF520-h-IgG/run-viterbi-best-plus-0/fasttree.nwk /fh/fast/matsen_e/grp/matsengrp/working/csmall/cft/output/archived/laura-mb-v9-dnaml/BF520.1-igh/BF520-h-IgG/run-viterbi-best-plus-0/pruned_ids.txt --naive naive0 --seed BF520.1-igh

^ @metasoarous, I think this is the correct-ish path to the data that Lauren used (tagged "lauras-first-almost-immortal-trees"). You once tried to recreate Lauren's trees for us. Could you verify that I found the right files in your working output directories?

metasoarous commented 6 years ago

@lauranoges The original prune.py script was not implemented as intended. It may be helpful to look at this particular issue/comment: https://github.com/matsengrp/cft/issues/12#issuecomment-316366942.

In summary, the original method was simply taking the closest N leaves to the seed lineage.

As you can see from #12, the intent here was that we first take all of the K clades branching off from the seed lineage, and from each such clade select the closest N/K nodes to the seed lineage.

The difference here is a bit subtle, but #172 tries to explain why the original/simple approach may not always do what we'd want/expect. In short, you could end up not sampling some of the clades branching off the seed lineage, as the pathological example in #172 illustrates.
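To make the difference concrete, here is a rough sketch of the two selection strategies on a toy tree. This is not the code from either prune.py; the Node class, the function names, and the example tree are all made up for illustration. It also assumes that "distance to the seed lineage" means the summed branch length from a leaf up to the point where its clade attaches to the lineage, and it ignores any special handling of the naive sequence.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical minimal tree node; dist is the branch length to the parent."""
    name: str
    dist: float = 0.0
    children: List["Node"] = field(default_factory=list)


def seed_lineage(root: Node, seed: str) -> Optional[List[Node]]:
    """Return the nodes on the path from root down to the seed leaf, or None."""
    if root.name == seed:
        return [root]
    for child in root.children:
        path = seed_lineage(child, seed)
        if path is not None:
            return [root] + path
    return None


def leaf_distances(clade: Node, offset: float = 0.0) -> Dict[str, float]:
    """Map each leaf in a side clade to its distance from the lineage (assumed metric)."""
    d = offset + clade.dist
    if not clade.children:
        return {clade.name: d}
    out: Dict[str, float] = {}
    for child in clade.children:
        out.update(leaf_distances(child, d))
    return out


def side_clades(root: Node, seed: str) -> List[Dict[str, float]]:
    """One {leaf: distance} dict per clade branching off the seed lineage."""
    lineage = seed_lineage(root, seed)
    if lineage is None:
        raise ValueError(f"seed {seed!r} not found in tree")
    on_lineage = {id(n) for n in lineage}
    return [leaf_distances(child)
            for node in lineage
            for child in node.children
            if id(child) not in on_lineage]


def prune_global(root: Node, seed: str, n: int) -> List[str]:
    """Original behaviour as described above: the N leaves closest to the lineage."""
    leaves: Dict[str, float] = {}
    for clade in side_clades(root, seed):
        leaves.update(clade)
    return sorted(leaves, key=lambda name: (leaves[name], name))[:n]


def prune_per_clade(root: Node, seed: str, n: int) -> List[str]:
    """Intended behaviour: from each of the K side clades keep the closest N/K leaves."""
    clades = side_clades(root, seed)
    if not clades:
        return []
    per_clade = max(1, n // len(clades))  # assumption: keep at least one per clade
    kept: List[str] = []
    for clade in clades:
        kept.extend(sorted(clade, key=lambda name: (clade[name], name))[:per_clade])
    return kept


def example_tree() -> Node:
    """Toy tree: clade A sits very close to the lineage, clade B is further away."""
    clade_a = Node("A", 0.1, [Node("a1", 0.01), Node("a2", 0.01), Node("a3", 0.01)])
    clade_b = Node("B", 0.5, [Node("b1", 0.1), Node("b2", 0.1)])
    lineage_node = Node("L", 0.1, [clade_b, Node("seed", 0.1)])
    return Node("root", 0.0, [clade_a, lineage_node])


if __name__ == "__main__":
    tree = example_tree()
    print(prune_global(tree, "seed", 2))     # ['a1', 'a2']
    print(prune_per_clade(tree, "seed", 2))  # ['a1', 'b1']
```

With N = 2, the global version spends its whole budget on clade A and never samples clade B, while the per-clade version keeps one leaf from each. That is the same kind of failure the pathological example is meant to show, just on a much smaller tree.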

Not quite sure how to sell that story for the paper other than to just explain things as best you can. For reproducibility, I do think it's worth having a pointer to each of the scripts though.

As for the "immortal trees", I think you have the right dataset. Do things seem to match up there? I can check build versions or something if you like.

metasoarous commented 6 years ago

@lauranoges I think we resolved everything here when we met last week, yes?

lauradoepker commented 6 years ago

Yes

Publish prune.py for BF520.1 paper #242