matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

Script up some parsimony analysis #2

Closed matsen closed 8 years ago

matsen commented 8 years ago

The next step is to build parsimony trees from the Victora data. This requires parsing the Victora data files and writing a file suitable for a phylogenetics program. In my exploratory steps I used PHYLIP: I think I first converted the Victora files to FASTA, then used seqmagick to convert them to PHYLIP format. Note that .phy files have a maximum sequence name length (10 characters in the strict PHYLIP format). You'll need to deal with that, perhaps by recoding the names or by finding a substring that's unique.
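As a rough illustration of the recoding option, here is a minimal sketch that parses aligned FASTA text and writes strict PHYLIP with short generated names. The helper names `read_fasta` and `to_strict_phylip` and the `seq1`, `seq2`, ... naming scheme are assumptions made for this example, not anything in the repo; it also assumes the sequences are already aligned to equal length.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_strict_phylip(records, name_width=10):
    """Recode names as seq1, seq2, ... (hypothetical scheme).

    Returns (phylip_text, name_map), where name_map maps each short
    name back to the original FASTA name.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to equal length")
    name_map = {}
    lines = [f"{len(records)} {lengths.pop()}"]
    for i, (name, seq) in enumerate(records, start=1):
        short = f"seq{i}"
        if len(short) > name_width:
            raise ValueError("too many sequences for this naming scheme")
        name_map[short] = name
        # Strict PHYLIP: name padded to a fixed width, then the sequence.
        lines.append(short.ljust(name_width) + seq)
    return "\n".join(lines) + "\n", name_map
```

Keeping `name_map` around makes it possible to translate the short names in the resulting trees back to the original identifiers.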

Phylip will return a lot of trees, which is good! Some of these will have fractional mutations.

  1. Please check in examples that every tree shown with a fractional mutation has an equivalent tree that has integer mutations.
  2. If that's true then we can just filter out the trees with fractional mutations.
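
The filtering in step 2 could look something like the sketch below. It assumes each tree has already been reduced to a list of per-branch mutation counts, which is a hypothetical representation chosen for this example (parsing the actual PHYLIP output is not shown), and the function names are made up here.

```python
def is_integer_tree(branch_mutations, tol=1e-9):
    """True if every per-branch mutation count is a whole number."""
    return all(abs(m - round(m)) <= tol for m in branch_mutations)


def filter_integer_trees(trees, tol=1e-9):
    """Keep only trees with no fractional mutation counts.

    `trees` is assumed to be a list of lists of per-branch counts.
    """
    return [tree for tree in trees if is_integer_tree(tree, tol)]
```

This only drops trees; it does not check step 1, that each dropped tree has an integer-mutation equivalent.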

wsdewitt commented 8 years ago

TasParse.py makes a PHYLIP file from the FASTA file from Luka. It duplicates the sequence with header >17 17 times, and omits the >GL germline sequence (since it was not observed). In history.bash, PHYLIP's dnapars is called to generate parsimony trees, and these are then passed to the branching process likelihood code, which filters out trees containing fractional mutations. Is there a way to constrain parsimony trees to use a specified root sequence (the germline Vh), which in this case was not observed?
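
For readers without the script at hand, the duplication step described above might be sketched as follows. This is not the code in TasParse.py; it assumes that a numeric FASTA name is the observed abundance of that sequence and that the germline record is named `GL`, and the function name and `_1`, `_2`, ... suffixes are invented for this example.

```python
def expand_by_abundance(records, germline_name="GL"):
    """Repeat each (name, seq) record according to its abundance.

    Assumes a numeric name is the abundance (e.g. "17" means 17 copies).
    Non-numeric names are kept as a single copy. The germline record is
    dropped because it was not observed.
    """
    expanded = []
    for name, seq in records:
        if name == germline_name:
            continue
        try:
            count = int(name)
        except ValueError:
            count = 1
        for copy in range(1, count + 1):
            # Suffix each copy so that names stay unique in the output.
            expanded.append((f"{name}_{copy}", seq))
    return expanded
```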