rachelss / SISRS

Site Identification from Short Read Sequences.

Parsimony-Informative Sites #40

Closed BobLiterman closed 6 years ago

BobLiterman commented 6 years ago

SISRS identifies sites that are fixed within taxa and variable among them. It also records whether each variable site is a singleton or a parsimony-informative (PI) site, writing all variable sites (including singletons) to alignment.nex and only PI sites (no singletons) to alignment_pi.nex.
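To make the singleton vs. PI distinction concrete, here is a minimal Python sketch of how one might classify alignment columns. It is not the SISRS code: the function names (`classify_site`, `split_sites`), the set of characters treated as missing, and the toy alignment are all assumptions for illustration. It uses the standard definition of a parsimony-informative site: at least two character states, each present in at least two taxa.

```python
from collections import Counter

# Assumption: these characters mark missing data; SISRS may use a different set.
MISSING = {"N", "-", "?"}

def classify_site(column):
    """Classify one alignment column (one base per taxon)."""
    counts = Counter(b.upper() for b in column if b.upper() not in MISSING)
    if len(counts) < 2:
        return "invariant"
    # States that occur in two or more taxa.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    # Variable sites that are not PI are lumped together as "singleton" here.
    return "parsimony_informative" if shared_states >= 2 else "singleton"

def split_sites(alignment):
    """Return (variable, informative) lists of 0-based site indices.

    `alignment` maps taxon name -> sequence; all sequences have equal length.
    """
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    variable, informative = [], []
    for i in range(length):
        kind = classify_site([alignment[t][i] for t in taxa])
        if kind != "invariant":
            variable.append(i)      # would go to alignment.nex
        if kind == "parsimony_informative":
            informative.append(i)   # would go to alignment_pi.nex
    return variable, informative

if __name__ == "__main__":
    # Hypothetical four-taxon toy alignment.
    toy = {
        "taxonA": "ACGTA",
        "taxonB": "ACGTC",
        "taxonC": "ATGAC",
        "taxonD": "ATGTA",
    }
    variable, informative = split_sites(toy)
    print(variable)     # [1, 3, 4]
    print(informative)  # [1, 4]
```

In the toy example, site 3 is variable only because taxonC differs from everyone else, so it counts as a singleton and appears in the variable list but not the PI list.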

Downstream of that step, however, changeMissing and loci were using the data from alignment.nex, which includes many variable sites that are singletons (50% of sites in my primate analysis, 11M/22M sites). To address this, I adjusted SISRS to use the data from alignment_pi.nex for downstream analysis (changeMissing + loci), while writing the parallel data from alignment.nex to a new sub-directory. Keeping the alignment.nex data is critical, as singleton sites do carry information about rates of evolution.

As the distinction between these datasets is explored more fully, we can think about how the data can be used, or whether the distinction is even worthwhile. Separating the data as I have done here will let us test whether including singletons has any impact on subsequent tree-building.