rachelss / SISRS

Site Identification from Short Read Sequences.

PI Sites #41

Closed BobLiterman closed 6 years ago

BobLiterman commented 6 years ago

SISRS identifies sites that are fixed within taxa and variable among taxa. It also records whether each variable site is a singleton or a parsimony-informative (PI) site, writing all among-taxa variable sites (including singletons) to alignment.nex and only PI sites (no singletons) to alignment_pi.nex.
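To make the singleton/PI distinction concrete, here is a minimal Python sketch. It is not the SISRS implementation: the function names, the set of missing-data characters, and the input format (one string of bases per site, one base per taxon) are all assumptions for illustration. It uses the standard definition of a parsimony-informative site (at least two states, each present in at least two taxa) and labels every other variable site a singleton.

```python
from collections import Counter

# Assumption: these characters mark missing data and are ignored.
MISSING = {"N", "-", "?"}


def classify_site(bases):
    """Classify one alignment column (one base per taxon).

    Returns 'invariant', 'singleton', or 'pi'. Here 'singleton' means
    any variable site that is not parsimony informative.
    """
    counts = Counter(b.upper() for b in bases if b.upper() not in MISSING)
    if len(counts) < 2:
        return "invariant"
    # A state is "shared" if at least two taxa carry it.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return "pi" if shared_states >= 2 else "singleton"


def split_sites(columns):
    """Mimic the two outputs: all variable sites, and PI sites only."""
    variable = [c for c in columns if classify_site(c) != "invariant"]
    pi_only = [c for c in variable if classify_site(c) == "pi"]
    return variable, pi_only
```

Under these assumptions, `split_sites(["AAAAT", "AATTA", "AAAAA"])` would keep the first two columns as variable and only the second as PI, mirroring the split between alignment.nex and alignment_pi.nex.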

Downstream of that step, however, changeMissing and loci use the alignment.nex data, which contain many singleton sites (50% of sites in my primate analysis: 11M of 22M). The effect of including these sites downstream is not immediately clear, so I wanted to run some tests.

To test the effects of singleton sites on different downstream analyses, I added a bit of code that outputs data from alignment_pi.nex in the same way as is already done for alignment.nex. The alignment_pi.nex data are written to a separate sub-directory, and the base SISRS code still uses the alignment.nex data to run changeMissing and loci (no change to how SISRS ran before).

I will use these data to test whether contigs containing different ratios of singleton to PI sites produce more or less accurate trees.
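One way to set up that comparison is to summarise each contig by the share of its variable sites that are PI. The sketch below is only an illustration: the function name and the input format (pairs of contig ID and site label) are hypothetical, not something SISRS produces. It reports the PI fraction rather than a singleton/PI ratio so that contigs with no PI sites do not cause a division by zero.

```python
from collections import defaultdict


def pi_fraction_by_contig(site_labels):
    """Return {contig_id: fraction of variable sites that are PI}.

    site_labels: iterable of (contig_id, label) pairs, where label is
    'pi' or 'singleton'. Any other label (e.g. 'invariant') is ignored.
    Contigs with no variable sites are left out of the result.
    """
    totals = defaultdict(lambda: {"pi": 0, "singleton": 0})
    for contig_id, label in site_labels:
        if label in ("pi", "singleton"):
            totals[contig_id][label] += 1
    return {
        contig_id: t["pi"] / (t["pi"] + t["singleton"])
        for contig_id, t in totals.items()
    }
```

Contigs could then be binned by this fraction and a tree built from each bin, though the binning scheme itself is left open here.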

Note: I had originally changed SISRS to use alignment_pi.nex by default, but I reverted that change, maintaining native SISRS behavior while merely collecting PI data on the side.