Closed rvosa closed 8 years ago
In the original data set (orthologous clusters) we still have 2500 identical pairs of sequences (within alignments), in total, 255 alignments contain sequences that are identical (original dataset: 1140 alignments, 18280 sequences).
In the replicated dataset, there are 3006 identical sequence pairs, this happens in 255 of the alignments (replicated dataset: 1105 alignments, 14726 sequences).
I will close this, as there were no more issues about the simulations raised by the reviewers.
It is possible to get negative branch lengths in simulated data because the simulated alignments are not always divergent enough:
=> leading to clade trees with very short branches (and possible rounding errors in BEAST) => leading to grossly expanded clade depths once rescaled to the backbone timescale => leading to negative branches
I am getting the impression that this is at least one of the explanations because in the real data we actually do some filtering to weed out identical haplotypes whereas in the simulations we don't. You can see in fig 5 that the average pairwise distance in the simulated alignments skews closer to zero and the proportion of invariant sites skews nearer to 100%. It would be interesting to see how many completely identical haplotypes we are still getting in real data sets compared to simulated ones.
I assume that it would be quite difficult to nudge the skewness in the right direction simply by increasing the substitution rates because that is probably going to overshoot the target pretty quickly. My current guess is that it would be easier to have as an option in the replicate command to say '--phylogenetically-informative' (-pi), which only retains informative alignments and skips over ones that have identical sequences. Presumably the behavior should then be that the simulator keeps going until the target number of alignments is produced.