Closed hyanwong closed 1 year ago
We could always add a pass to create this column after the export_recombinant_breakpoints
function is run, but it seemed cleaner to calculate everything we need at once.
Oh bugger, I forgot that simplify()
doesn't remove mutations above the root. So simplified_ts.num_mutations()
is not correct.
We can look at the mutations above the sample nodes in the simplified ts though, so easy to correct in this case, and I have done so in https://github.com/jeromekelleher/sc2ts/commit/309cbf70e747f3a0bacefdf7b79a79b5d4cc6a22
As discussed yesterday, would be simpler to consider sequence identity here via the variants function. Something like
H = ts.genotype_matrix(samples=[left_parent, right_parent]).T
sequence_diffs = np.sum(H[0] != H[1])
This adds the max of (number of mutations that separate the left parents on the forward and backward passes, number of mutations that separate the right parents on the forward and backward passes), which we agreed could provide the most stringent measure of "HMM consistency".
However, it uses repeated rounds of simplification to calculate the number of mutations between parent nodes, so is slow: it takes
export_recombinant_breakpoints
from 1.5 secs to 12 secs. Personally I'm happy with that, and don't want to spend much time optimising it any more, but @jeromekelleher might have a better method?