jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License
4 stars 3 forks source link

Add fwd_bck_parents_max_mut_dist to breakpoint table #141

Closed hyanwong closed 1 year ago

hyanwong commented 1 year ago

This adds the max of (number of mutations that separate the left parents on the forward and backward passes, number of mutations that separate the right parents on the forward and backward passes), which we agreed could provide the most stringent measure of "HMM consistency".

However, it uses repeated rounds of simplification to calculate the number of mutations between parent nodes, so is slow: it takes export_recombinant_breakpoints from 1.5 secs to 12 secs. Personally I'm happy with that, and don't want to spend much time optimising it any more, but @jeromekelleher might have a better method?

hyanwong commented 1 year ago

We could always add a pass to create this column after the export_recombinant_breakpoints function is run, but it seemed cleaner to calculate everything we need at once.

hyanwong commented 1 year ago

Oh bugger, I forgot that simplify() doesn't remove mutations above the root. So simplified_ts.num_mutations() is not correct.

hyanwong commented 1 year ago

We can look at the mutations above the sample nodes in the simplified ts though, so easy to correct in this case, and I have done so in https://github.com/jeromekelleher/sc2ts/commit/309cbf70e747f3a0bacefdf7b79a79b5d4cc6a22

jeromekelleher commented 1 year ago

As discussed yesterday, would be simpler to consider sequence identity here via the variants function. Something like

H = ts.genotype_matrix(samples=[left_parent, right_parent]).T
sequence_diffs = np.sum(H[0] != H[1])