jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License
4 stars 3 forks source link

Weirdness in reporting some recombination nodes #121

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

In the long ARG (upgma-mds-1000-md-30-mm-3-2022-06-30-recinfo-il.ts.tsz):

import tszip
import sc2ts
def report(*args, **kwargs):
    display(*ti_long.node_report(*args, **kwargs))

ts = tszip.decompress("upgma-mds-1000-md-30-mm-3-2022-06-30-recinfo-il.ts.tsz")
ti = sc2ts.TreeInfo(ts)
report(589269)

It seems to be wrongly reporting a node 560878 as one of the parents (and only showing 2 parents - I think 427216 and 560878, when the forward HMM pass actually finds 3: 427216, 582220, and 564823)

Screenshot 2023-03-09 at 00 25 53
hyanwong commented 1 year ago

My suspicion is that the HMM info metadata does not reporting the matching pattern which ends up in the final tree sequence? Is this because of post-processing of the ts after the HMM pass?

If that's true, and the post-processing can remove breakpoints, then we shouldn't report information about MRCAs on breakpoints mentioned in the node metadata but subsequently removed from the tree sequence.

hyanwong commented 1 year ago

I've just checked, and about 11% of the breakpoints reported in the metadata for the forward HMM don't have parents which agree with the parents of that node in that position in the tree sequence. https://github.com/jeromekelleher/sc2ts/pull/123 is a PR which adds this info to the breakpoints report

jeromekelleher commented 1 year ago

Riiiight. These are all 3-or-more break recombinants where the likelihood of 3 breaks and x mutations is very close to the likely hood of 2-breaks and x+3 mutations. Looks like the HMM params might have been very slightly different between the tree-building and the subsequent analysis to add the HMM metadata.

Our lives really would be much simpler if we could restrict to 1-break HMM consistent recombinants (for now)...