jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License
4 stars 3 forks source link

Node with ARG breakpoint not present in either HMM run #160

Closed hyanwong closed 1 year ago

hyanwong commented 1 year ago

Node 606551 in the long ARG has a breakpoint at position 26577 but does not have that position listed as a switch point in either of the HMM runs:

ts = tszip.decompress("../data/upgma-mds-1000-md-30-mm-3-2022-06-30-recinfo-gisaid-il.ts.tsz")
for e in np.where(ts.edges_child==606551)[0]:
    print(ts.edge(e))

print()

print({mi['direction']: mi['breakpoints'] for mi in ts.node(606551).metadata['match_info']})

Gives:

Edge(left=23202.0, right=26577.0, parent=587758, child=606551, metadata=b'', id=143794)
Edge(left=0.0, right=5386.0, parent=582625, child=606551, metadata=b'', id=148578)
Edge(left=26577.0, right=29904.0, parent=568250, child=606551, metadata=b'', id=165591)
Edge(left=5386.0, right=23202.0, parent=338208, child=606551, metadata=b'', id=400388)

{'backward': [0, 4789, 7927, 13196, 22918, 26531, 27808, 29904], 'forward': [0, 5386, 8393, 14014, 23202, 26767, 28271, 29904]}
hyanwong commented 1 year ago

Also true for node 246096 / breakpoint 21137.0

There are no nodes in the wide ARG in which the actual breakpoints do not match any in the HMM-reported metadata.

hyanwong commented 1 year ago

I guess these could be cases where the recombination node has a descendant path to another recombination nodes which does not contain any samples, so that a breakpoint from a child RE node is percolated upwards to the parent RE node. Could that happen?

jeromekelleher commented 1 year ago

It's a case where there's so much uncertainty that the forward hmm doesn't quite agree with the arg one. Probably tiny numerical difference in param values used.

I bet the arg version has more mutations

It's all very messy

hyanwong commented 1 year ago

Ah, right. I forgot that the forward params were a rerun, so are subject to numerical differences.