jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License
5 stars 3 forks source link

"Redundant" unary nodes #132

Closed hyanwong closed 1 month ago

hyanwong commented 1 year ago

In https://github.com/jeromekelleher/sc2ts/issues/130#issuecomment-1470438214 the sample_subgraph(712029) plot shows 3 inserted (i.e. non-sample) nodes which appear to be completely redundant: 673654, 709942, and 713241:

Screenshot 2023-03-15 at 19 29 34 Screenshot 2023-03-15 at 19 30 04 Screenshot 2023-03-15 at 19 30 17

I'm not sure how those come about? Is there a node insertion step I'm not understanding properly here?

jeromekelleher commented 1 year ago

Can you show the Report on one of those nodes please?

hyanwong commented 1 year ago

Sure: a few bits missing off the screenshot here, e.g. the right path to root, but I think this should have all you need?

Screenshot 2023-03-16 at 09 43 23 Screenshot 2023-03-16 at 09 43 39 Screenshot 2023-03-16 at 09 43 56
jeromekelleher commented 1 year ago

Hmm, I don't understand that, might be a bug. All non sample nodes should either be recombinants or be supported by at least one mutation. Tentatively marking as bug to be fixed in next dev push.

Might be best to simplify out for Viz purposes if they aren't supported by mutations.

I don't think there's any harm in such nodes, though, nothing to worry about.

hyanwong commented 1 year ago

Tentatively marking as bug to be fixed in next dev push.

Great, glad I picked up something there

Might be best to simplify out for Viz purposes if they aren't supported by mutations.

Yes, agree. It's not obvious to me how to simplify out specific nodes while not marking them as samples, and leaving other unary nodes in. The two ways I can imagine doing it are (a) attaching all the non-sample nodes I want to keep to an individual (could even be the same one, just for this purpose), then simplify with keep_unary_in_individuals=True, or (b) mark the non-sample nodes to keep as samples, then flip the flags back after simplification.

I guess (a) is the least likely to cause unwanted side effects.

a-ignatieva commented 1 year ago

For viz purposes, I can just remove nodes if they aren't recombination nodes and don't have any mutations right?

hyanwong commented 1 year ago

You also want to retain those that have more than one child (I think, either in a single tree or indeed in any trees?). You could get a reversion push node with no mutation above it but 2 children, I think.