Identifying PC nodes / edges added during match_samples

hyanwong commented 6 months ago

I think we might want to release both the ancestors tree sequences and the fully simplified tree sequences from any real inference that we do, so that people can match their own samples against the ancestors.

However, it's likely that as well as matching their own samples against the ancestors, they will also want to place the original samples back on. For this, we need to be able to identify which edges were added during the sample matching process, and simply re-add them.

However, I'm not sure how we can identify the sample-matched edges, given an ancestors_ts and the final ts. It's reasonably obvious to do when there aren't PC nodes (the added edges are simply the ones above the sample nodes), but once you have PC nodes, it's more difficult. In particular, we currently can't identify which PC nodes in the final tree sequence correspond to PC nodes that were added during ancestor matching, and which correspond to PC nodes added during sample matching.
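To make the easy case concrete, here is a minimal sketch of "the added edges are simply the ones above the sample nodes". It works on plain Python lists rather than a real tree sequence: `node_flags` and `edge_child` stand in for the node flags and edge child columns of the final ts, and `edges_above_samples` is a hypothetical helper name, not an existing tsinfer or tskit function. It deliberately ignores PC nodes, which is exactly where this approach stops working.

```python
NODE_IS_SAMPLE = 1  # same bit value as tskit.NODE_IS_SAMPLE


def edges_above_samples(node_flags, edge_child):
    """Return the indexes of edges whose child node is flagged as a sample."""
    return [
        edge_id
        for edge_id, child in enumerate(edge_child)
        if node_flags[child] & NODE_IS_SAMPLE
    ]


# Toy data: nodes 0 and 1 are samples, nodes 2 and 3 are ancestors.
node_flags = [NODE_IS_SAMPLE, NODE_IS_SAMPLE, 0, 0]
# Edge 0 is 2 -> 0, edge 1 is 2 -> 1, edge 2 is 3 -> 2 (parent -> child).
edge_child = [0, 1, 2]
sample_edges = edges_above_samples(node_flags, edge_child)  # edges 0 and 1
```

Any PC node inserted between a sample and its ancestor would have an edge above it that this check misses, hence the question.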

I've thought about it for a bit, and perhaps the easiest would be to add metadata to PC nodes added during sample matching, specifying the node ID they represent in the ancestors_ts. At the moment we set ancestor_data_id in the metadata of non-PC nodes in the ancestors TS. I wonder if we should set ancestor_ts_id for the PC nodes?

jeromekelleher commented 6 months ago

We should be adding richer metadata about PC nodes all right, we want to communicate back information about how and why nodes were added.

hyanwong commented 6 months ago

I'm thinking that it would be useful to add a flag either to all the nodes that have been placed in the match-ancestors phase (both ancestors and PC nodes) OR to all the nodes placed in the match-samples phase. Flipping a flag is easy and cheap, so I can't see any objection to this: do you have any preference for whether such a flag would be on the match-ancestors or the match-samples nodes @jeromekelleher ?

That way it's trivial to identify all the match-samples nodes, remove them from the (unsimplified) final TS, remove the non-inference sites, and you are left with the ancestors tree sequence, which can be used again for matching (e.g. with a different set of samples).
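Roughly what I have in mind, as a sketch only, assuming the flag goes on the match-samples nodes. `NODE_ADDED_IN_MATCH_SAMPLES` and its bit value are made up for illustration (no such constant exists in tsinfer or tskit today), and `strip_match_samples_nodes` is a hypothetical helper working on plain lists rather than real tables. It covers only the node and edge removal step, not the removal of non-inference sites.

```python
# Hypothetical flag bit: not an existing tsinfer or tskit constant.
NODE_ADDED_IN_MATCH_SAMPLES = 1 << 20


def strip_match_samples_nodes(node_flags, edges):
    """
    Drop every node carrying the hypothetical match-samples flag, plus any
    edge touching such a node. `edges` is a list of
    (left, right, parent, child) tuples. Returns the kept flags, the kept
    edges with node IDs remapped, and the old-ID -> new-ID map.
    """
    id_map = {}
    kept_flags = []
    for old_id, flags in enumerate(node_flags):
        if not flags & NODE_ADDED_IN_MATCH_SAMPLES:
            id_map[old_id] = len(kept_flags)
            kept_flags.append(flags)
    kept_edges = [
        (left, right, id_map[parent], id_map[child])
        for left, right, parent, child in edges
        if parent in id_map and child in id_map
    ]
    return kept_flags, kept_edges, id_map


# Toy data: nodes 0 and 1 are ancestors, node 2 is a PC node added during
# sample matching, node 3 is a sample (sample bit 1 plus the new flag).
flags = [0, 0, NODE_ADDED_IN_MATCH_SAMPLES, 1 | NODE_ADDED_IN_MATCH_SAMPLES]
edges = [(0, 10, 0, 1), (0, 10, 1, 2), (0, 10, 2, 3)]
kept_flags, kept_edges, id_map = strip_match_samples_nodes(flags, edges)
# Only the ancestor-to-ancestor edge (0, 10, 0, 1) survives.
```

Putting the flag on the match-ancestors nodes instead would just invert the test in the loop, so either choice supports this.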

jeromekelleher commented 6 months ago

Sure, SGTM. I think we have plenty of room in flag space, so whatever works best. More info is better I think.

tskit-dev / tsinfer, issue #916