matsengrp / larch

Inference and manipulation of history DAGs
2 stars 2 forks source link

Extendable dag #42

Closed ognian- closed 1 year ago

ognian- commented 1 year ago

Adding the ability to extend an existing DAG with additional features. Also includes various improvements in the DAG storage-view design.

ognian- commented 1 year ago

Thanks! Here are some general changes before addressing your questions:

And answers respectively:

1) Masking out some of the edges is what will be used for trimming and sampling views. Depending on how the view is expected to be used, there are multiple options with varying space/time trade-offs. In any case only the data needed for the topology itself is stored, all heavy annotations like EdgeMutations/CompactGenome etc. does not require extra space (not even a pointer). The most compact (but slow) way is to store an ordered vector of edge ids, and use binary search to check if they participate in the subset. With this approach the node and edge ids are no longer contiguous, but that shouldn't be an issue. The approach that matches the performance of a regular DAG is to also store Neighbors objects, so the only memory saving benefit is from the annotations. We also have room to experiment with fine tuning in between the two extremes, like bloom filter, perfect hash functions, etc.

2) Sorry about parameter names, I've planned to name them better, but decided to leave it for a future iteration, together with improving documentation.

DS - DAG Storage DV - DAG View NC - Nodes container EC - Edges container F - Feature Fs - Features pack Tag - Usually equal to Feature, except for the case of Deduplicate<Feature>. It is used for making Deduplicate behave as it's parameter in some places, and differently in others. CRTP - Curiously recurring template pattern - a useful type of compile-time polymorphism idiom. This lets the base class cast itself to the derived class and access data and functions that are expected to be present on deriving classes.

3) Deduplicate<Feature> can be used on nodes/edges (not on the DAG itself) for any existing Feature. It stores the unique feature data in a concurrent_ordered_set<Feature>, and on each node/edge it stores a const Feature* pointer. Stored features must be read-only, in order to preserve hashing and equality, and the underlying container ensures pointer stability when adding/removing items. Currently it is an append-only implementation, meaning that if nodes/edges are removed and a unique feature is no longer pointed to, it won't be evicted. Reference counting could be added to solve this. Deduplicate removes all FeatureMutableView functions, and allows only entirely replacing the assigned Feature by operator=(Feature&&) on the node/view. It would be a nice addition to allow Deduplicate to use external storage, so multiple DAGs can share the same pool of unique annotations.

4) I did have merging use Deduplicate seamlessly for CompactGenome and LeafSet in an earlier design iteration, and will bring it back when adding CladeUnion to LeafSet for issue #30.

willdumm commented 1 year ago

Thanks Ognian! This all looks great to me