Closed ognian- closed 1 year ago
Thanks! Here are some general changes before addressing your questions:
FeatureReader
and FeatureWriter
are replaced by FeatureConstView
and FeatureMutableView
.ExtraFeatureStorage
, which is meant for features on nodes/edges, and is an additional DAG-global storage for the feature. It's currently only used for Deduplicate
's unique storage.Neighbors
for nodes, Endpoints
for edges, and Connections
for the DAG. And answers respectively:
1) Masking out some of the edges is what will be used for trimming and sampling views. Depending on how the view is expected to be used, there are multiple options with varying space/time trade-offs. In any case only the data needed for the topology itself is stored, all heavy annotations like EdgeMutations
/CompactGenome
etc. does not require extra space (not even a pointer). The most compact (but slow) way is to store an ordered vector of edge ids, and use binary search to check if they participate in the subset. With this approach the node and edge ids are no longer contiguous, but that shouldn't be an issue. The approach that matches the performance of a regular DAG is to also store Neighbors
objects, so the only memory saving benefit is from the annotations. We also have room to experiment with fine tuning in between the two extremes, like bloom filter, perfect hash functions, etc.
2) Sorry about parameter names, I've planned to name them better, but decided to leave it for a future iteration, together with improving documentation.
DS - DAG Storage
DV - DAG View
NC - Nodes container
EC - Edges container
F - Feature
Fs - Features pack
Tag - Usually equal to Feature, except for the case of Deduplicate<Feature>
. It is used for making Deduplicate behave as it's parameter in some places, and differently in others.
CRTP - Curiously recurring template pattern - a useful type of compile-time polymorphism idiom. This lets the base class cast itself to the derived class and access data and functions that are expected to be present on deriving classes.
3) Deduplicate<Feature>
can be used on nodes/edges (not on the DAG itself) for any existing Feature
. It stores the unique feature data in a concurrent_ordered_set<Feature>
, and on each node/edge it stores a const Feature*
pointer. Stored features must be read-only, in order to preserve hashing and equality, and the underlying container ensures pointer stability when adding/removing items. Currently it is an append-only implementation, meaning that if nodes/edges are removed and a unique feature is no longer pointed to, it won't be evicted. Reference counting could be added to solve this. Deduplicate removes all FeatureMutableView
functions, and allows only entirely replacing the assigned Feature by operator=(Feature&&)
on the node/view. It would be a nice addition to allow Deduplicate to use external storage, so multiple DAGs can share the same pool of unique annotations.
4) I did have merging use Deduplicate seamlessly for CompactGenome and LeafSet in an earlier design iteration, and will bring it back when adding CladeUnion to LeafSet for issue #30.
Thanks Ognian! This all looks great to me
Adding the ability to extend an existing DAG with additional features. Also includes various improvements in the DAG storage-view design.