matsengrp / historydag

https://matsengrp.github.io/historydag
GNU General Public License v3.0
0 stars 1 forks source link

Estimate RF distance diameter #58

Open willdumm opened 1 year ago

willdumm commented 1 year ago

True RF distance diameter:

For a DAG which doesn't contain too many histories, it's possible to compute the true RF distance diameter with something like:

max(dag.optimal_rf_distance(history, optimal_func=max) for history in dag)

However, this is prohibitively slow for large DAGs. It's possible that there's a way to get a good estimate for diameter much more efficiently.

Under-estimate for diameter:

We can always get the maximum RF distance between a reference history and any other history in a DAG with

dag.optimal_rf_distance(ref_history, optimal_func=max)

This gives an under-estimate for the diameter, but possibly not a very good one depending on our choice of ref_history. I propose it may be pretty good if the reference history is a topological outlier, such as what we get from dag.trim_optimal_sum_rf_distance(dag, optimal_func=max).sample()

Over-estimate for diameter:

Since RF distance is a proper distance, the triangle inequality tells us that given any tree a in the DAG, and a tree b that's as far as possible from a in the DAG, the diameter of the set of trees in the DAG is no more than twice the distance from a to b. This is true regardless of the tree a that we start with, but which tree we choose for a will determine how close the over-estimate (twice the distance from a to b) is to the true value of diameter.

It seems that a good choice for a would be the topological median tree, because that should have generally smaller maximum distance to any other tree in the DAG. So, perhaps a good over-estimate of RF distance diameter for a dag would be

2 * dag.optimal_rf_distance(dag.trim_optimal_sum_rf_distance(dag, optimal_func=min).sample(), optimal_func=max)

Evaluating these estimates:

It would be interesting to implement these over/under-estimates, and compare them to the true diameters for a bunch of DAGs that are small enough to compute the true diameters of.