Best-first MCS search - Githubissues

mcwitt commented 3 weeks ago

Modifies the search strategy in the maximum common substructure (MCS) search (used during atom mapping) to greatly reduce the number of nodes that must be searched in certain problem instances.

Background

The current approach is based on McGregor 1982, and proceeds by depth-first search over all possible mappings A -> Optional[B].

The tree is pruned according to an upper bound on the number of edges that are required to be preserved by the mapping (in the sense that if a1 maps to b1 and a2 maps to b2, then adj(a1, a2) == adj(b1, b2)) as follows:

Initially, we search for a mapping that preserves the maximum possible number of edges, i.e. set min_num_edges = min(edges in A, edges in B). During DFS, we prune branches whose num_edges_upper_bound is less than min_num_edges.
If no mapping is found, we decrement min_num_edges by one and restart DFS.
Repeat (2) until a mapping is found.

This strategy, while efficient in cases where the number of edges preserved by an optimal mapping is close min(edges in A, edges in B), requires searching a large number of nodes when this is not the case.

Changes

This PR changes the search strategy as follows:

Removes the outer loop over decreasing min_num_edges. By itself, this change is catastrophically inefficient, since we no longer have any mechanism for pruning the search tree.
Keeps track of the leaf node with the largest number of preserved edges during the search (as in McGregor 1982); this is used to prune branches of the tree according to their upper bound. This recovers most of the performance lost in (1), but is still 2-3x slower compared with the existing approach according to small-molecule benchmarks. Note that to guarantee optimality of the solution (in the sense of number of edges preserved), we must perform an exhaustive search over all leaf nodes; only then can we determine the optimal subset.
Modifies the search ordering from depth-first to a best-first variant. We use as a heuristic the same upper bound on the number of edges preserved by the mapping as above. Note that this ordering guarantees that the first node returned is optimal; this could be exploited further (not in this PR) by incorporating more of the post-search ranking heuristics into the search heuristic, allowing termination the search when the first solution is found and potentially further performance gains.

Validation

Added a golden test to ensure that atom mappings produced by the new version are identical to those produced by the existing version (https://github.com/proteneer/timemachine/pull/1415/commits/67e363bd1efc507c2558508ffb5a4efec606c61b)
Verified that ~core sizes are unchanged~ mappings are identical on large internal repository of examples

Public FEP Benchmarks

a = master b = this branch

For each instance, mapping generated by B verified to be identical to that generated by A.

	a	b
count	1301	1301
mean	0.132628	0.116224
std	0.284466	0.127008
min	0.0223501	0.0273046
25%	0.0658681	0.0691361
50%	0.0887287	0.0941987
75%	0.11967	0.119321
max	5.44522	1.95834

mcwitt commented 3 weeks ago

Also, atom_mapping.py is not currently in the diff, but I think there may be points in this documentation that should be updated:

Updated in https://github.com/proteneer/timemachine/pull/1415/commits/11610994e827997ab78cc778d596a3fc4b19987b and https://github.com/proteneer/timemachine/pull/1415/commits/65fc2aad961a3f03645779921ce65cf2a7eff014

proteneer commented 3 weeks ago

this PR is :fire::fire::fire::fire::fire:

proteneer / timemachine

Best-first MCS search #1415

Background

Changes

Validation

Public FEP Benchmarks