proteneer / timemachine

Differentiate all the things!
Other
140 stars 17 forks source link

Best-first MCS search #1415

Closed mcwitt closed 3 weeks ago

mcwitt commented 3 weeks ago

Modifies the search strategy in the maximum common substructure (MCS) search (used during atom mapping) to greatly reduce the number of nodes that must be searched in certain problem instances.

Background

The current approach is based on McGregor 1982, and proceeds by depth-first search over all possible mappings A -> Optional[B].

The tree is pruned according to an upper bound on the number of edges that are required to be preserved by the mapping (in the sense that if a1 maps to b1 and a2 maps to b2, then adj(a1, a2) == adj(b1, b2)) as follows:

  1. Initially, we search for a mapping that preserves the maximum possible number of edges, i.e. set min_num_edges = min(edges in A, edges in B). During DFS, we prune branches whose num_edges_upper_bound is less than min_num_edges.
  2. If no mapping is found, we decrement min_num_edges by one and restart DFS.
  3. Repeat (2) until a mapping is found.

This strategy, while efficient in cases where the number of edges preserved by an optimal mapping is close min(edges in A, edges in B), requires searching a large number of nodes when this is not the case.

Changes

This PR changes the search strategy as follows:

  1. Removes the outer loop over decreasing min_num_edges. By itself, this change is catastrophically inefficient, since we no longer have any mechanism for pruning the search tree.
  2. Keeps track of the leaf node with the largest number of preserved edges during the search (as in McGregor 1982); this is used to prune branches of the tree according to their upper bound. This recovers most of the performance lost in (1), but is still 2-3x slower compared with the existing approach according to small-molecule benchmarks. Note that to guarantee optimality of the solution (in the sense of number of edges preserved), we must perform an exhaustive search over all leaf nodes; only then can we determine the optimal subset.
  3. Modifies the search ordering from depth-first to a best-first variant. We use as a heuristic the same upper bound on the number of edges preserved by the mapping as above. Note that this ordering guarantees that the first node returned is optimal; this could be exploited further (not in this PR) by incorporating more of the post-search ranking heuristics into the search heuristic, allowing termination the search when the first solution is found and potentially further performance gains.

Validation

Public FEP Benchmarks

a = master b = this branch

For each instance, mapping generated by B verified to be identical to that generated by A.

a b
count 1301 1301
mean 0.132628 0.116224
std 0.284466 0.127008
min 0.0223501 0.0273046
25% 0.0658681 0.0691361
50% 0.0887287 0.0941987
75% 0.11967 0.119321
max 5.44522 1.95834
mcwitt commented 3 weeks ago

Also, atom_mapping.py is not currently in the diff, but I think there may be points in this documentation that should be updated:

Updated in https://github.com/proteneer/timemachine/pull/1415/commits/11610994e827997ab78cc778d596a3fc4b19987b and https://github.com/proteneer/timemachine/pull/1415/commits/65fc2aad961a3f03645779921ce65cf2a7eff014

proteneer commented 3 weeks ago

this PR is :fire::fire::fire::fire::fire: