meilisearch / milli

Search engine library for Meilisearch ⚡️
MIT License

Improving the BestProximity algorithm #1

Closed Kerollmops closed 4 years ago

Kerollmops commented 4 years ago

The best proximity algorithm is a system that returns the best proximities between positions of the different query words.

How the engine works

Consider that the search engine has indexed many documents and that those documents contain the words "eminem", "rap" and "god". The documents ids are stored under each word, grouped by the positions where the word appears.

If a user searches for "eminem rap god", this is what the engine is able to retrieve from the key-value store: a list of positions associated with each query word.

In the example below, the word "eminem" appears at several positions across the documents: in the first attribute at indexes 0, 2 and 4 (i.e. positions 0, 2 and 4), and in the second attribute at indexes 0 and 1 (i.e. positions 1000 and 1001).

eminem: [0,    2,    4,       1000, 1001]
rap:    [0, 1, 2, 3,    5,                1002, 1003]
god:    [0, 1, 2, 3,       6, 1000, 1001,       1003]

These numbers are not documents ids. Once the engine has found the best proximity between these positions, it retrieves the documents associated with these words at these exact positions. For example, the word "eminem" at position 1000 can be found in documents 30, 32, 35 and 72.
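A small sketch of the encoding behind these position numbers, assuming (as in the code later in this thread) that attributes are spaced 1000 positions apart:

```rust
// A position packs an attribute number and a word index within that
// attribute: attribute = position / 1000, index = position % 1000.
fn encode_position(attr: u32, index: u32) -> u32 {
    attr * 1000 + index
}

fn decode_position(position: u32) -> (u32, u32) {
    (position / 1000, position % 1000)
}

fn main() {
    // "eminem" at index 1 of the second attribute is stored as 1001.
    assert_eq!(encode_position(1, 1), 1001);
    assert_eq!(decode_position(1001), (1, 1));
    println!("ok");
}
```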

Splitting the documents ids related to a given word by position reduces the number of documents ids to intersect and allows the engine to return the best documents for a given query (by proximity and position) on the fly.

Once these documents ids have been retrieved, the engine performs a simple intersection between them to know which documents match.

Improve the generated successors

We use a function that produces the successors of a given path between the different words positions. This successors function does not know whether a positions combination (an intersection) yields any document; we must introduce this check to avoid producing successor paths that lead to no documents.
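A minimal sketch of that check; the `docids_by_position` map stands in for the key-value store lookup and is hypothetical, not milli's actual API:

```rust
use std::collections::HashMap;

// Intersect two sorted documents ids lists.
fn intersect_sorted(lhs: &[u32], rhs: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < lhs.len() && j < rhs.len() {
        if lhs[i] < rhs[j] { i += 1; }
        else if lhs[i] > rhs[j] { j += 1; }
        else { out.push(lhs[i]); i += 1; j += 1; }
    }
    out
}

// Only yield a candidate position if intersecting its documents ids
// with the documents ids of the path so far leaves at least one document.
fn successors(
    path_docids: &[u32],                         // docids matching the path so far
    next_positions: &[u32],                      // candidate positions for the next word
    docids_by_position: &HashMap<u32, Vec<u32>>, // hypothetical store lookup
) -> Vec<(u32, Vec<u32>)> {
    next_positions.iter().filter_map(|&pos| {
        let docids = docids_by_position.get(&pos)?;
        let inter = intersect_sorted(path_docids, docids);
        // Producing this successor would be useless if no document remains.
        if inter.is_empty() { None } else { Some((pos, inter)) }
    }).collect()
}

fn main() {
    let mut store = HashMap::new();
    store.insert(0, vec![30, 32, 35]); // docids at position 0 (made-up data)
    store.insert(1, vec![40, 41]);     // docids at position 1 (made-up data)
    let succs = successors(&[32, 35, 41], &[0, 1], &store);
    // Position 0 survives (docs 32, 35); position 1 survives (doc 41).
    assert_eq!(succs, vec![(0, vec![32, 35]), (1, vec![41])]);
    println!("ok");
}
```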

Keep the graph exploration algorithm state between iterations

We execute a graph exploration algorithm (Dijkstra, A*) to retrieve the minimum proximity between the different words and positions the engine retrieved for the user query.

This algorithm returns the best path along with its computed proximity, but if this execution does not yield enough documents we execute it a second time, ignoring paths already seen in the previous iteration. Doing so means we drop the whole data structures filled during the previous iteration and re-explore the same nodes, which is wasted time. We must keep the open and closed lists between iterations.

Kerollmops commented 4 years ago

Improvement of the generated successors

I am wondering how I can express and solve a graph theory problem. I am not sure how to express it in terms of nodes and edges.

The problem

So here is my problem, I have multiple nodes, a, b and c here, that can be in different positions.

a: [0,    3, 4,           1000, 1001]
b: [1, 2, 3,    5,              1001, 1002]
c: [   2, 3, 4,    6, 10,       1001, 1002, 1003, 2000]

A node position is composed of two parts:

- the attribute it belongs to (position / 1000),
- the index of the word inside that attribute (position % 1000).

A path length can be computed by summing the proximities between consecutive node positions: within the same attribute the proximity is the index distance (capped at 8, plus 1 when the words appear in reverse order), and across different attributes it is always 8.

Examples

a[0], b[1] and c[2] gives a length of 2 because the length between a[0] and b[1] is 1 and the length between b[1] and c[2] is 1.

a[0], b[1001] and c[1002] gives a length of 9 because the length between a[0] and b[1001] is 8 and the length between b[1001] and c[1002] is 1.

a[3], b[3] and c[2] gives a length of 2 because the length between a[3] and b[3] is 0 and the length between b[3] and c[2] is 2.

Here is some Rust code to help understand:

```rust
use std::cmp;

const MAX_DISTANCE: u32 = 8;

fn index_proximity(lhs: u32, rhs: u32) -> u32 {
    if lhs <= rhs {
        cmp::min(rhs - lhs, MAX_DISTANCE)
    } else {
        cmp::min(lhs - rhs, MAX_DISTANCE) + 1
    }
}

fn positions_proximity(lhs: u32, rhs: u32) -> u32 {
    let (lhs_attr, lhs_index) = (lhs / 1000, lhs % 1000);
    let (rhs_attr, rhs_index) = (rhs / 1000, rhs % 1000);
    if lhs_attr != rhs_attr {
        MAX_DISTANCE
    } else {
        index_proximity(lhs_index, rhs_index)
    }
}
```
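As a sanity check, the three example lengths can be computed mechanically with these functions (repeated in this snippet so it compiles on its own):

```rust
use std::cmp;

const MAX_DISTANCE: u32 = 8;

fn index_proximity(lhs: u32, rhs: u32) -> u32 {
    if lhs <= rhs { cmp::min(rhs - lhs, MAX_DISTANCE) }
    else { cmp::min(lhs - rhs, MAX_DISTANCE) + 1 }
}

fn positions_proximity(lhs: u32, rhs: u32) -> u32 {
    let (lhs_attr, lhs_index) = (lhs / 1000, lhs % 1000);
    let (rhs_attr, rhs_index) = (rhs / 1000, rhs % 1000);
    if lhs_attr != rhs_attr { MAX_DISTANCE }
    else { index_proximity(lhs_index, rhs_index) }
}

// The length of a path is the sum of the proximities between
// consecutive positions.
fn path_length(path: &[u32]) -> u32 {
    path.windows(2).map(|w| positions_proximity(w[0], w[1])).sum()
}

fn main() {
    assert_eq!(path_length(&[0, 1, 2]), 2);       // a[0], b[1], c[2]
    assert_eq!(path_length(&[0, 1001, 1002]), 9); // a[0], b[1001], c[1002]
    assert_eq!(path_length(&[3, 3, 2]), 2);       // a[3], b[3], c[2]
    println!("ok");
}
```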

What I would like to retrieve

I would like to retrieve the N shortest paths the first time I call the algorithm, then be able to call it again to retrieve the next shortest paths (paths just longer than the latest returned ones).

I know that I will need to rewrite the A* algorithm myself. That way I can keep the open and closed lists to retrieve the shortest paths iteratively. Once the shortest path has been found, I return all paths with the same shortest length that are not in the closed list; I just need to create an Iterator where the open and closed lists are part of the Iterator's state. What do you think?
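A minimal sketch of that idea, assuming a uniform-cost exploration (no heuristic) over the layered positions: the open list lives inside the Iterator's state, so each call to next() resumes the search where the previous call stopped instead of re-exploring the graph.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const MAX_DISTANCE: u32 = 8;

fn positions_proximity(lhs: u32, rhs: u32) -> u32 {
    let (la, li) = (lhs / 1000, lhs % 1000);
    let (ra, ri) = (rhs / 1000, rhs % 1000);
    if la != ra { MAX_DISTANCE }
    else if li <= ri { std::cmp::min(ri - li, MAX_DISTANCE) }
    else { std::cmp::min(li - ri, MAX_DISTANCE) + 1 }
}

struct PathsIter<'a> {
    layers: &'a [&'a [u32]],
    // The open list: (Reverse(cost), path) so the heap pops cheapest first.
    open: BinaryHeap<(Reverse<u32>, Vec<u32>)>,
}

impl<'a> PathsIter<'a> {
    fn new(layers: &'a [&'a [u32]]) -> Self {
        let mut open = BinaryHeap::new();
        for &p in layers[0] { open.push((Reverse(0), vec![p])); }
        PathsIter { layers, open }
    }
}

impl<'a> Iterator for PathsIter<'a> {
    type Item = (u32, Vec<u32>);
    fn next(&mut self) -> Option<Self::Item> {
        while let Some((Reverse(cost), path)) = self.open.pop() {
            if path.len() == self.layers.len() {
                // A complete path; they come out in nondecreasing cost.
                return Some((cost, path));
            }
            // Extend the partial path with every position of the next layer.
            for &p in self.layers[path.len()] {
                let cost = cost + positions_proximity(*path.last().unwrap(), p);
                let mut path = path.clone();
                path.push(p);
                self.open.push((Reverse(cost), path));
            }
        }
        None
    }
}

fn main() {
    let layers: &[&[u32]] = &[&[0, 4, 1000], &[1, 5], &[4, 7, 1000]];
    let mut iter = PathsIter::new(layers);
    let (best_cost, best_path) = iter.next().unwrap();
    assert_eq!(best_cost, 3);
    // The next call resumes from the same open list, no re-exploration.
    let (second_cost, _) = iter.next().unwrap();
    assert!(second_cost >= best_cost);
    println!("{best_cost} {best_path:?}");
}
```

A real implementation would add a heuristic and deduplicate equal-cost paths like astar_bag does; this only shows that keeping the heap in the iterator makes the search resumable.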

How can I represent this problem?

First attempt

I first tried to represent this problem by generating the new path in the successors function. The start path was [a[0]] and its successors were the paths extended by one position of b: [a[0], b[1]], [a[0], b[2]], [a[0], b[3]], and so on.

But I am not sure it is the right way to represent the problem, as the path returned by A* is now a path of paths... and I have to retrieve only the last element of the path, which represents the solution.

Note that it kind of worked with dijkstra, but I had to call it multiple times and return false in the success function when a path had already been returned. This acted as my external closed list and was not a good solution at all: the graph was re-explored each time I ran the algorithm...

Second attempt

I saw the astar_bag and astar_bag_collect functions, which seemed to fit my needs a little better than dijkstra because they are able to return the list of all shortest paths at once. There is no need to run the algorithm a second time to retrieve the next shortest path and re-explore the graph.

I tried to make the successors function return moves instead of entire paths. I returned a struct containing the layer at which the node is located and the selected position.

So a path is only valid if the layer that its last node represents is the last one (i.e. the c layer). A node can only move to the next layer; this is how the successors function works: it outputs all possible moves from the given node position to all the positions of the next layer, along with the cost of moving to each of them.

There is just one strange thing: I can have multiple starts. Therefore my start node is an uninit node; when the successors function is called on it, it produces all the nodes of the first layer. Is this correct? I always ignore the first node of the returned paths as it is an "invalid" one.

This attempt works perfectly 🎉

Now I need to be able to keep the state of the algorithm to be able to iterate another time and find the next shortest paths.


Here is a Rust program that works for this problem:

```rust
use std::cmp;
use pathfinding::directed::astar::astar_bag;

const MAX_DISTANCE: u32 = 8;

fn index_proximity(lhs: u32, rhs: u32) -> u32 {
    if lhs <= rhs {
        cmp::min(rhs - lhs, MAX_DISTANCE)
    } else {
        cmp::min(lhs - rhs, MAX_DISTANCE) + 1
    }
}

fn positions_proximity(lhs: u32, rhs: u32) -> u32 {
    let (lhs_attr, lhs_index) = (lhs / 1000, lhs % 1000);
    let (rhs_attr, rhs_index) = (rhs / 1000, rhs % 1000);
    if lhs_attr != rhs_attr {
        MAX_DISTANCE
    } else {
        index_proximity(lhs_index, rhs_index)
    }
}

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct Node {
    init: bool,
    layer: usize,
    position: u32,
}

fn main() {
    let positions = &[
        &[0, 4, 1000][..], // eminem
        &[1, 5],           // rap
        &[4, 7, 1000],     // god
    ];

    let start = Node { init: false, layer: 0, position: 0 };

    let (solutions, cost) = astar_bag(
        &start,
        |n| if !n.init {
            positions[0].iter().map(|p| (Node { init: true, layer: 0, position: *p }, 0)).collect()
        } else if n.layer == positions.len() - 1 {
            vec![] // We reached the highest layer
        } else {
            let layer = n.layer + 1;
            positions[layer].iter().map(|p| {
                let proximity = positions_proximity(n.position, *p);
                (Node { init: true, layer, position: *p }, proximity)
            }).collect()
        },
        |_| 0,
        |n| n.layer == positions.len() - 1,
    )
    .unwrap();

    let mut solutions: Vec<_> = solutions.collect();
    solutions.sort_unstable();

    dbg!(cost, solutions);
}
```
Kerollmops commented 4 years ago

Another thing to notice is that I see much better performance for "real kylie minogue lives" (398ms) than for "minogue kylie lives real" (1267ms).

Kerollmops commented 4 years ago

Outdated: we now use a word pair proximity docids database combined with a custom depth-first search (called the mana DFS). This database stores keys of the form word1-word2-P, where P is the proximity of the two words, from 1 to 7 inclusive; the documents ids stored under such a key are those containing the two words at this exact proximity.
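A hypothetical sketch of such a key; the actual byte layout in milli may differ, this only illustrates the word1-word2-P idea:

```rust
// Build a key of the form word1-word2-P. Documents ids containing
// the two words at exactly proximity P are stored under this key.
// The 0-byte separator and trailing proximity byte are assumptions,
// not milli's actual serialization format.
fn word_pair_proximity_key(word1: &str, word2: &str, proximity: u8) -> Vec<u8> {
    assert!((1..=7).contains(&proximity));
    let mut key = Vec::with_capacity(word1.len() + word2.len() + 3);
    key.extend_from_slice(word1.as_bytes());
    key.push(0); // separator between the two words
    key.extend_from_slice(word2.as_bytes());
    key.push(0);
    key.push(proximity);
    key
}

fn main() {
    let key = word_pair_proximity_key("rap", "god", 1);
    assert_eq!(key, b"rap\0god\0\x01".to_vec());
    println!("{key:?}");
}
```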