quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
11.78k stars 651 forks source link

Random Document Sorting #1081

Open cultpony opened 3 years ago

cultpony commented 3 years ago

Is your feature request related to a problem? Please describe.

Tantivy should support random-order sorting on a field.

The goal is to provide users with a randomly sorted collection of documents in relation to a specific query. The results must paginate, so that users can request more documents if they desire. The query must be able to reseed it's randomness for each specific users (seed derivation per user ought to be application level, not on tantivy's level). This must work with a query that returns only 1 document from a given set, to provide a "visit random document" functionality.

Iterating over randomly ordered documents should be fast even for large datasets as well as deterministic. It is not required that this randomness is stable to a specific CPU architecture or version of tantivy.

Describe the solution you'd like

A new collector or query component is introduced with a constructor to set an ordering seed. If set, documents will be randomly ordered based on this seed. If the same seed is used again, the same document set is retrieved. Newly added documents are randomly inserted into the result.

A solution I think would work for tantivy would be to first obtain the count of documents matching a query, then applying a function F(i) that is bijective over [0..n), ie, AES(key=randomness_seed, input=i)%n or a cycle-walking variant:

MASK = 2^(round_up(log_2(n)) // The next largest power of 2 to the document count
loop {
  output = AES(key=randomness_seed, input=i) && MASK // Mask AES to the size of our index
  if output < n:
    // This will return after an average of 2 iterations
    return output
  else
    i = output // encrypt our output again
}

The rationale for choosing AES has no bearing in actual crypotgraphic security and merely in that most CPUs have accelerated instructions available. In fact, using the AES Quarter Round Function on x86 would be sufficient for this use case.

The output of the above solution would then be used to select a document out of the returned query set by index.

The result would be stable of a set of documents and would only require as many random computations as documents returned.

If our topmost collector is TopDocs with a limit of 100, only 100 calls to the randomization functions are necessary, it scales purely to the size of the output (O(k)), where k is the returned document count.

The above collector would be simpler than TopDocs, as it does not require to actually score it's documents and could operate on as few as k documents overall.

[Optional] describe alternatives you've considered

TopDocs can be setup with a custom scoring mechanism. However, this is insufficient as the score would have to be computed for every document in the store, irregardless of it it will end up in the final selection, which would make this result scale poorly in regards to the stored document count. That the query will return fewer documents that the store contains isn't entirely relevant, as the query may be empty, thusly requiring to compute over all documents. This makes this approach O(n), where n is the total document count.

Prior Work

I've based this blogpost on the works of byteslice, who I've talked to about this problem in general, and has released two blogposts on patreon regarding this problem.

byteslice/Random sorting using cryptography byteslice/Validating randomness

Elasticsearch implements this functionality as per "alternatives you've considered" but circumvents the O(n) issue by offloading the workload to multiple nodes, which isn't an option for tantivy.

fulmicoton commented 3 years ago

@cultpony Is it something that you personally need?

cultpony commented 3 years ago

As mentioned above, I need this for one of my projects, yes. Elasticsearch offers randomly ordered document sets and I believe that it's a feature that's not going to be too uncommon in need.

fulmicoton commented 3 years ago

(Apologies this was closed by mistake.)

fulmicoton commented 3 years ago

So method you are offering is interesting and should work on a single node, but it has some problems.

One several node, it would require too much coordination. (no necessarily an extra round, but still)

On a single node, it is still not great due to the current API. Right now we don't let the collector make two passes easily.

It has the merit of making it trivial to compose several collector.

As you mentioned your current problem can be easily solved using the custom score method. https://docs.rs/tantivy/0.15.1/tantivy/collector/struct.TopDocs.html#method.custom_score

If the performance hit is really a problem, please share more information about your project to help us understand the underlying use case and prioritize the issue correctly.

cultpony commented 3 years ago

I'm currently working on a benchmark case for my usage pattern, I'll comment back when this is complete but I cannot provide an ETA for that right now.

Otherwise, you're correct, this is only an advantage if the search is running on a single node.