Open penghuo opened 1 year ago
Could the query be a new type of boolean "maybe" query?
Yes, that makes sense!
@penghuo were you planning to build this and looking for feedback?
I want to get clarification on the following question. If the bloom filters types make sense, I can contribute.
Can the use of Bloom Filter fields optimize index size compared to using arrays? What are the performance tradeoffs between the two approaches?
Improving performance (in the context of searching large document collections). At 10 kB per document, it only takes a trillion documents to reach 10 petabytes of data! How do you query 10 PB of data in a fraction of a second?
A Bloom filter is a space-efficient probabilistic set. It is useful for checking whether a key exists in a set; it can return false positives but never false negatives. Bloom filters play a critical role in optimizing point queries and term lookups in inverted-index storage systems like Lucene.
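To make the "false positives, never false negatives" property concrete, here is a minimal sketch of a Bloom filter. The sizing parameters and the double-hashing scheme are illustrative choices, not anything proposed in this issue:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: a bit array plus k derived hash positions.
    Membership tests can return false positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int stands in for the bit array

    def _positions(self, key):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))  # True
print(bf.might_contain("carol"))  # almost certainly False (false positives are possible)
```

The filter never stores the keys themselves, which is why it is so compact and why a positive answer must be verified against the real data.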
Not sure if you have taken a look, but I believe we already use a variation of the Bloom filter, named CuckooFilter.
@penghuo do you see differences between your request and the CuckooFilter that lead you to think BloomFilter is still needed?
There is internal usage of the Cuckoo filter, but it can't be used as a field type to change the internal storage as requested by @penghuo. Correct me if I am missing something?
Can the Cuckoo filter be adapted into a field type?
I am going to wake this issue up, since we are also exploring bitmap-based semantic indexing. By some rough estimates, bitsets can produce an index about 10% the size of a regular posting list. Taking neural sparse as an example: if we binarize the term weights into bits and use bitsets as the data type, we can drop the index size from 6x the BM25 size to 60% of it. If we employ a bigger vocabulary, which brings the index size back up to 2x the BM25 size, we can roughly obtain at least a 2% increase in NDCG@10 on BEIR. This is early-stage thinking, but it deserves a PoC on both the engineering and science sides.
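One way to picture the binarization step above — a hedged sketch only, with a toy vocabulary and a simple threshold that are not part of any proposed design:

```python
def binarize(term_weights, vocab, threshold=0.0):
    """Map {term: weight} to one bit per vocabulary entry:
    bit i is set iff vocab[i]'s weight exceeds the threshold."""
    bits = 0
    for i, term in enumerate(vocab):
        if term_weights.get(term, 0.0) > threshold:
            bits |= 1 << i
    return bits

def overlap(a, b):
    # Bitwise AND plus popcount replaces a weighted dot product.
    return bin(a & b).count("1")

vocab = ["cat", "dog", "fish", "bird"]  # toy vocabulary
doc   = binarize({"cat": 0.9, "fish": 0.4}, vocab)
query = binarize({"cat": 0.7, "bird": 0.2}, vocab)
print(overlap(doc, query))  # 1 (only "cat" overlaps)
```

The trade-off is exactly the one described above: each posting shrinks to a single bit, at the cost of discarding the weight magnitudes, which a larger vocabulary would then have to compensate for.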
Description
The feature request is to support a Bloom Filter field type to optimize index storage. The use case is building skipping indexes for raw datasets on S3. For example, given 1M files on S3, each containing 1M records with a name column, a Bloom Filter index can be created on the name field for each file. This enables quick identification of which file(s) might contain the required name, followed by post-processing of the results. The result does not need to be 100% accurate, since post-processing will examine each record and determine which record(s) exactly match the required name.
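The skipping-index flow described above can be sketched as follows. The file paths, record names, and filter parameters are made up for illustration; a real implementation would persist one filter per S3 file:

```python
import hashlib

def bloom_positions(name, num_bits=256, k=3):
    # k bit positions for one key via double hashing (illustrative parameters).
    d = hashlib.sha256(name.encode()).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big")
    return [(h1 + i * h2) % num_bits for i in range(k)]

def build_filter(names):
    bits = 0
    for name in names:
        for pos in bloom_positions(name):
            bits |= 1 << pos
    return bits

def might_contain(bits, name):
    return all(bits >> pos & 1 for pos in bloom_positions(name))

# One tiny filter per S3 file (hypothetical paths and records).
files = {
    "s3://bucket/part-0000": ["alice", "bob"],
    "s3://bucket/part-0001": ["carol", "dave"],
}
filters = {path: build_filter(names) for path, names in files.items()}

# Query: which files might contain "carol"? Only those candidates need to be
# fetched and post-processed to confirm exact matches.
candidates = [p for p, bits in filters.items() if might_contain(bits, "carol")]
print(candidates)
```

Because there are no false negatives, the file that actually holds the name is always in the candidate list; the occasional false-positive file just costs one wasted scan during post-processing.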
Solution in my mind
Alternatives I've considered
The alternative solution is to use an existing array field and a terms query, but it is unclear how the index size and query performance would compare to using Bloom Filters. For example:
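The array-field alternative might look like the following, shown here as the JSON bodies built in Python. The index layout and field names are hypothetical; the terms query itself is the standard query-DSL construct:

```python
import json

# Hypothetical document: one per S3 file, with all record names in an array field.
doc = {
    "file": "s3://bucket/part-0000",
    "names": ["alice", "bob"],  # array field, one entry per record
}

# Standard terms query: match any file whose names array contains "carol".
query = {
    "query": {
        "terms": {"names": ["carol"]}
    }
}
print(json.dumps(query))
```

The open question stated above is exactly the trade-off here: the array field stores every name in the posting lists (exact answers, larger index), while a Bloom Filter field would store only a fixed-size bit array per file (approximate answers, smaller index).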
Additional context
There have been discussions about supporting bitsets as a data type and bitwise operators in Lucene and Elasticsearch, but most of the related issues are still open.
Open Questions