opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature Request] Support BloomFilter field type in OpenSearch #6913

Open penghuo opened 1 year ago

penghuo commented 1 year ago

Description

The feature request is to support the Bloom Filter field type to optimize index storage. The use case for this feature is to create skipping indexes for raw datasets on S3. For example, given 1M files on S3, each file containing 1M records with a name column, a Bloom Filter index can be created on the name field for each file. This would enable quick identification of which file(s) contain the required name, allowing for post-processing of the results. The accuracy of the result does not need to be 100%, as the post-processing will process each record and find out which record(s) match the required name exactly.
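The skipping-index idea can be sketched in a few lines of Python. This is purely illustrative, not any OpenSearch API: one Bloom filter is built per file, and a query only needs to scan the files whose filter answers "might contain" (which can include false positives, as described above).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash probes into an m-bit array."""
    def __init__(self, m_bits=1024, k_hashes=5):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, value):
        # Derive k probe positions from a single SHA-256 digest.
        digest = hashlib.sha256(value.encode()).digest()
        for i in range(self.k):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

# One filter per file: only files whose filter says "might contain"
# need their records scanned in post-processing.
files = {
    "file-1": ["bob", "alice", "david"],
    "file-2": ["bob", "alice"],
}
index = {}
for name, records in files.items():
    bf = BloomFilter()
    for record in records:
        bf.add(record)
    index[name] = bf

candidates = [f for f, bf in index.items() if bf.might_contain("david")]
# "file-1" is always a candidate; "file-2" may appear as a false positive.
```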

Solution in my mind

{
  "name": {
    "type": "bloomfilter",
    "expectedInsertions": 1000,  // https://guava.dev/releases/23.0/api/docs/com/google/common/hash/BloomFilter.html
    "fpp": 0.01
  }
}
PUT /bloomfilterindex/_doc/1
{
  "name": ["bob", "alice", "david"]
}

PUT /bloomfilterindex/_doc/2
{
  "name": ["bob", "alice"]
}
GET /_search
{
  "query": {
    "mightContain": {
      "name": ["david"]
    }
  }
}

# The query may return
{
  "hits": {
    "total": {
      "value": 2,
      "relation": "mightContain"
    },
    "max_score": 1,
    "hits": [
      {
        "_id": "1"
      },
      {
        "_id": "2"  <- false positive
      }
    ]
  }
}

Alternatives I've considered

The alternative solution is to use an existing array field and a terms query, but it is unclear what the index size and query performance would be compared to using Bloom Filters. For example:

// create index mapping
{
  "name": {
    "type": "keyword"
  }
}

// query
GET /_search
{
  "query": {
    "terms": {
      "name": ["david"]
    }
  }
}

Additional context

There have been discussions about supporting bitsets as a data type and bitwise operators in Lucene and Elasticsearch, but most of the related issues are still open.

Open Questions

  1. Can the use of Bloom Filter fields optimize index size compared to using arrays? What are the performance tradeoffs between the two approaches?
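For a rough sense of question 1, the standard Bloom filter sizing formulas give the bit count and hash count directly from the `expectedInsertions` (n) and `fpp` (p) parameters in the proposed mapping. This is back-of-the-envelope math only, not a measurement of Lucene's actual on-disk encoding:

```python
import math

def bloom_size(n, fpp):
    """Optimal Bloom filter sizing: m = -n*ln(p)/(ln 2)^2 bits,
    k = (m/n)*ln 2 hash functions."""
    m_bits = math.ceil(-n * math.log(fpp) / (math.log(2) ** 2))
    k_hashes = math.ceil(m_bits / n * math.log(2))
    return m_bits, k_hashes

# The mapping example above: expectedInsertions=1000, fpp=0.01
m_bits, k = bloom_size(1000, 0.01)
print(m_bits, k)          # 9586 bits (~1.2 KB) and 7 hash functions
print(m_bits / 8 / 1000)  # ~1.2 bytes per inserted value, independent of value length
```

Roughly 1.2 bytes per value at a 1% false-positive rate, regardless of string length, is the storage argument for the feature; a keyword array stores every term (plus posting-list overhead) exactly.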
dblock commented 1 year ago

Could the query be a new type of boolean "maybe" query?

penghuo commented 1 year ago

Could the query be a new type of boolean "maybe" query?

Yes, makes sense!

macohen commented 1 year ago

@penghuo were you planning to build this and looking for feedback?

penghuo commented 1 year ago

@penghuo were you planning to build this and looking for feedback?

I want to get clarification on the following question. If the bloom filters types make sense, I can contribute.

Can the use of Bloom Filter fields optimize index size compared to using arrays? What are the performance tradeoffs between the two approaches?

Jeevananthan-23 commented 1 year ago

The Problem

Improving performance when searching very large datasets. Taking the example above of 1M files with 1M records each, that is 10^12 records; at 10 kB per record, that's 10 petabytes of data! How do you query 10 PB of data in a fraction of a second?

The Solution

A Bloom filter is a space-efficient probabilistic set. It answers whether a key might be in the set: it can return false positives but never false negatives. It plays a critical role in optimizing point queries against posting-list terms in inverted-index storage systems like Lucene.
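The false-positive behavior behind this has a standard approximation, which also shows why the `expectedInsertions` parameter in the proposed mapping matters: a sketch, using the filter size from the mapping example above (assumed sizing, not measured from Lucene).

```python
import math

def false_positive_rate(n, m_bits, k):
    """Approximate Bloom filter FPP after n insertions: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m_bits)) ** k

# A filter sized for 1000 keys at fpp=0.01 (~9586 bits, 7 hashes)
# degrades gracefully as it fills up, and sharply once overfilled:
for n in (500, 1000, 2000):
    print(n, false_positive_rate(n, m_bits=9586, k=7))
```

At the design point (n=1000) the rate is ~1% as configured; at double the expected insertions it climbs to roughly 16%, which is why the field type needs the expected cardinality up front.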

The Motivation

Bukhtawar commented 1 year ago

Not sure if you have taken a look, but I believe we already use a variation of the Bloom filter called the CuckooFilter.

macohen commented 1 year ago

@penghuo do you see differences between your request and the CuckooFilter that lead you to think BloomFilter is still needed?

shwetathareja commented 1 year ago

There is internal usage of the Cuckoo filter, but it can't be used as a field type to change the internal storage as requested by @penghuo. Correct me if I am missing something.

macohen commented 1 year ago

Can the Cuckoo filter be adapted to a field type?
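For context on the comparison: a cuckoo filter stores small fingerprints in one of two candidate buckets, and unlike a Bloom filter it also supports deletion, which matters for a mutable field type. A minimal sketch (not OpenSearch's internal implementation; it assumes a power-of-two bucket count so the partial-key XOR trick works):

```python
import random

class CuckooFilter:
    """Minimal cuckoo filter sketch: 8-bit fingerprints, 2 candidate buckets."""
    def __init__(self, num_buckets=128, bucket_size=4, max_kicks=500):
        assert num_buckets & (num_buckets - 1) == 0, "power of two required"
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        return (hash(item) & 0xFF) or 1  # 8-bit fingerprint, never 0

    def _i1(self, item):
        return hash(("bucket", item)) & (self.num_buckets - 1)

    def _i2(self, i, fp):
        # Partial-key cuckoo hashing: XOR is its own inverse, so either
        # candidate bucket can be derived from the other plus the fingerprint.
        return (i ^ hash(("fp", fp))) & (self.num_buckets - 1)

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._i1(item)
        for i in (i1, self._i2(i1, fp)):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets full: evict a resident and relocate it.
        i = random.choice((i1, self._i2(i1, fp)))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._i2(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # table too full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._i1(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._i2(i1, fp)]

    def delete(self, item):
        # Deletion is safe only for items actually inserted; Bloom filters
        # cannot do this at all without counting variants.
        fp = self._fingerprint(item)
        i1 = self._i1(item)
        for i in (i1, self._i2(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```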

model-collapse commented 2 weeks ago

I am going to wake this issue up since we are also exploring bitmap-based semantic indexing. According to some rough estimates, bitsets can produce an index around 10% the size of a regular posting list. Taking neural sparse search as an example: if we binarize the term weights into bits and use bitsets as the datatype, we can drop the index size from 6x the BM25 index size to 60% of it. If we then employ a bigger vocabulary that brings the index size back up to 2x the BM25 size, we can roughly obtain at least a 2% increase on NDCG@10 over BEIR. This is early-stage thinking, but it deserves a PoC on both the engineering and science sides.
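The ~10% figure can be sanity-checked with back-of-the-envelope arithmetic (illustrative numbers, not a measurement of Lucene's compressed formats): a bitset costs a fixed one bit per document per term, while an uncompressed posting list costs roughly a doc-id per posting, so the bitset wins once a term appears in more than about 1/32 of documents.

```python
def bitset_bytes(num_docs):
    # One bit per document, regardless of how many documents contain the term.
    return num_docs / 8

def posting_list_bytes(num_postings, bytes_per_doc_id=4):
    # Uncompressed doc-id list (real posting lists are delta-compressed,
    # which shifts the crossover but not the overall shape).
    return num_postings * bytes_per_doc_id

num_docs = 1_000_000
for density in (0.01, 1 / 32, 0.10, 0.50):
    postings = int(num_docs * density)
    ratio = bitset_bytes(num_docs) / posting_list_bytes(postings)
    print(f"density {density:.3f}: bitset/posting-list size ratio = {ratio:.2f}")
```

At density 1/32 the two representations are the same size; for the dense postings produced by binarized term weights (density near 0.5) the bitset is ~6% of the uncompressed posting list, in the ballpark of the 10% figure above.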