quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io
Other
8.2k stars 336 forks source link

Nesting in Quickwit #5502

Open fulmicoton opened 1 week ago

fulmicoton commented 1 week ago

The problem

Right now, quickwit offers a dynamic json and object to represent deep structure. The tokens emitted are however flattened.

Taking the example of documents holding a t-shirt shopping cart:

cart_id: 234234
products:
  - product_id: 13
    model_attributes:
      color: blue
      size: L
  - product_id: 15
    model_attributes:
      color: red
      size: M
...

The inverted index will record tokens in a flattened way. We have a single document, receiving the tokens:

- model_attributes.color:blue
- model_attributes.color:red
- model_attributes.size:M
- model_attributes.size:L

The relationship between blue and L, and red and M is lost, and it is not possible to query for carts containing a blue L-size T-Shirt.

nested documents makes it possible to fix this problem.

User experience

The solution described here is similar to that of Lucene.

The user needs to explicitly define the nested field in its doc mapping:

doc_mapping:  
    field_mappings:  
    # ...
    - name: products
      type: nested
      field_mappings:
          - name: model_attributes
            type: object
            field_mappings:
                - name: color
                  type: text
                  index: true
                  tokenizer: raw
                  fast: true
                  stored: true
                - name: size
                  type: text
                  index: true
                  tokenizer: raw
                  fast: true
                  stored: true

The user can then express the query in the first section using elasticsearch/opensearch nested queries as follows:

{
  "query": {
    "nested": {
      "path": "products",
      "query": {
        "bool": {
          "must": [
            { "match": { "products.model_attributes.color": "blue" } },
             { "match": { "products.model_attributes.size": "L" } }
          ]
        }
      },
          "score_mode": "avg"
    }
  }
}

The score from the nested query will be an aggregate of the scores of the matching nested documents. The aggregate is defined in score_mode.

Implementation

Document and schema

In the example above, the key idea is that a single document will be exploded into documents with consecutive doc ids, and indexed together. (While it does not matter all that much in quickwit, it is possible to append document in batch in tantivy. The document are then guaranteed to land in the same segment contiguously).

The item documents:

# Doc ID N
products:
    product_id: 13
        model_attributes:
          color: blue
          size: L
# Doc ID N+1
products:
    product_id: 15
        model_attributes:
          color: red
          size: M

Followed by the parent document

# Doc ID N+2
cart_id: 234234

The fields will only be stored in the parent document.

In fact the existence of tantivy documents for these nested documents is an implementation detail that should never leak to the user.

An internal tantivy field called _parent, will store the parent child relation ship as follows.

A posting list associated to the term _parent:products will mark all of the parent documents.

Given a match nested doc id, finding the parent will then just be a matter of calling seek on that posting list DocSet.

Since the schema of the nested fields and the schema of the parent document (and its stored nested fields) must coexist, we need to have two distinct fields in the tantivy schema.

products will be a JSON stored field that will contain the subset of fields marked as stored in the nested documents. When cumulating nested fields, only the root one will receive this JSON document.

The nested document indexed fields will then be prepend not by _nested..

Nested query

{
  "query": {
    "nested": {
      "path": "products",
      "query": {
        "bool": {
          "must": [
            { "match": { "products.model_attributes.color": "blue" } },
             { "match": { "products.model_attributes.size": "L" } }
          ]
        }
      },
          "score_mode": "avg"
    }
  }
}

Will result in building a tantivy query that looks like

FoldToParentQuery {
   underlying: BoolQuery {
      must: vec![
          ...
      ]
      # ...
   },
   field: products
}

In spirit, to go forward, the fold to parent query will be advancing the underlying query. Seek its parent doc using the _parent.products posting list, advance the underlying query so as to go through all of the matched nested documents, and aggregate their score.

We need to confirm that this aggregation is over only the children matching the query.

Nested scoring

TBD. we might need a parent_doc_id -> child doc ids[] function. In turn, due to deep nesting, this may require introducing empty parent document.

Nested aggregation

TBD. we might need a parent_doc_id -> child doc ids[] function. In turn, due to deep nesting, this may require introducing empty parent document.

philippemnoel commented 1 week ago

Love this :) This is similar to the design we are exploring for ParadeDB