The score from the nested query will be an aggregate of the scores of the matching nested documents. The aggregate is defined in score_mode.
Implementation
Document and schema
In the example above, the key idea is that a single document will be exploded into documents with consecutive doc ids, and indexed together. (While it does not matter all that much in quickwit, it is possible to append document in batch in tantivy. The document are then guaranteed to land in the same segment contiguously).
The item documents:
# Doc ID N
products:
product_id: 13
model_attributes:
color: blue
size: L
# Doc ID N+1
products:
product_id: 15
model_attributes:
color: red
size: M
Followed by the parent document
# Doc ID N+2
cart_id: 234234
The fields will only be stored in the parent document.
In fact the existence of tantivy documents for these nested documents is an implementation detail that should never leak to the user.
An internal tantivy field called _parent, will store the parent child relation ship as follows.
A posting list associated to the term
_parent:products will mark all of the parent documents.
Given a match nested doc id, finding the parent will then just be a matter
of calling seek on that posting list DocSet.
Since the schema of the nested fields and the schema of the parent document (and its stored nested fields) must coexist, we need to have two distinct fields in the tantivy schema.
products will be a JSON stored field that will contain the subset of fields marked as stored in the nested documents.
When cumulating nested fields, only the root one will receive this JSON document.
The nested document indexed fields will then be prepend not by
_nested..
In spirit, to go forward, the fold to parent query will be advancing the underlying query.
Seek its parent doc using the _parent.products posting list, advance the underlying query so as to go through all of the matched nested documents, and aggregate their score.
We need to confirm that this aggregation is over only the children matching the query.
Nested scoring
TBD. we might need a parent_doc_id -> child doc ids[] function.
In turn, due to deep nesting, this may require introducing
empty parent document.
Nested aggregation
TBD. we might need a parent_doc_id -> child doc ids[] function.
In turn, due to deep nesting, this may require introducing
empty parent document.
The problem
Right now, quickwit offers a dynamic json and object to represent deep structure. The tokens emitted are however flattened.
Taking the example of documents holding a t-shirt shopping cart:
The inverted index will record tokens in a flattened way. We have a single document, receiving the tokens:
The relationship between blue and L, and red and M is lost, and it is not possible to query for carts containing a blue L-size T-Shirt.
nested documents makes it possible to fix this problem.
User experience
The solution described here is similar to that of Lucene.
The user needs to explicitly define the nested field in its doc mapping:
The user can then express the query in the first section using elasticsearch/opensearch nested queries as follows:
The score from the nested query will be an aggregate of the scores of the matching nested documents. The aggregate is defined in
score_mode
.Implementation
Document and schema
In the example above, the key idea is that a single document will be exploded into documents with consecutive doc ids, and indexed together. (While it does not matter all that much in quickwit, it is possible to append document in batch in tantivy. The document are then guaranteed to land in the same segment contiguously).
The item documents:
Followed by the parent document
The fields will only be stored in the parent document.
In fact the existence of tantivy documents for these nested documents is an implementation detail that should never leak to the user.
An internal tantivy field called
_parent
, will store the parent child relation ship as follows.A posting list associated to the term
_parent:products
will mark all of the parent documents.Given a match nested doc id, finding the parent will then just be a matter of calling
seek
on that posting listDocSet
.Since the schema of the nested fields and the schema of the parent document (and its stored nested fields) must coexist, we need to have two distinct fields in the tantivy schema.
products
will be a JSON stored field that will contain the subset of fields marked as stored in the nested documents. When cumulating nested fields, only the root one will receive this JSON document.The nested document indexed fields will then be prepend not by
_nested.
.Nested query
Will result in building a tantivy query that looks like
In spirit, to go forward, the fold to parent query will be advancing the underlying query. Seek its parent doc using the
_parent.products
posting list, advance the underlying query so as to go through all of the matched nested documents, and aggregate their score.We need to confirm that this aggregation is over only the children matching the query.
Nested scoring
TBD. we might need a parent_doc_id -> child doc ids[] function. In turn, due to deep nesting, this may require introducing empty parent document.
Nested aggregation
TBD. we might need a parent_doc_id -> child doc ids[] function. In turn, due to deep nesting, this may require introducing empty parent document.