PSeitz opened 1 year ago:
Can we work off `Bytes` objects rather than `String`? Upstream, we get batches of documents separated by a newline as `Bytes` from sources. Internally, we "transport" batches of concatenated documents, which are easy to split into individual documents, also represented as `Bytes`:
```rust
struct DocBatch {
    doc_buffer: Bytes,
    doc_lengths: Vec<u32>,
}
```
Working directly with `Bytes` will avoid allocations to create new `String` objects.
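A minimal sketch of splitting such a batch into per-document views without copying the payload. Since `bytes::Bytes` is an external crate, a std-only stand-in (`Arc<[u8]>` plus a range) is used here to illustrate the same zero-copy behavior that `Bytes::slice` would give; all names are illustrative, not the actual quickwit types:

```rust
use std::sync::Arc;

/// Stand-in for `bytes::Bytes`: a shared buffer plus a range into it.
/// Cloning only bumps the refcount; the payload is never copied.
#[derive(Clone)]
struct SharedSlice {
    buf: Arc<[u8]>,
    start: usize,
    end: usize,
}

impl SharedSlice {
    fn as_bytes(&self) -> &[u8] {
        &self.buf[self.start..self.end]
    }
}

struct DocBatch {
    doc_buffer: Arc<[u8]>, // concatenated documents
    doc_lengths: Vec<u32>, // length of each document in the buffer
}

impl DocBatch {
    /// Split the batch into per-document slices without copying:
    /// each slice shares ownership of the single underlying buffer.
    fn split(&self) -> Vec<SharedSlice> {
        let mut docs = Vec::with_capacity(self.doc_lengths.len());
        let mut offset = 0usize;
        for &len in &self.doc_lengths {
            docs.push(SharedSlice {
                buf: Arc::clone(&self.doc_buffer),
                start: offset,
                end: offset + len as usize,
            });
            offset += len as usize;
        }
        docs
    }
}

fn main() {
    let batch = DocBatch {
        doc_buffer: b"{\"a\":1}{\"b\":2}".to_vec().into(),
        doc_lengths: vec![7, 7],
    };
    let docs = batch.split();
    assert_eq!(docs[0].as_bytes(), b"{\"a\":1}");
    assert_eq!(docs[1].as_bytes(), b"{\"b\":2}");
    println!("{} docs", docs.len());
}
```

With the real `bytes` crate, `SharedSlice` disappears and `split` would call `Bytes::slice(offset..offset + len)` instead, with identical refcount-only semantics.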
Hum, this will ruin the in-place updates that you wanted to perform.
A possible simple path to get this work done: keep a json path index in the `SegmentWriter`. That index will allocate an unordered `u32` id to each json path we encounter. At serialization, we rebuild each `Term` by replacing the path's unordered id with the path itself. At this point, all of this work can be implemented entirely in tantivy. It would reduce the memory footprint of the indexer, at the cost of possibly making it a bit slower.
Then we could make it possible for users to pass values like `UnorderedJsonLiteral { unordered_path_id: u32, val: JsonLiteral }`.
On the quickwit side, instead of building a json object and passing it to tantivy, we could then just build our own json path unordered id dictionary and append as many fields to the `Vec` of `FieldValue` as we have leaves in our json.
Alternative: just have the two worlds coexist in tantivy. Some terms have their json path encoded as-is in the term hashmap; some, fed by quickwit, do not.
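One way the unordered-id allocation described above could look. This is a hypothetical sketch, not the actual tantivy API: `PathIdMap` and its methods are made up for illustration; ids are handed out in encounter order, and the reverse table lets serialization replace an id with its path again:

```rust
use std::collections::HashMap;

/// Hypothetical sketch: assign an unordered u32 id to each JSON path on
/// first encounter, as proposed for the SegmentWriter.
#[derive(Default)]
struct PathIdMap {
    ids: HashMap<String, u32>,
    paths: Vec<String>, // paths[id] recovers the path at serialization time
}

impl PathIdMap {
    /// Return the existing id for `path`, or allocate the next one.
    fn get_or_allocate(&mut self, path: &str) -> u32 {
        if let Some(&id) = self.ids.get(path) {
            return id;
        }
        let id = self.paths.len() as u32;
        self.ids.insert(path.to_string(), id);
        self.paths.push(path.to_string());
        id
    }

    /// At serialization, replace the unordered id in a term by its path.
    fn path(&self, id: u32) -> &str {
        &self.paths[id as usize]
    }
}

fn main() {
    let mut map = PathIdMap::default();
    assert_eq!(map.get_or_allocate("attributes.color"), 0);
    assert_eq!(map.get_or_allocate("attributes.size"), 1);
    assert_eq!(map.get_or_allocate("attributes.color"), 0); // already known
    assert_eq!(map.path(1), "attributes.size");
    println!("ok");
}
```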
I had similar thoughts, but with some differences. (Also in relation to https://github.com/quickwit-oss/quickwit/issues/3896)
Currently there are two limitations in tantivy:
Change the API for `Document` from `(Field, Value)` to `(&str, Value)`.
In dynamic mode there's a root config, additionally any JSON path can be configured. A configured path has a flag to include sub paths.
On encountering a path, the `SegmentWriter` will do a lookup.
The rest is similar, e.g. store the unordered id in the indexer term hashmap.
On serialization create a global dictionary that contains all dynamic path + value.
Non-dynamic paths can still be `Field`.
It should be possible to provide flattened paths, to reduce the number of lookups for nested paths.
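The dynamic-mode configuration described above could be sketched roughly as follows. All names (`DynamicConfig`, `PathConfig`, `resolve`) are hypothetical stand-ins, and the string labels merely mark which configuration would apply; the real config would carry indexing options instead:

```rust
/// Hypothetical sketch of dynamic mode: a root config plus per-path
/// overrides, where a configured path can opt into covering its sub paths.
struct PathConfig {
    path: String,
    include_sub_paths: bool,
}

struct DynamicConfig {
    root: &'static str, // stand-in for the root field configuration
    overrides: Vec<PathConfig>,
}

impl DynamicConfig {
    /// Lookup done by the SegmentWriter when it encounters a path:
    /// exact match first, then any override flagged to include sub paths.
    fn resolve(&self, path: &str) -> &'static str {
        for cfg in &self.overrides {
            let is_sub_path = cfg.include_sub_paths
                && path.starts_with(&cfg.path)
                && path.as_bytes().get(cfg.path.len()) == Some(&b'.');
            if path == cfg.path || is_sub_path {
                return "override";
            }
        }
        self.root
    }
}

fn main() {
    let config = DynamicConfig {
        root: "root",
        overrides: vec![PathConfig {
            path: "server".to_string(),
            include_sub_paths: true,
        }],
    };
    assert_eq!(config.resolve("server"), "override");
    assert_eq!(config.resolve("server.ip"), "override"); // covered sub path
    assert_eq!(config.resolve("user.name"), "root");     // falls back to root
    println!("ok");
}
```

Flattening configured paths up front, as suggested, would turn the linear scan here into a single hashmap lookup per encountered path.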
I don't think users can pass their own json path dictionary to the segment writer at serialization time, since commits may be triggered when the memory budget is reached, and the actual paths encountered are only known to the `SegmentWriter` (in dynamic mode).
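Since only the `SegmentWriter` knows which paths were encountered, the serialization step that turns allocation-order ids into dictionary-order ids could look roughly like this (a sketch with hypothetical names, not tantivy's actual implementation):

```rust
/// Hypothetical sketch: at serialization, sort the collected paths and map
/// each unordered id (allocation order) to an ordered id (dictionary order).
fn unordered_to_ordered(paths_by_unordered_id: &[String]) -> Vec<u32> {
    // Unordered ids, sorted by the path they refer to.
    let mut sorted: Vec<u32> = (0..paths_by_unordered_id.len() as u32).collect();
    sorted.sort_by_key(|&id| &paths_by_unordered_id[id as usize]);
    // Invert: mapping[unordered_id] = ordered_id.
    let mut mapping = vec![0u32; sorted.len()];
    for (ordered_id, &unordered_id) in sorted.iter().enumerate() {
        mapping[unordered_id as usize] = ordered_id as u32;
    }
    mapping
}

fn main() {
    // Encounter order: "b", "a", "c" -> unordered ids 0, 1, 2.
    let paths = vec!["b".to_string(), "a".to_string(), "c".to_string()];
    let mapping = unordered_to_ordered(&paths);
    // Dictionary order is "a", "b", "c", so "b" -> 1, "a" -> 0, "c" -> 2.
    assert_eq!(mapping, vec![1, 0, 2]);
    println!("ok");
}
```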
This doc outlines how an optimal pipeline for parsing JSON and passing it to tantivy could look. Currently, JSON handling costs around 20%-30% of total CPU time when ingesting data. Additional hidden costs from nested JSON and cache locality are hard to gauge.
Note: we could easily parallelize JSON parsing to increase throughput, but that CPU time could instead be saved, or be spent on higher compression.
Steps

- Avoid allocating `String`s for values: unescape strings in place, e.g. `key:"\"My Quote\""` => `key:""My Quote""00`, and parse string values as `&str` borrowed from the input buffer (similar to serde_json_borrow). As an added bonus this will increase cache locality.
  - Consider parsing floats into decimal.
- Parsing objects into a `BTreeMap` is unnecessary work, since we want the document flattened anyway. Therefore we can pre-flatten it into a `Vec<(&str, tantivy::Value)>` (related to https://github.com/quickwit-oss/tantivy/issues/2015). We'll need two groups:
  - `&str => Value`: borrowed paths (e.g. `"json.vals.blub"`, but using the tantivy format with `\0` as a separator)
  - `String => Value`: owned paths, maybe `Arc<String> => Value`, built from a JSON path `Hashmap<&[&[u8]], Arc<String>>`
- Pass the flattened values to the `Document` (https://github.com/quickwit-oss/tantivy/issues/1352).

Notes

- I'm not sure how `array<>` is handled in quickwit currently.
- `QuickwitDoc` should probably contain an `UnorderedId` from https://github.com/quickwit-oss/tantivy/issues/2015.
- Side note: indexing throughput can already be considered high performance as of now, at ~30 MB/s.
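The pre-flattening step can be sketched as follows. A toy `Json` enum stands in for a parsed (borrowed) document, leaf values are rendered as strings in place of `tantivy::Value`, and `\0` joins nested keys as in tantivy's JSON term format; everything here is illustrative:

```rust
/// Toy JSON value standing in for a parsed document.
enum Json {
    Str(String),
    Num(f64),
    Object(Vec<(String, Json)>),
}

/// Flattened leaf, roughly `(path, value)`; the real pipeline would hold
/// a `tantivy::Value` instead of a `String` on the right.
type Flat = (String, String);

/// Pre-flatten a JSON object into (path, value) pairs, joining nested
/// keys with `\0` (tantivy's JSON path separator) instead of `.`.
fn flatten(prefix: &str, json: &Json, out: &mut Vec<Flat>) {
    match json {
        Json::Object(entries) => {
            for (key, value) in entries {
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{prefix}\0{key}")
                };
                flatten(&path, value, out);
            }
        }
        Json::Str(s) => out.push((prefix.to_string(), s.clone())),
        Json::Num(n) => out.push((prefix.to_string(), n.to_string())),
    }
}

fn main() {
    // { "vals": { "blub": "x" }, "count": 2 }
    let doc = Json::Object(vec![
        (
            "vals".to_string(),
            Json::Object(vec![("blub".to_string(), Json::Str("x".to_string()))]),
        ),
        ("count".to_string(), Json::Num(2.0)),
    ]);
    let mut flat = Vec::new();
    flatten("", &doc, &mut flat);
    assert_eq!(flat[0].0, "vals\0blub");
    assert_eq!(flat.len(), 2);
    println!("ok");
}
```

A borrowing variant as described in the steps would keep `&str` keys and values pointing into the input buffer, avoiding the `String` allocations this toy version still performs.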