PSeitz opened 1 year ago:
Can we work off `Bytes` objects rather than `String`? Upstream, we get batches of documents separated by a newline as `Bytes` from sources. Internally, we "transport" batches of concatenated documents, which are easy to split into individual documents, also represented as `Bytes`:
```rust
struct DocBatch {
    doc_buffer: Bytes,
    doc_lengths: Vec<u32>,
}
```
Working directly with `Bytes` will avoid allocations to create new `String` objects.
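A minimal sketch of splitting such a batch into per-document views without copying the payload. Since `bytes::Bytes` is an external crate, a std-only stand-in (`Arc<[u8]>` plus a range) is used here to illustrate the same zero-copy behavior that `Bytes::slice` would give; all names are illustrative, not the actual quickwit types:

```rust
use std::sync::Arc;

/// Stand-in for `bytes::Bytes`: a shared buffer plus a range into it.
/// Cloning only bumps the refcount; the payload is never copied.
#[derive(Clone)]
struct SharedSlice {
    buf: Arc<[u8]>,
    start: usize,
    end: usize,
}

impl SharedSlice {
    fn as_bytes(&self) -> &[u8] {
        &self.buf[self.start..self.end]
    }
}

struct DocBatch {
    doc_buffer: Arc<[u8]>, // concatenated documents
    doc_lengths: Vec<u32>, // length of each document in the buffer
}

impl DocBatch {
    /// Split the batch into per-document slices without copying:
    /// each slice shares ownership of the single underlying buffer.
    fn split(&self) -> Vec<SharedSlice> {
        let mut docs = Vec::with_capacity(self.doc_lengths.len());
        let mut offset = 0usize;
        for &len in &self.doc_lengths {
            docs.push(SharedSlice {
                buf: Arc::clone(&self.doc_buffer),
                start: offset,
                end: offset + len as usize,
            });
            offset += len as usize;
        }
        docs
    }
}

fn main() {
    let batch = DocBatch {
        doc_buffer: b"{\"a\":1}{\"b\":2}".to_vec().into(),
        doc_lengths: vec![7, 7],
    };
    let docs = batch.split();
    assert_eq!(docs[0].as_bytes(), b"{\"a\":1}");
    assert_eq!(docs[1].as_bytes(), b"{\"b\":2}");
    println!("{} docs", docs.len());
}
```

With the real `bytes` crate, `SharedSlice` disappears and `split` would call `Bytes::slice(offset..offset + len)` instead, with identical refcount-only semantics.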
Hum, this will ruin the in-place updates that you wanted to perform.
A possible simple path to get this work done: keep a json path index in the `SegmentWriter`. That index will allocate an unordered `u32` id to each json path we encounter. At serialization, we rebuild each `Term` by replacing the path's unordered id with the path itself. At this point, all of this work can be implemented entirely in tantivy. It would reduce the memory footprint of the indexer, at the cost of possibly making it a bit slower.
Then we could make it possible for users to pass values like `UnorderedJsonLiteral { unordered_path_id: u32, val: JsonLiteral }`.
On the quickwit side, instead of building a json object and passing it to tantivy, we could then just build our own json path unordered id dictionary and append as many fields to the `Vec` of `FieldValue` as we have leaves in our json.
Alternative: just have the two worlds coexist in tantivy. Some terms have their json path encoded as-is in the term hashmap; some, fed by quickwit, do not.
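One way the unordered-id allocation described above could look. This is a hypothetical sketch, not the actual tantivy API: `PathIdMap` and its methods are made up for illustration; ids are handed out in encounter order, and the reverse table lets serialization replace an id with its path again:

```rust
use std::collections::HashMap;

/// Hypothetical sketch: assign an unordered u32 id to each JSON path on
/// first encounter, as proposed for the SegmentWriter.
#[derive(Default)]
struct PathIdMap {
    ids: HashMap<String, u32>,
    paths: Vec<String>, // paths[id] recovers the path at serialization time
}

impl PathIdMap {
    /// Return the existing id for `path`, or allocate the next one.
    fn get_or_allocate(&mut self, path: &str) -> u32 {
        if let Some(&id) = self.ids.get(path) {
            return id;
        }
        let id = self.paths.len() as u32;
        self.ids.insert(path.to_string(), id);
        self.paths.push(path.to_string());
        id
    }

    /// At serialization, replace the unordered id in a term by its path.
    fn path(&self, id: u32) -> &str {
        &self.paths[id as usize]
    }
}

fn main() {
    let mut map = PathIdMap::default();
    assert_eq!(map.get_or_allocate("attributes.color"), 0);
    assert_eq!(map.get_or_allocate("attributes.size"), 1);
    assert_eq!(map.get_or_allocate("attributes.color"), 0); // already known
    assert_eq!(map.path(1), "attributes.size");
    println!("ok");
}
```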
I had similar thoughts, but with some differences. (Also in relation to https://github.com/quickwit-oss/quickwit/issues/3896)
Currently there are two limitations in tantivy:
Change the API for `Document` from `(Field, Value)` to `(&str, Value)`.
In dynamic mode there's a root config, additionally any JSON path can be configured. A configured path has a flag to include sub paths.
On encountering a path, the `SegmentWriter` will do a lookup.
The rest is similar, e.g. store the unordered id in the indexer term hashmap.
On serialization create a global dictionary that contains all dynamic path + value.
Non-dynamic paths can still be `Field`.
It should be possible to provide flattened paths, to reduce the number of lookups for nested paths.
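The dynamic-mode configuration described above could be sketched roughly as follows. All names (`DynamicConfig`, `PathConfig`, `resolve`) are hypothetical stand-ins, and the string labels merely mark which configuration would apply; the real config would carry indexing options instead:

```rust
/// Hypothetical sketch of dynamic mode: a root config plus per-path
/// overrides, where a configured path can opt into covering its sub paths.
struct PathConfig {
    path: String,
    include_sub_paths: bool,
}

struct DynamicConfig {
    root: &'static str, // stand-in for the root field configuration
    overrides: Vec<PathConfig>,
}

impl DynamicConfig {
    /// Lookup done by the SegmentWriter when it encounters a path:
    /// exact match first, then any override flagged to include sub paths.
    fn resolve(&self, path: &str) -> &'static str {
        for cfg in &self.overrides {
            let is_sub_path = cfg.include_sub_paths
                && path.starts_with(&cfg.path)
                && path.as_bytes().get(cfg.path.len()) == Some(&b'.');
            if path == cfg.path || is_sub_path {
                return "override";
            }
        }
        self.root
    }
}

fn main() {
    let config = DynamicConfig {
        root: "root",
        overrides: vec![PathConfig {
            path: "server".to_string(),
            include_sub_paths: true,
        }],
    };
    assert_eq!(config.resolve("server"), "override");
    assert_eq!(config.resolve("server.ip"), "override"); // covered sub path
    assert_eq!(config.resolve("user.name"), "root");     // falls back to root
    println!("ok");
}
```

Flattening configured paths up front, as suggested, would turn the linear scan here into a single hashmap lookup per encountered path.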
I don't think users can pass their own json path dictionary to the segment writer at serialization time, since commits may be triggered when the memory budget is reached, and the actual paths encountered are only known to the `SegmentWriter` (in dynamic mode).
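Since only the `SegmentWriter` knows which paths were encountered, the serialization step that turns allocation-order ids into dictionary-order ids could look roughly like this (a sketch with hypothetical names, not tantivy's actual implementation):

```rust
/// Hypothetical sketch: at serialization, sort the collected paths and map
/// each unordered id (allocation order) to an ordered id (dictionary order).
fn unordered_to_ordered(paths_by_unordered_id: &[String]) -> Vec<u32> {
    // Unordered ids, sorted by the path they refer to.
    let mut sorted: Vec<u32> = (0..paths_by_unordered_id.len() as u32).collect();
    sorted.sort_by_key(|&id| &paths_by_unordered_id[id as usize]);
    // Invert: mapping[unordered_id] = ordered_id.
    let mut mapping = vec![0u32; sorted.len()];
    for (ordered_id, &unordered_id) in sorted.iter().enumerate() {
        mapping[unordered_id as usize] = ordered_id as u32;
    }
    mapping
}

fn main() {
    // Encounter order: "b", "a", "c" -> unordered ids 0, 1, 2.
    let paths = vec!["b".to_string(), "a".to_string(), "c".to_string()];
    let mapping = unordered_to_ordered(&paths);
    // Dictionary order is "a", "b", "c", so "b" -> 1, "a" -> 0, "c" -> 2.
    assert_eq!(mapping, vec![1, 0, 2]);
    println!("ok");
}
```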
This doc outlines how an optimal pipeline for parsing JSON and passing it to tantivy could look. Currently, JSON handling costs around 20%-30% of total CPU time when ingesting data. Additional hidden costs from nested JSON and cache locality are hard to gauge.
Note: we could easily parallelize JSON parsing to increase throughput, but that CPU time could instead be saved, or be spent on higher compression.
Steps

- Avoid allocating `String`s for values: unescape strings in place, e.g. `key:"\"My Quote\""` => `key:""My Quote""00`, and parse string values as `&str` borrowed from the input buffer (similar to serde_json_borrow). As an added bonus this will increase cache locality.
  - Consider parsing floats into decimal.
- Parsing objects into a `BTreeMap` is unnecessary work, since we want the document flattened anyway. Therefore we can pre-flatten it into a `Vec<(&str, tantivy::Value)>` (related to https://github.com/quickwit-oss/tantivy/issues/2015). We'll need two groups:
  - `&str => Value`: borrowed paths (e.g. `"json.vals.blub"`, but using the tantivy format with `\0` as a separator)
  - `String => Value`: owned paths, maybe `Arc<String> => Value`, built from a JSON path `Hashmap<&[&[u8]], Arc<String>>`
- Pass the flattened values to the `Document` (https://github.com/quickwit-oss/tantivy/issues/1352).

Notes

- I'm not sure how `array<>` is handled in quickwit currently.
- `QuickwitDoc` should probably contain an `UnorderedId` from https://github.com/quickwit-oss/tantivy/issues/2015.
- Side note: indexing throughput can already be considered high performance as of now, at ~30 MB/s.
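The pre-flattening step can be sketched as follows. A toy `Json` enum stands in for a parsed (borrowed) document, leaf values are rendered as strings in place of `tantivy::Value`, and `\0` joins nested keys as in tantivy's JSON term format; everything here is illustrative:

```rust
/// Toy JSON value standing in for a parsed document.
enum Json {
    Str(String),
    Num(f64),
    Object(Vec<(String, Json)>),
}

/// Flattened leaf, roughly `(path, value)`; the real pipeline would hold
/// a `tantivy::Value` instead of a `String` on the right.
type Flat = (String, String);

/// Pre-flatten a JSON object into (path, value) pairs, joining nested
/// keys with `\0` (tantivy's JSON path separator) instead of `.`.
fn flatten(prefix: &str, json: &Json, out: &mut Vec<Flat>) {
    match json {
        Json::Object(entries) => {
            for (key, value) in entries {
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{prefix}\0{key}")
                };
                flatten(&path, value, out);
            }
        }
        Json::Str(s) => out.push((prefix.to_string(), s.clone())),
        Json::Num(n) => out.push((prefix.to_string(), n.to_string())),
    }
}

fn main() {
    // { "vals": { "blub": "x" }, "count": 2 }
    let doc = Json::Object(vec![
        (
            "vals".to_string(),
            Json::Object(vec![("blub".to_string(), Json::Str("x".to_string()))]),
        ),
        ("count".to_string(), Json::Num(2.0)),
    ]);
    let mut flat = Vec::new();
    flatten("", &doc, &mut flat);
    assert_eq!(flat[0].0, "vals\0blub");
    assert_eq!(flat.len(), 2);
    println!("ok");
}
```

A borrowing variant as described in the steps would keep `&str` keys and values pointing into the input buffer, avoiding the `String` allocations this toy version still performs.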