quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

OOM when searching many indexes #4855

Closed fulmicoton closed 2 weeks ago

fulmicoton commented 6 months ago

Right now, when we query several indexes at the same time, root search looks like this:

Cicada has 1000 indexes and can experience OOM. The current understanding is that receiving the 1000 * (number of searchers) leaf search responses and merging them is what causes the OOM.

We need to find a way to do one or more of the following:

1) Make sure we don't OOM and instead return an explicit error.
2) Curb the memory usage by changing (A): in the presence of multiple indexes, we should prefer to assign the splits targeting the same index to the same searcher. In the case of Cicada, however, there are only 3 searchers, so this reduces the burden by at most a factor of 3.
3) Curb the memory usage by sending one leaf request covering potentially several indexes to the leaf nodes, and have them take care of the burden of merging.
4) Curb the memory usage by improving the way we do merges in the root, i.e. stream the leaf responses instead of materializing them in memory (see the sketch below).
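To make option 4 concrete, here is a minimal standalone sketch of streaming the merge at the root, assuming the futures and anyhow crates; LeafResponse, merge_into and streaming_merge are placeholders for illustration, not Quickwit's actual types:

use futures::stream::{FuturesUnordered, StreamExt};

/// Placeholder for Quickwit's leaf search response (partial hits, aggregations, ...).
#[derive(Default)]
struct LeafResponse {
    num_hits: u64,
}

/// Placeholder merge: fold one leaf response into the running accumulator,
/// keeping only the merged top-k / aggregation state.
fn merge_into(acc: &mut LeafResponse, resp: LeafResponse) {
    acc.num_hits += resp.num_hits;
}

/// Instead of materializing every leaf response (try_join_all) before merging,
/// merge each response into a single accumulator as soon as it arrives.
async fn streaming_merge<F>(leaf_futures: Vec<F>) -> anyhow::Result<LeafResponse>
where
    F: std::future::Future<Output = anyhow::Result<LeafResponse>>,
{
    let mut pending: FuturesUnordered<F> = leaf_futures.into_iter().collect();
    let mut merged = LeafResponse::default();
    while let Some(result) = pending.next().await {
        merge_into(&mut merged, result?);
    }
    Ok(merged)
}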

@fmassot can give more information on how to reproduce.

PSeitz commented 6 months ago

Analysis - Request dispatch

The flamegraph shows the first peak, which happens when dispatching the requests, before the collection. It's the second request, so caches should be filled. The request spans 1000 indices, each with 100 splits (it's a * search, so I'm not sure whether we search all 100 splits); the splits hold only a few KB of data. This is on a single Quickwit node.

[flamegraph: flame_qw_1000idx]

- 120 MB in jobs_to_leaf_requests, probably mainly caused by the serialized doc_mapper in LeafSearchRequest.
- 525 MB creating the futures in the try_join_all.
- 661 MB for push, which is probably pushing the futures onto leaf_request_tasks.

The futures seem to cost an unreasonable amount of memory, which is probably related to how the data is captured in async fn leaf_search.

PSeitz commented 6 months ago

async fn leaf_search is indeed really big, at 4 KB per instance. This cost is exacerbated by the main problem: we currently create num_indexes * num_splits leaf search requests instead of num_indexes leaf requests. https://github.com/quickwit-oss/quickwit/pull/4880 fixes that.
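For illustration only (not Quickwit code), a standalone sketch of how the size of an async fn future can be measured and why captured state dominates it; big_future and child are made-up names:

// An async fn's future stores its arguments and everything held across .await,
// so a large captured buffer inflates every pending instance.
async fn child(data: &[u8]) -> usize {
    data.len()
}

async fn big_future(buf: [u8; 4096]) -> usize {
    // `buf` lives inside the generated future, making it larger than 4 KiB.
    child(&buf).await + buf.len()
}

fn main() {
    let fut = big_future([0u8; 4096]);
    // Prints a size of at least 4096 bytes: the buffer is stored inline in the future.
    println!("future size: {} bytes", std::mem::size_of_val(&fut));
}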

PSeitz commented 6 months ago

https://github.com/quickwit-oss/quickwit/pull/4880 and https://github.com/quickwit-oss/quickwit/pull/4922 fix issues that caused us to emit a leaf request for each split instead of for each index on multi-index queries (e.g. 1000 indices * 100 splits = 100,000 requests).

For scenarios with a large number of indices, it would be beneficial to have leaf requests that span multiple indices (option 3 in the first post).

fulmicoton commented 6 months ago

Do leaf requests actually care about the index? If I recall correctly, we ship the doc mapper's JSON. Could we send one request per doc mapper?
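As a rough sketch of the "one request per doc mapper" idea (hypothetical types; the doc mapper is assumed to travel as its serialized JSON string), splits could be grouped by that string before building leaf requests. As the following comments note, index_uri would still need to be carried per group.

use std::collections::HashMap;

/// Placeholder for whatever identifies a split job at the root (split id, footer offsets, ...).
struct SplitJob {
    split_id: String,
}

/// Group split jobs by their serialized doc mapper, yielding one group
/// (i.e. one candidate leaf request) per distinct doc mapper.
fn group_by_doc_mapper(jobs: Vec<(String, SplitJob)>) -> HashMap<String, Vec<SplitJob>> {
    let mut groups: HashMap<String, Vec<SplitJob>> = HashMap::new();
    for (doc_mapper_json, job) in jobs {
        groups.entry(doc_mapper_json).or_default().push(job);
    }
    groups
}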

PSeitz commented 6 months ago

Yes, the leaf requests are currently index specific; they use e.g. the storage and split_offsets.

The request currently has index-specific fields at its root level: split_offsets, doc_mapper, and index_uri.

pub struct LeafSearchRequest {
    /// Search request. This is a perfect copy of the original search request,
    /// that was sent to root apart from the start_offset & max_hits params.
    #[prost(message, optional, tag = "1")]
    pub search_request: ::core::option::Option<SearchRequest>,
    /// Index split ids to apply the query on.
    /// These ids are resolved from the index_uri defined in the search_request.
    #[prost(message, repeated, tag = "4")]
    pub split_offsets: ::prost::alloc::vec::Vec<SplitIdAndFooterOffsets>,
    /// `DocMapper` as json serialized trait.
    #[prost(string, tag = "5")]
    pub doc_mapper: ::prost::alloc::string::String,
    /// Index URI. The index URI defines the location of the storage that contains the
    /// split files.
    #[prost(string, tag = "6")]
    pub index_uri: ::prost::alloc::string::String,
}

MultiLeafRequest could include some deduplication like this:


pub struct MultiLeafRequest {
    /// It may be necessary to also have a Vec<SearchRequest> if the request becomes index specific.
    search_request: Option<SearchRequest>,
    /// The leaf requests.
    leaf_requests: Vec<LeafRequestRef>,
    /// All unique doc_mappers.
    doc_mappers: Vec<String>,
    /// All unique index URIs.
    index_uris: Vec<String>,
}

pub struct LeafRequestRef {
    /// Ordinal in MultiLeafRequest.doc_mappers.
    doc_mapper_ord: usize,
    /// Ordinal in MultiLeafRequest.index_uris.
    index_uri_ord: usize,
    split_offsets: Vec<SplitIdAndFooterOffsets>,
}
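
For illustration, a standalone sketch of the deduplication idea (the intern helper is hypothetical, not from any Quickwit PR): each distinct doc_mapper or index_uri string is stored once and referenced by ordinal.

use std::collections::HashMap;

/// Returns the ordinal of `value` in `values`, appending it if it has not been seen yet.
fn intern(value: &str, values: &mut Vec<String>, ords: &mut HashMap<String, usize>) -> usize {
    if let Some(&ord) = ords.get(value) {
        return ord;
    }
    let ord = values.len();
    values.push(value.to_string());
    ords.insert(value.to_string(), ord);
    ord
}

fn main() {
    let mut doc_mappers = Vec::new();
    let mut ords = HashMap::new();
    // Indexes sharing the same doc mapper end up as one stored string plus small
    // ordinals, instead of one copy of the serialized doc mapper per index.
    let a = intern(r#"{"field_mappings":[]}"#, &mut doc_mappers, &mut ords);
    let b = intern(r#"{"field_mappings":[]}"#, &mut doc_mappers, &mut ords);
    assert_eq!(a, b);
    assert_eq!(doc_mappers.len(), 1);
}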

One request per node would also allow us to respect the aggregation limits or other limits we may add. It would also give us more control, e.g. early exiting.

fulmicoton commented 6 months ago

I don't think split_offsets is really index specific. split_offsets and search_request contain the index_id but I assume it is never used by leaves, is it?

doc_mapper is the thing I was suggesting to group by. index_uri is the painful one.

Something like the following could work maybe?

pub struct LeafSearchRequest {
    /// Search request. This is a perfect copy of the original search request,
    /// that was sent to root apart from the start_offset & max_hits params.
    #[prost(message, optional, tag = "1")]
    pub search_request: ::core::option::Option<SearchRequest>,
    /// Index split ids to apply the query on.
    /// These ids are resolved from the index_uri defined in the search_request.
    #[prost(message, repeated, tag = "4")]
    pub split_offsets: ::prost::alloc::vec::Vec<SplitUriAndFooterOffsets>,
    /// `DocMapper` as json serialized trait.
    #[prost(string, tag = "5")]
    pub doc_mapper: ::prost::alloc::string::String,
}

PSeitz commented 6 months ago

I don't think split_offsets is really index specific. split_offsets and search_request contain the index_id but I assume it is never used by leaves, is it?

split_offsets are only valid within their index_uri; apart from that, there is no dependency on the index.

search_request only contains the index_id_patterns from the user request, which are not used in the leaves, but the index_id from split_offsets is used by the leaves for the top-k optimization.

doc_mapper is the thing I was suggesting to group by. index_uri is the painful one.

Something like the following could work maybe?

pub struct LeafSearchRequest {
    /// Search request. This is a perfect copy of the original search request,
    /// that was sent to root apart from the start_offset & max_hits params.
    #[prost(message, optional, tag = "1")]
    pub search_request: ::core::option::Option<SearchRequest>,
    /// Index split ids to apply the query on.
    /// These ids are resolved from the index_uri defined in the search_request.
    #[prost(message, repeated, tag = "4")]
    pub split_offsets: ::prost::alloc::vec::Vec<SplitUriAndFooterOffsets>,
    /// `DocMapper` as json serialized trait.
    #[prost(string, tag = "5")]
    pub doc_mapper: ::prost::alloc::string::String,
}

Yes, that would work, but SplitUriAndFooterOffsets would need to include both split_id and split_uri. The request looks a little bit nicer, but I think the behavior of the system is easier to predict with one request per node.
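
A hypothetical shape for that message (only split_id and split_uri come from the discussion above; the footer fields are illustrative, not the actual proto):

pub struct SplitUriAndFooterOffsets {
    /// Split id, as carried by SplitIdAndFooterOffsets today.
    pub split_id: String,
    /// Storage URI of the split, replacing the request-level index_uri.
    pub split_uri: String,
    /// Footer byte range of the split file (illustrative field names).
    pub split_footer_start: u64,
    pub split_footer_end: u64,
}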

fulmicoton commented 6 months ago

I think the behavior of the system is easier to predict with one request per node

Ok let's go with one request per node!

PSeitz commented 2 weeks ago

Multi-index leaf requests: https://github.com/quickwit-oss/quickwit/pull/4962