The flamegraph is of the first peak, when dispatching the requests but before the collection. It's the second request, so caches should be filled. The request spans 1000 indices, each with 100 splits (it's a "*" search, not sure if we search all 100 splits), and the splits hold only a few KB of data. It's on a single Quickwit node.
120 MB in jobs_to_leaf_requests, probably mainly caused by the serialized doc_mapper in LeafSearchRequest.
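(Rough arithmetic, assuming the serialized doc mapper is around 1 KB: with the pre-fix behavior of one request per split, 1000 indices * 100 splits = 100,000 requests, each carrying its own copy of the doc mapper, which is on the order of 100 MB.)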
525 MB creating futures in the try_join_all. 661 MB for push, which is probably pushing the futures onto leaf_request_tasks. The futures seem to cost an unreasonable amount of memory, which is probably related to how the data is captured in async fn leaf_search.
async fn leaf_search is indeed really big, with 4 KB per instance. This cost is exacerbated by the main problem: we currently create num_indices * num_splits leaf search requests instead of num_indices leaf requests. https://github.com/quickwit-oss/quickwit/pull/4880 fixes that.
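As a side note on the per-future cost: the compiler keeps everything that lives across an .await in the generated state machine, so the size of one future instance can be checked directly. A minimal, hypothetical example (not Quickwit code):

```rust
// Hypothetical example: an async fn that captures large arguments by value,
// roughly the way async fn leaf_search captures its request data.
async fn leaf_search_like(doc_mapper_json: String, split_ids: Vec<String>) -> usize {
    // An .await point, so the captured values are held in the future's state.
    std::future::ready(()).await;
    doc_mapper_json.len() + split_ids.len()
}

fn main() {
    // Calling an async fn only constructs the state machine; nothing runs yet.
    let future = leaf_search_like(String::new(), Vec::new());
    // Size in bytes of a single instance of the generated future.
    println!("future size: {} bytes", std::mem::size_of_val(&future));
}
```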
https://github.com/quickwit-oss/quickwit/pull/4880 and https://github.com/quickwit-oss/quickwit/pull/4922 fix issues that caused a leaf request to be emitted for each split instead of for each index on multi-index queries (e.g. 1000 indices * 100 splits = 100,000 requests).
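For illustration, the per-index grouping looks roughly like this; the types and the function name are simplified stand-ins, not the actual code from those PRs:

```rust
use std::collections::HashMap;

// Simplified, hypothetical stand-ins for the real Quickwit types.
struct SplitIdAndFooterOffsets {
    split_id: String,
}

struct SearchJob {
    index_uid: String,
    split: SplitIdAndFooterOffsets,
}

struct LeafRequest {
    index_uid: String,
    split_offsets: Vec<SplitIdAndFooterOffsets>,
}

/// Build one leaf request per index instead of one per split.
fn group_jobs_per_index(jobs: Vec<SearchJob>) -> Vec<LeafRequest> {
    let mut per_index: HashMap<String, Vec<SplitIdAndFooterOffsets>> = HashMap::new();
    for job in jobs {
        per_index.entry(job.index_uid).or_default().push(job.split);
    }
    per_index
        .into_iter()
        .map(|(index_uid, split_offsets)| LeafRequest { index_uid, split_offsets })
        .collect()
}
```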
For scenarios with a high number of indices, it would be beneficial to have leaf requests that span multiple indices (option 3 in the first post).
Do leaf requests actually care about the index? If I recall correctly we ship the doc mapper's JSON. Could we send one request per doc mapper?
Yes, the leaf requests are currently index specific; they use e.g. the storage and split_offsets. There are currently index-specific fields at the root level of the request: split_offsets, doc_mapper and index_uri.
pub struct LeafSearchRequest {
/// Search request. This is a perfect copy of the original search request,
/// that was sent to root apart from the start_offset & max_hits params.
#[prost(message, optional, tag = "1")]
pub search_request: ::core::option::Option<SearchRequest>,
/// Index split ids to apply the query on.
/// This ids are resolved from the index_uri defined in the search_request.
#[prost(message, repeated, tag = "4")]
pub split_offsets: ::prost::alloc::vec::Vec<SplitIdAndFooterOffsets>,
/// `DocMapper` as json serialized trait.
#[prost(string, tag = "5")]
pub doc_mapper: ::prost::alloc::string::String,
/// Index URI. The index URI defines the location of the storage that contains the
/// split files.
#[prost(string, tag = "6")]
pub index_uri: ::prost::alloc::string::String,
}
MultiLeafRequest could include some deduplication like this:
struct MultiLeafRequest {
    /// May need to become a Vec<SearchRequest> if the request becomes index specific.
    search_request: Option<SearchRequest>,
    /// The per-index leaf requests.
    leaf_requests: Vec<LeafRequestRef>,
    /// All unique doc_mappers.
    doc_mappers: Vec<String>,
    /// All unique index URIs.
    index_uris: Vec<String>,
}

struct LeafRequestRef {
    /// Ordinal in MultiLeafRequest::doc_mappers.
    doc_mapper_ord: usize,
    split_offsets: Vec<SplitIdAndFooterOffsets>,
    /// Ordinal in MultiLeafRequest::index_uris.
    index_uri_ord: usize,
}
One request per node would also allow us to respect the aggregation limits or other limits we may add. It would also give us more control, e.g. early exiting.
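As a sketch of how the root could assemble such a deduplicated request (simplified stand-in types, not the actual Quickwit code):

```rust
// Simplified stand-ins for the real types; a sketch only.
struct SplitIdAndFooterOffsets {
    split_id: String,
}

struct LeafRequestRef {
    doc_mapper_ord: usize,
    index_uri_ord: usize,
    split_offsets: Vec<SplitIdAndFooterOffsets>,
}

#[derive(Default)]
struct MultiLeafRequest {
    doc_mappers: Vec<String>,
    index_uris: Vec<String>,
    leaf_requests: Vec<LeafRequestRef>,
}

/// Returns the ordinal of `value` in `table`, appending it if it is not present yet.
fn intern(table: &mut Vec<String>, value: &str) -> usize {
    if let Some(ord) = table.iter().position(|existing| existing.as_str() == value) {
        return ord;
    }
    table.push(value.to_string());
    table.len() - 1
}

/// Build one deduplicated request per node from per-index (doc_mapper, index_uri, splits) triples.
fn build_multi_leaf_request(
    per_index: Vec<(String, String, Vec<SplitIdAndFooterOffsets>)>,
) -> MultiLeafRequest {
    let mut request = MultiLeafRequest::default();
    for (doc_mapper_json, index_uri, split_offsets) in per_index {
        let doc_mapper_ord = intern(&mut request.doc_mappers, &doc_mapper_json);
        let index_uri_ord = intern(&mut request.index_uris, &index_uri);
        request.leaf_requests.push(LeafRequestRef { doc_mapper_ord, index_uri_ord, split_offsets });
    }
    request
}
```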
I don't think split_offsets is really index specific. split_offsets and search_request contain the index_id, but I assume it is never used by the leaves, is it? doc_mapper is the thing I was suggesting to group by. index_uri is the painful one.
Something like the following could work maybe?
pub struct LeafSearchRequest {
/// Search request. This is a perfect copy of the original search request,
/// that was sent to root apart from the start_offset & max_hits params.
#[prost(message, optional, tag = "1")]
pub search_request: ::core::option::Option<SearchRequest>,
/// Index split ids to apply the query on.
/// This ids are resolved from the index_uri defined in the search_request.
#[prost(message, repeated, tag = "4")]
pub split_offsets: ::prost::alloc::vec::Vec<SplitUriAndFooterOffsets>,
/// `DocMapper` as json serialized trait.
#[prost(string, tag = "5")]
pub doc_mapper: ::prost::alloc::string::String,
}
> I don't think split_offsets is really index specific. split_offsets and search_request contain the index_id but I assume it is never used by leaves, is it?

split_offsets are only valid within their index_uri; apart from that there is no dependency on the index. search_request only contains the index_id_patterns from the user request and is not used in the leaves, but the index_id from split_offsets is used by the leaves for the top-k optimization.

> doc_mapper is the thing I was suggesting to group by. index_uri is the painful one. Something like the following could work maybe? [the LeafSearchRequest sketch above]
Yes, that would work, but SplitUriAndFooterOffsets would need to include both split_id and split_uri.
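For concreteness, such a message could look roughly like this; the field set and names are assumptions, not an existing Quickwit type:

```rust
// A sketch only; the footer offset fields are assumed to mirror SplitIdAndFooterOffsets.
pub struct SplitUriAndFooterOffsets {
    /// Split ID, still needed by the leaves.
    pub split_id: String,
    /// Full URI of the split file, replacing the request-level index_uri.
    pub split_uri: String,
    /// Footer offsets of the split.
    pub split_footer_start: u64,
    pub split_footer_end: u64,
}
```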
The request looks a little bit nicer, but I think the behavior of the system is easier to predict with one request per node.
> I think the behavior of the system is easier to predict with one request per node

Ok, let's go with one request per node!
Multi-index leaf requests: https://github.com/quickwit-oss/quickwit/pull/4962
Right now, when we query several indexes at the same time, root search emits one leaf request per (leaf node, index) pair.
Cicada has 1000 indexes and can experience OOM. The current understanding is that the reception of the 1000 * number_of_searchers leaf search responses and their merging is what causes the OOM.

We need to find a way to do one or more of the following:
1) Make sure we don't OOM and instead return an explicit error.
2) Curb the memory usage by changing the way splits are assigned to searchers: in the presence of multiple indexes, we should prefer to assign the splits targeting the same index to the same searcher. In the case of Cicada however, there are only 3 searchers, so this will only reduce the burden by at most a factor of 3.
3) Curb the memory usage by sending one leaf request containing potentially several indexes to the leaf nodes, and have them take care of the burden of merging.
4) Curb the memory usage by improving the way we do merges in the root. In other words, stream them instead of materializing the responses in memory (see the sketch below).
@fmassot can give more information on how to reproduce.
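A minimal sketch of option 4, assuming simplified types and folding responses as they arrive instead of collecting them all first (uses the futures crate; not the actual root-search code):

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Hypothetical, simplified stand-in for the real LeafSearchResponse.
#[derive(Default)]
struct LeafSearchResponse {
    num_hits: u64,
}

/// Merge leaf responses incrementally as they complete, so only one full
/// response needs to be materialized at a time instead of 1000 * num_searchers.
async fn merge_streaming<F>(leaf_futures: Vec<F>) -> LeafSearchResponse
where
    F: std::future::Future<Output = LeafSearchResponse>,
{
    let mut pending: FuturesUnordered<F> = leaf_futures.into_iter().collect();
    let mut merged = LeafSearchResponse::default();
    while let Some(response) = pending.next().await {
        // Fold the response into the running aggregate and drop it right away.
        merged.num_hits += response.num_hits;
    }
    merged
}
```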