quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Indexing performance drop #5028

Open fulmicoton opened 4 months ago

fulmicoton commented 4 months ago

We apparently have a performance regression between [benchmark comparison screenshot omitted], as spotted by @fmassot.

fulmicoton commented 4 months ago

Tantivy went from

-tantivy = { git = "https://github.com/quickwit-oss/tantivy/", rev = "6181c1e", default-features = false, features = [
+tantivy = { git = "https://github.com/quickwit-oss/tantivy/", rev = "92b5526", default-features = false, features = [

RaphaelMarinier commented 4 months ago

Note that the CPU time did not change (screenshot from 2024-05-24 10-58-05). So the regression is probably due to some reduction in computation parallelism, not the computations themselves taking more CPU time.

fulmicoton commented 4 months ago

Not necessarily tantivy then

PSeitz commented 4 months ago

The drop is caused by this PR: "Fix the ingest rate displayed in the CLI" (https://github.com/quickwit-oss/quickwit/pull/4682)

https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1574&search_metric=engine_duration (the name for 894188f19 should be before_fix_ingest_rate)

fulmicoton commented 4 months ago

:) I didn't see that coming :D

fulmicoton commented 4 months ago

@PSeitz This PR seems very safe BUT... it actually DOES something on the server side of ingest v1.

It introduced a change (hopefully a bugfix) in the code of our rate estimator. The rate estimator itself is used in a strange RateModulator.

The idea is this: Quickwit needs, one way or another, some backpressure mechanism. It was judged at the time that sending back a 429 status code could be a problem for clients.

To soften this (it does not really avoid it, but well), when Quickwit sees the queue memory getting close to its limit, it smoothly time::sleeps on the server side before returning a 200. If the memory limit is actually reached, however, we do return a 429.
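For context, here is a minimal sketch of that kind of sleep-based backpressure. The thresholds (0.8 cutoff, 500 ms max delay) and the names are made up for the example; this is not the actual RateModulator code.

```rust
use std::time::Duration;

/// Hypothetical outcome of one ingest request under sleep-based backpressure.
enum IngestDecision {
    /// Accept immediately (HTTP 200).
    Accept,
    /// Accept, but sleep first to slow the client down (still HTTP 200).
    AcceptAfter(Duration),
    /// Reject because the memory limit is reached (HTTP 429).
    Reject,
}

/// Maps the queue-memory usage ratio (0.0 = empty, 1.0 = at the limit)
/// to a decision. The curve below is purely illustrative.
fn backpressure_decision(memory_usage_ratio: f64) -> IngestDecision {
    if memory_usage_ratio >= 1.0 {
        IngestDecision::Reject
    } else if memory_usage_ratio >= 0.8 {
        // Grow the delay smoothly as memory pressure approaches the limit.
        let pressure = (memory_usage_ratio - 0.8) / 0.2; // 0.0..1.0
        let delay_ms = (pressure * 500.0) as u64;
        IngestDecision::AcceptAfter(Duration::from_millis(delay_ms))
    } else {
        IngestDecision::Accept
    }
}

#[tokio::main]
async fn main() {
    for ratio in [0.1, 0.85, 0.95, 1.2] {
        match backpressure_decision(ratio) {
            IngestDecision::Accept => println!("ratio {ratio}: 200 immediately"),
            IngestDecision::AcceptAfter(delay) => {
                // The server-side sleep is what slows ingestion down.
                tokio::time::sleep(delay).await;
                println!("ratio {ratio}: 200 after sleeping {delay:?}");
            }
            IngestDecision::Reject => println!("ratio {ratio}: 429 Too Many Requests"),
        }
    }
}
```

The point is that the server keeps answering 200; it just answers more and more slowly as memory pressure rises, and only rejects outright at the hard limit.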

We can try removing this logic and see if it works (the Quickwit client retries after 500ms upon receiving a 429 anyway). If this fixes the bug, then we can dig deeper and see if:

Can you rerun the bench without the rate modulation layer? You will find it in the start_ingest_client_if_needed function.

PSeitz commented 4 months ago

Disabling the rate modulator fixes the performance issue: https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1581&search_metric=engine_duration

The first call to get the memory usage in the rate modulator returns 5 (5x the max capacity), which then causes the ingestion to be slower. I tried setting memory_usage_ratio to a fixed 0.1, but it is still slightly slower than before. https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1581,1584&search_metric=engine_duration

[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 5.448237061500549e-8
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.004656488075852394
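For illustration only, one possible guard would be to clamp the estimator output to [0.0, 1.0] before it drives the delay, so a spurious first sample (like the "5x the max capacity" reading above) cannot push the modulator straight into its slowest regime. The function name and the clamp are assumptions for this sketch, not the actual fix or the RateModulator code.

```rust
/// Illustrative guard: keep the memory usage ratio in a sane range before
/// using it for backpressure decisions.
fn sanitize_ratio(raw_ratio: f64) -> f64 {
    raw_ratio.clamp(0.0, 1.0)
}

fn main() {
    // Values loosely inspired by the debug output above; 5.0 stands in for
    // the bogus "5x max capacity" first reading.
    for raw in [5.448237061500549e-8, 0.0023282133042812347, 5.0] {
        println!("raw = {raw:e} -> clamped = {}", sanitize_ratio(raw));
    }
}
```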