quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Merge timeout errors #5149

Closed. fredsig closed this issue 1 week ago.

fredsig commented 2 weeks ago

Hi there!

I've started testing Quickwit, running the latest Helm chart in EKS with v0.8.2: 1 indexer pod with a locally attached EBS volume for /qwdata (gp3, 120G), indexing at 3.20 MB/s / 3k docs/s into a test index (config copied from the default otel logs index, with retention settings added). The storage backend is S3 and the metastore is PostgreSQL. Searching and indexing work fine as far as I can tell, but I'm observing the following errors every 10 minutes. They seem to have started after the index reached a certain size:

2024-06-22 14:50:42.396 2024-06-22T13:50:42.396Z  WARN quickwit_indexing::actors::merge_planner: Rebuilding the known split ids set ended up not halving its size. Please report. This is likely a bug, please report. known_split_ids_len_after=11341 known_split_ids_len_before=11341
2024-06-22 14:50:58.105 2024-06-22T13:50:58.105Z ERROR quickwit_actors::actor_handle: actor-timeout actor="MergePackager-summer-VvmM"
2024-06-22 14:50:58.105 2024-06-22T13:50:58.105Z ERROR quickwit_indexing::actors::merge_pipeline: Merge pipeline failure. pipeline_id=IndexingPipelineId { node_id: "quickwit-indexer-0", index_uid: IndexUid { index_id: "test-otel_logs", incarnation_id: Ulid(2078132869030325623795064384611517789) }, source_id: "_ingest-api-source", pipeline_uid: Pipeline(01J0ZVPH67RSNGAPH5J85JWKJC) } generation=15 healthy_actors=["MergePlanner-spring-KchS", "MergeSplitDownloader-red-o2dM", "MergeExecutor-restless-56TZ", "MergeUploader-still-IqII", "MergePublisher-wild-3BWt"] failed_or_unhealthy_actors=["MergePackager-summer-VvmM"] success_actors=[]
2024-06-22 14:51:21.069 2024-06-22T13:51:21.068Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=MergePackager-summer-VvmM exit_status=DownstreamClosed
2024-06-22 14:59:27.607 2024-06-22T13:59:27.607Z  WARN quickwit_indexing::actors::merge_planner: Rebuilding the known split ids set ended up not halving its size. Please report. This is likely a bug, please report. known_split_ids_len_after=11443 known_split_ids_len_before=11443
2024-06-22 14:59:51.873 2024-06-22T13:59:51.873Z ERROR quickwit_actors::actor_handle: actor-timeout actor="MergePackager-blue-3z6p"
2024-06-22 14:59:51.873 2024-06-22T13:59:51.873Z ERROR quickwit_indexing::actors::merge_pipeline: Merge pipeline failure. pipeline_id=IndexingPipelineId { node_id: "quickwit-indexer-0", index_uid: IndexUid { index_id: "test-otel_logs", incarnation_id: Ulid(2078132869030325623795064384611517789) }, source_id: "_ingest-api-source", pipeline_uid: Pipeline(01J0ZVPH67RSNGAPH5J85JWKJC) } generation=16 healthy_actors=["MergePlanner-spring-KchS", "MergeSplitDownloader-hidden-4YI6", "MergeExecutor-polished-ye68", "MergeUploader-late-jmqJ", "MergePublisher-holy-bhvO"] failed_or_unhealthy_actors=["MergePackager-blue-3z6p"] success_actors=[]
2024-06-22 15:00:05.670 2024-06-22T14:00:05.670Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=MergePackager-blue-3z6p exit_status=DownstreamClosed

Is this expected (and should it be a WARNING instead), or is there something I've missed in the configuration?

The above was gathered with a != "INFO" filter in my query, so here's the complete sequence of logs:

qw-logs.txt

Here are some metrics (from a 1-hour view):

Screenshot 2024-06-22 at 16 00 07

Metastore p95 latency looks very fast; I see no noticeable spikes.

Some k8s metrics for the quickwit namespace (indexer pod has been configured with 4G RAM, no OOM issues):

Screenshot 2024-06-22 at 15 35 07

EBS volume stats (indexer /qwdata):

Screenshot 2024-06-22 at 15 37 07

Index config: test-otel_logs.json
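
For context, the retention settings I added are along these lines (illustrative values only, not my exact config; see the retention policy docs for the accepted fields):

retention:
  period: 7 days
  schedule: daily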

My Helm chart config section:

# Quickwit configuration
config:
  default_index_root_uri: s3://prod-xxx-quickwit/indexes
  postgres:
    host: xxx.com
    port: 5432
    database: metastore
    user: user
    password: xxx
  metastore:
    postgres:
      min_connections: 10
      max_connections: 50
      acquire_connection_timeout: 30s
      idle_connection_timeout: 1h
      max_connection_lifetime: 1d
  storage:
    s3:
      region: us-east-1
  # Indexer settings
  indexer:
    enable_otlp_endpoint: true
fredsig commented 1 week ago

Just realised that I wasn't running the janitor component (so no splits were being deleted). Once I enabled it, the janitor started catching up with the high number of splits to be deleted, and so far the errors are gone.
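
For anyone hitting the same issue, enabling it in the Helm values is roughly this (a sketch; the exact keys may differ between chart versions):

janitor:
  enabled: true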

fulmicoton commented 1 week ago

Thanks for letting us know. Let's close this then. Looking quickly at your charts, I am a tiny bit worried by the pending merge operations chart. It seems to be OK, but you do not have much margin. If you see the pyramid pattern failing to reach 0 periodically, you might want to tweak the merge_concurrency parameter.
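
For reference, that would look something like this in the Helm config section (a sketch only; the value is illustrative, and if I am not mistaken the setting lives under the indexer section of the node config):

config:
  indexer:
    merge_concurrency: 3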