searcher/indexer join/leave the cluster frequently

mingshun commented 11 months ago

Describe the bug I deployed Quickwit on EKS. The HTTP probes of the indexers and searchers fail with status code 503. Both the indexers and the searchers repeatedly output many log lines like the following:

WARN quickwit_serve: Metastore service is unavailable. metastore_uri=grpc://metastore.service.cluster error=The metastore service is unavailable.

The metastore outputs many log lines like the following:

2023-12-21T11:59:06.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T11:59:07.636Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T11:59:13.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T11:59:14.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T11:59:15.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T11:59:26.649Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T11:59:55.637Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T11:59:57.637Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:00:46.634Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:00:49.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T12:00:50.638Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T12:01:06.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-0` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-0
2023-12-21T12:01:06.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:01:26.637Z  INFO quickwit_cluster::change: Node `quickwit-searcher-0` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-0
2023-12-21T12:01:27.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:01:28.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:01:31.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:01:34.635Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:02:33.633Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:02:34.633Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:03:15.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:03:18.634Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:03:26.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:03:27.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1

It seems the searchers and indexers are joining and leaving the cluster frequently. The Quickwit search REST API is not working. I don't know whether the ingest API is working, because I cannot get data from the search API.
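To quantify how often each node flaps, the membership lines in the metastore log can be tallied per node. The following is a minimal sketch, not part of the original report: the function name `count_membership_events` and the regular expression are illustrative assumptions that only match the `quickwit_cluster::change` line format shown above, and the sample is the first six lines of that log.

```python
import re
from collections import Counter

# Matches the `quickwit_cluster::change` lines quoted in the metastore log.
# The pattern is an assumption based only on the lines shown in this report.
CHANGE_RE = re.compile(
    r"^(?P<ts>\S+)\s+INFO quickwit_cluster::change: "
    r"Node `(?P<node>[^`]+)` has (?P<event>left|joined) the cluster\."
)


def count_membership_events(lines):
    """Return {node_id: Counter} with 'left' and 'joined' counts per node."""
    events = {}
    for line in lines:
        match = CHANGE_RE.match(line.strip())
        if match is None:
            continue  # Skip lines that are not membership changes.
        events.setdefault(match["node"], Counter())[match["event"]] += 1
    return events


# First six lines of the metastore log from the report.
sample = [
    "2023-12-21T11:59:06.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1",
    "2023-12-21T11:59:07.636Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1",
    "2023-12-21T11:59:13.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-2",
    "2023-12-21T11:59:14.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-2",
    "2023-12-21T11:59:15.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2",
    "2023-12-21T11:59:26.649Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2",
]

for node, counts in sorted(count_membership_events(sample).items()):
    print(node, dict(counts))
```

Feeding the full metastore log through the same function gives a per-node flap count, which makes it easier to see whether all nodes are affected or only a few.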

Steps to reproduce (if applicable) I have no idea how to reproduce it.

Expected behavior How to fix it?

Configuration: Please provide:

Output of quickwit --version

Quickwit v0.6.5 (5cf786d 2023-12-11T13:37:05Z)

The index_config.yaml
```
version: 0.6

index_id: my-index

doc_mapping:
  mode: strict
  field_mappings:
    - name: trace_id
      type: text
      fast: true
    - name: trace_state
      type: text
      indexed: false
    - name: service_name
      type: text
      tokenizer: raw
    - name: resource_attributes
      type: json
      tokenizer: raw
    - name: resource_dropped_attributes_count
      type: u64
      indexed: false
    - name: scope_name
      type: text
      indexed: true
    - name: scope_version
      type: text
      indexed: false
    - name: scope_attributes
      type: json
      indexed: false
    - name: scope_dropped_attributes_count
      type: u64
      indexed: false
    - name: span_id
      type: text
      tokenizer: raw
    - name: parent_span_id
      type: text
      fast: true
      tokenizer: raw
    - name: span_kind
      type: u64
    - name: span_name
      type: text
      tokenizer: raw
    - name: span_start_timestamp_nanos
      type: datetime
      input_formats: [unix_timestamp]
      output_format: unix_timestamp_nanos
      indexed: false
      fast: true
      precision: milliseconds
    - name: span_end_timestamp_nanos
      type: datetime
      input_formats: [unix_timestamp]
      output_format: unix_timestamp_nanos
      indexed: false
      fast: false
    - name: span_duration_millis
      type: u64
      indexed: false
      fast: true
      stored: false
    - name: span_attributes
      type: json
      tokenizer: en_stem
      record: position
    - name: span_dropped_attributes_count
      type: u64
      indexed: false
    - name: span_dropped_events_count
      type: u64
      indexed: false
    - name: span_dropped_links_count
      type: u64
      indexed: false
    - name: span_status
      type: json
      indexed: true
    - name: events
      type: array
      tokenizer: raw
    - name: event_names
      type: array
      tokenizer: default
      record: position
      stored: false
    - name: links
      type: array
      tokenizer: raw

  timestamp_field: span_start_timestamp_nanos

  partition_key: hash_mod(service_name, 100)

  tag_fields: [service_name]

indexing_settings:
  commit_timeout_secs: 5

search_settings:
  default_search_fields: []

retention:
  period: 5 days
  schedule: daily
```

guilload commented 11 months ago

The logs show that the pods running indexers and searchers cannot connect to your metastore. As a result, they never reach the ready state, and eventually, Kube restarts them. This cycle is repeating indefinitely. This is a symptom. If we fix the inability of the indexers and searchers to connect to the metastore, it will disappear.

Did you use the Quickwit Helm chart?

fmassot commented 11 months ago

@mingshun can you share the logs of the metastore pod?

mingshun commented 11 months ago

@fmassot metastore logs: quickwit.log

fmassot commented 11 months ago

Thanks, the metastore seems to be fine. Can you share the output of kubectl get pods? Did you try to restart the metastore or delete/create the cluster?

mingshun commented 11 months ago

@fmassot

NAME                                                 READY   STATUS      RESTARTS   AGE
quickwit-control-plane-8d58cfb78-bj958               1/1     Running     0          27h
quickwit-indexer-0                                   0/1     Running     0          26h
quickwit-indexer-1                                   0/1     Running     0          26h
quickwit-indexer-2                                   0/1     Running     0          26h
quickwit-janitor-6cf74f8dbf-r4jv8                    1/1     Running     0          27h
quickwit-metastore-75fbd6775-pltvv                   1/1     Running     0          27h
quickwit-searcher-0                                  0/1     Running     0          27h
quickwit-searcher-1                                  1/1     Running     0          27h
quickwit-searcher-2                                  1/1     Running     0          27h

searcher-2 is ready now. But indexers and searcher-0 are still unavailable.

I tried to restart all components. At the beginning, they joined the cluster. After several minutes, some of them left.

I have checked the network connectivity and found no problems.

fmassot commented 11 months ago

Ok. Can you share the state of the cluster of quickwit-indexer-0, quickwit-searcher-0, quickwit-searcher-1 and quickwit-metastore-75fbd6775-pltvv? The state is given by the endpoint http://quickwit-indexer-0:7280/api/v1/cluster
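For collecting these states in one pass, the request above can be scripted. This is a minimal sketch under stated assumptions: the helper names `cluster_state_url`, `fetch_cluster_state`, and `dump_cluster_states` are hypothetical, port 7280 and the `/api/v1/cluster` path are taken from the URL above, the response is assumed to be JSON, and the host names are assumed to be resolvable from where the script runs (for example, from inside the cluster).

```python
import json
import urllib.request


def cluster_state_url(host, port=7280):
    """Build the cluster-state URL for one node (port and path from the thread)."""
    return f"http://{host}:{port}/api/v1/cluster"


def fetch_cluster_state(host, port=7280, timeout=5.0):
    """Fetch one node's view of the cluster; assumes the body is JSON."""
    url = cluster_state_url(host, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def dump_cluster_states(hosts):
    """Print each node's cluster state, or the error if the node is unreachable."""
    for host in hosts:
        try:
            state = fetch_cluster_state(host)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses.
            print(f"{host}: request failed: {exc}")
            continue
        print(f"--- {host} ---")
        print(json.dumps(state, indent=2, sort_keys=True))
```

Calling `dump_cluster_states(["quickwit-indexer-0", "quickwit-searcher-0", "quickwit-searcher-1"])` prints one block per node. Comparing the blocks side by side shows whether the nodes disagree about cluster membership.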

Additional question: are you using a file backed metastore or a PostgreSQL metastore?

mingshun commented 11 months ago

I restarted the searchers and indexers yesterday. The current pod status is:

NAME                                                 READY   STATUS      RESTARTS      AGE
quickwit-control-plane-8d58cfb78-bj958               1/1     Running     0             43h
quickwit-indexer-0                                   0/1     Running     0             15h
quickwit-indexer-1                                   1/1     Running     1 (15h ago)   15h
quickwit-indexer-2                                   0/1     Running     0             15h
quickwit-janitor-6cf74f8dbf-r4jv8                    1/1     Running     0             43h
quickwit-metastore-75fbd6775-pltvv                   1/1     Running     0             43h
quickwit-searcher-0                                  1/1     Running     0             15h
quickwit-searcher-1                                  1/1     Running     0             15h
quickwit-searcher-2                                  1/1     Running     0             15h

indexer-0.txt indexer-1.txt metastore.txt searcher-0.txt searcher-1.txt

Using PostgreSQL metastore.

fmassot commented 11 months ago

Thanks for sharing this. I will take a close look next week. If you need to fix that in the meantime, I would delete the cluster, wait for all nodes to shut down, and recreate the cluster. This will wipe out the chitchat state.

mingshun commented 10 months ago

I deleted the cluster and then recreated a new one 4 days ago. The cluster works fine so far.

fmassot commented 10 months ago

Ok, thanks for reporting that. One possible cause is chitchat, the gossip library that Quickwit uses to form a cluster. We recently fixed a few issues in it, and 0.6.5 does not include those fixes. I will try to reproduce the problem with many disconnections between nodes.

quickwit-oss / quickwit

searcher/indexer join/leave the cluster frequently #4313
