quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

searcher/indexer join/leave the cluster frequently #4313

Open mingshun opened 11 months ago

mingshun commented 11 months ago

Describe the bug I deployed Quickwit on EKS. The HTTP probes of the indexers and searchers fail with status code 503. Both the indexers and the searchers repeatedly output many log lines like the following:

WARN quickwit_serve: Metastore service is unavailable. metastore_uri=grpc://metastore.service.cluster error=The metastore service is unavailable.

The metastore outputs many log lines like the following:

2023-12-21T11:59:06.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T11:59:07.636Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T11:59:13.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T11:59:14.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T11:59:15.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T11:59:26.649Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T11:59:55.637Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T11:59:57.637Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:00:46.634Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:00:49.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T12:00:50.638Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T12:01:06.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-0` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-0
2023-12-21T12:01:06.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:01:26.637Z  INFO quickwit_cluster::change: Node `quickwit-searcher-0` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-0
2023-12-21T12:01:27.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:01:28.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:01:31.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:01:34.635Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:02:33.633Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:02:34.633Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:03:15.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:03:18.634Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T12:03:26.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T12:03:27.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1

It seems the searchers and indexers are joining and leaving the cluster frequently. The Quickwit search REST API is not working. I don't know whether the ingest API is working, because I cannot get any data from the search API.
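To put a number on "frequently", the membership lines above can be tallied per node. The following is a minimal sketch, assuming the log format stays exactly as in the excerpt; the function name `count_membership_events` and the `SAMPLE` constant are illustrative and not part of Quickwit.

```python
import re
from collections import Counter

# Matches the membership-change lines emitted by quickwit_cluster::change,
# as they appear in the metastore log excerpt above.
LINE_RE = re.compile(r"Node `(?P<node>[^`]+)` has (?P<event>joined|left) the cluster")


def count_membership_events(log_text):
    """Return a dict mapping node_id to a Counter of 'joined'/'left' events."""
    counts = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            counts.setdefault(match["node"], Counter())[match["event"]] += 1
    return counts


# First six lines of the excerpt above, copied verbatim.
SAMPLE = """\
2023-12-21T11:59:06.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T11:59:07.636Z  INFO quickwit_cluster::change: Node `quickwit-indexer-1` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-1
2023-12-21T11:59:13.635Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has left the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T11:59:14.634Z  INFO quickwit_cluster::change: Node `quickwit-indexer-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-indexer-2
2023-12-21T11:59:15.636Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has left the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
2023-12-21T11:59:26.649Z  INFO quickwit_cluster::change: Node `quickwit-searcher-2` has joined the cluster. cluster_id=quickwit node_id=quickwit-searcher-2
"""

if __name__ == "__main__":
    for node, events in sorted(count_membership_events(SAMPLE).items()):
        print(f"{node}: left={events['left']} joined={events['joined']}")
```

Running the same function over the full metastore log would show which nodes flap most often, which may help when comparing against pod readiness.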

Steps to reproduce (if applicable) I do not know how to reproduce it.

Expected behavior How to fix it?

Configuration: Please provide:

  1. Output of quickwit --version
    Quickwit v0.6.5 (5cf786d 2023-12-11T13:37:05Z)
  2. The index_config.yaml

    version: 0.6

    index_id: my-index

    doc_mapping:
      mode: strict
      field_mappings:

    indexing_settings:
      commit_timeout_secs: 5

    search_settings:
      default_search_fields: []

    retention:
      period: 5 days
      schedule: daily

guilload commented 11 months ago

The logs show that the pods running indexers and searchers cannot connect to your metastore. As a result, they never reach the ready state, and eventually, Kube restarts them. This cycle is repeating indefinitely. This is a symptom. If we fix the inability of the indexers and searchers to connect to the metastore, it will disappear.

Did you use the Quickwit Helm chart?

fmassot commented 11 months ago

@mingshun can you share the logs of the metastore pod?

mingshun commented 11 months ago

@fmassot metastore logs: quickwit.log

fmassot commented 11 months ago

Thanks, the metastore seems to be fine. Can you share the output of kubectl get pods? Did you try to restart the metastore or delete/create the cluster?

mingshun commented 11 months ago

@fmassot

NAME                                                 READY   STATUS      RESTARTS   AGE
quickwit-control-plane-8d58cfb78-bj958               1/1     Running     0          27h
quickwit-indexer-0                                   0/1     Running     0          26h
quickwit-indexer-1                                   0/1     Running     0          26h
quickwit-indexer-2                                   0/1     Running     0          26h
quickwit-janitor-6cf74f8dbf-r4jv8                    1/1     Running     0          27h
quickwit-metastore-75fbd6775-pltvv                   1/1     Running     0          27h
quickwit-searcher-0                                  0/1     Running     0          27h
quickwit-searcher-1                                  1/1     Running     0          27h
quickwit-searcher-2                                  1/1     Running     0          27h

searcher-2 is ready now. But indexers and searcher-0 are still unavailable.

I tried to restart all components. At the beginning, they joined the cluster. After several minutes, some of them left.

I have checked the network connectivity and found no problems.
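The READY column in the pod listing above can also be checked mechanically. This is a small sketch that assumes the default `kubectl get pods` column layout (NAME first, READY second); the helper name `not_ready_pods` is made up for illustration.

```python
def not_ready_pods(kubectl_output):
    """Return the names of pods whose READY column is not n/n."""
    names = []
    # Skip the header row; NAME and READY are the first two columns.
    for line in kubectl_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 2:
            continue
        name, ready = fields[0], fields[1]
        ready_count, _, total = ready.partition("/")
        if ready_count != total:
            names.append(name)
    return names


# Listing copied from the comment above.
LISTING = """\
NAME                                                 READY   STATUS      RESTARTS   AGE
quickwit-control-plane-8d58cfb78-bj958               1/1     Running     0          27h
quickwit-indexer-0                                   0/1     Running     0          26h
quickwit-indexer-1                                   0/1     Running     0          26h
quickwit-indexer-2                                   0/1     Running     0          26h
quickwit-janitor-6cf74f8dbf-r4jv8                    1/1     Running     0          27h
quickwit-metastore-75fbd6775-pltvv                   1/1     Running     0          27h
quickwit-searcher-0                                  0/1     Running     0          27h
quickwit-searcher-1                                  1/1     Running     0          27h
quickwit-searcher-2                                  1/1     Running     0          27h
"""

if __name__ == "__main__":
    for pod in not_ready_pods(LISTING):
        print(pod)
```

Note that every pod reports STATUS Running with 0 restarts, so only the READY column distinguishes the affected nodes here.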

fmassot commented 11 months ago

Ok. Can you share the state of the cluster of quickwit-indexer-0, quickwit-searcher-0, quickwit-searcher-1 and quickwit-metastore-75fbd6775-pltvv? The state is given by the endpoint http://quickwit-indexer-0:7280/api/v1/cluster
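For anyone collecting the same information, here is a minimal sketch that fetches the cluster state from several nodes using the endpoint quoted above. It assumes each host name resolves from wherever the script runs (for example from inside the Kubernetes cluster, or through a port-forward) and that the endpoint returns JSON; the function names are illustrative, and no particular response schema is assumed.

```python
import json
import urllib.request

REST_PORT = 7280  # port taken from the URL quoted above


def cluster_state_url(host, port=REST_PORT):
    """Build the cluster-state URL for one node."""
    return f"http://{host}:{port}/api/v1/cluster"


def fetch_cluster_state(host, timeout=5.0):
    """Fetch and decode the cluster state reported by one node."""
    with urllib.request.urlopen(cluster_state_url(host), timeout=timeout) as resp:
        return json.load(resp)


def dump_states(hosts):
    """Print each node's view of the cluster, or the error if it is unreachable."""
    for host in hosts:
        try:
            state = fetch_cluster_state(host)
        except (OSError, ValueError) as exc:  # network failure or non-JSON body
            print(f"{host}: failed to fetch cluster state: {exc}")
        else:
            print(f"--- {host} ---")
            print(json.dumps(state, indent=2))
```

Calling `dump_states(["quickwit-indexer-0", "quickwit-searcher-0", "quickwit-searcher-1"])` would print each node's view side by side, which makes it easier to spot nodes that disagree about cluster membership.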

Additional question: are you using a file backed metastore or a PostgreSQL metastore?

mingshun commented 11 months ago

I restarted the searchers and indexers yesterday. The current pod status is:

NAME                                                 READY   STATUS      RESTARTS      AGE
quickwit-control-plane-8d58cfb78-bj958               1/1     Running     0             43h
quickwit-indexer-0                                   0/1     Running     0             15h
quickwit-indexer-1                                   1/1     Running     1 (15h ago)   15h
quickwit-indexer-2                                   0/1     Running     0             15h
quickwit-janitor-6cf74f8dbf-r4jv8                    1/1     Running     0             43h
quickwit-metastore-75fbd6775-pltvv                   1/1     Running     0             43h
quickwit-searcher-0                                  1/1     Running     0             15h
quickwit-searcher-1                                  1/1     Running     0             15h
quickwit-searcher-2                                  1/1     Running     0             15h

indexer-0.txt indexer-1.txt metastore.txt searcher-0.txt searcher-1.txt

Using PostgreSQL metastore.

fmassot commented 11 months ago

Thanks for sharing this. I will take a closer look next week. If you need to fix this in the meantime, I would delete the cluster, wait for all nodes to shut down, and recreate the cluster. This will wipe out the chitchat state.

mingshun commented 10 months ago

I deleted the cluster and then recreated a new one 4 days ago. The cluster works fine so far.

fmassot commented 10 months ago

Ok, thanks for reporting that. One possible cause is chitchat, the gossip library that Quickwit uses to form a cluster. We recently fixed a few issues in it, and 0.6.5 does not include those fixes. I will try to reproduce the problem by simulating many disconnections between nodes.