mingshun opened 11 months ago
The logs show that the pods running the indexers and searchers cannot connect to your metastore. As a result, they never reach the ready state, and Kubernetes eventually restarts them, so the cycle repeats indefinitely. This is only a symptom: once the indexers and searchers can connect to the metastore again, it will disappear.
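A quick way to confirm this pattern (the pod name is taken from the listing below; the probe details come from the chart's defaults):

```
# Show the readiness probe failure events for one of the stuck pods
kubectl describe pod quickwit-indexer-0 | grep -A 3 'Readiness'

# Tail the indexer logs for metastore connection errors
kubectl logs quickwit-indexer-0 --tail=100
```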
Did you use the Quickwit Helm chart?
@mingshun can you share the logs of the metastore pod?
@fmassot metastore logs: quickwit.log
Thanks, the metastore seems to be fine.
Can you share the output of `kubectl get pods`?
Did you try to restart the metastore or to delete and recreate the cluster?
@fmassot
```
NAME                                     READY   STATUS    RESTARTS   AGE
quickwit-control-plane-8d58cfb78-bj958   1/1     Running   0          27h
quickwit-indexer-0                       0/1     Running   0          26h
quickwit-indexer-1                       0/1     Running   0          26h
quickwit-indexer-2                       0/1     Running   0          26h
quickwit-janitor-6cf74f8dbf-r4jv8        1/1     Running   0          27h
quickwit-metastore-75fbd6775-pltvv       1/1     Running   0          27h
quickwit-searcher-0                      0/1     Running   0          27h
quickwit-searcher-1                      1/1     Running   0          27h
quickwit-searcher-2                      1/1     Running   0          27h
```
searcher-2 is ready now, but the indexers and searcher-0 are still unavailable.
I tried restarting all components. At first they joined the cluster, but after several minutes some of them left.
I have also checked the network connectivity and found no problem.
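For reference, the connectivity check was along these lines (this assumes nc is available in the image, that quickwit-metastore is the metastore service name, and Quickwit's default ports):

```
# TCP check against the metastore's gRPC port (7281 by default)
kubectl exec quickwit-indexer-0 -- nc -vz quickwit-metastore 7281

# TCP check against the REST port (7280 by default)
kubectl exec quickwit-indexer-0 -- nc -vz quickwit-metastore 7280
```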
Ok. Can you share the state of the cluster for quickwit-indexer-0, quickwit-searcher-0, quickwit-searcher-1, and quickwit-metastore-75fbd6775-pltvv? The state is given by the endpoint http://quickwit-indexer-0:7280/api/v1/cluster.
Additional question: are you using a file-backed metastore or a PostgreSQL metastore?
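For example, with a port-forward to each node in turn:

```
# Forward the REST port of the pod to localhost
kubectl port-forward quickwit-indexer-0 7280:7280 &

# Dump the node's view of the cluster state
curl -s http://localhost:7280/api/v1/cluster
```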
I restarted the searchers and indexers yesterday. The current pod status is:
```
NAME                                     READY   STATUS    RESTARTS      AGE
quickwit-control-plane-8d58cfb78-bj958   1/1     Running   0             43h
quickwit-indexer-0                       0/1     Running   0             15h
quickwit-indexer-1                       1/1     Running   1 (15h ago)   15h
quickwit-indexer-2                       0/1     Running   0             15h
quickwit-janitor-6cf74f8dbf-r4jv8        1/1     Running   0             43h
quickwit-metastore-75fbd6775-pltvv       1/1     Running   0             43h
quickwit-searcher-0                      1/1     Running   0             15h
quickwit-searcher-1                      1/1     Running   0             15h
quickwit-searcher-2                      1/1     Running   0             15h
```
indexer-0.txt indexer-1.txt metastore.txt searcher-0.txt searcher-1.txt
We are using the PostgreSQL metastore.
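(For context, that means the node config points the metastore at Postgres, roughly like the excerpt below; host and credentials are placeholders:)

```yaml
# Node config excerpt - placeholder credentials
metastore_uri: postgres://quickwit:password@postgres-host:5432/quickwit-metastore
```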
Thanks for sharing this. I will take a closer look next week. If you need a fix in the meantime, I would delete the cluster, wait for all nodes to shut down, and recreate the cluster. This will wipe out the chitchat state.
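With the Helm chart, that would look roughly like this (the release and namespace names are assumptions):

```
# Delete the release, then wait until all quickwit pods are gone
helm uninstall quickwit -n quickwit
kubectl get pods -n quickwit --watch

# Recreate the cluster with the same values
helm install quickwit quickwit/quickwit -n quickwit -f values.yaml
```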
I deleted the cluster and recreated a new one 4 days ago. The cluster has been working fine so far.
Ok, thanks for reporting that. One possible cause is chitchat, the gossip library Quickwit uses to form a cluster. We recently fixed a few issues in it, and 0.6.5 does not include those fixes. I will try to reproduce the problem with many disconnections between nodes.
Describe the bug I deployed Quickwit on EKS. The HTTP probes of the indexers and searchers fail with status code 503. Both the indexers and searchers repeatedly output lots of the following logs:
The metastore outputs lots of the following logs:
It seems the searchers and indexers are joining and leaving the cluster frequently. The Quickwit search REST API is not working. I don't know if the ingest API is working, because I cannot get any data back from the search API.
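The 503s can be seen directly with a port-forward (the /health/readyz path is an assumption based on the chart's default probes):

```
# Forward the REST port of a stuck indexer to localhost
kubectl port-forward quickwit-indexer-0 7280:7280 &

# Print only the HTTP status code of the readiness endpoint
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:7280/health/readyz
```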
Steps to reproduce (if applicable) I have no idea how to reproduce it.
Expected behavior How to fix it?
Configuration: Please provide:
`quickwit --version`
```yaml
index_id: my-index

doc_mapping:
  mode: strict
  field_mappings:
    - name: links
      type: array
      tokenizer: raw
  timestamp_field: span_start_timestamp_nanos
  partition_key: hash_mod(service_name, 100)
  tag_fields: [service_name]

indexing_settings:
  commit_timeout_secs: 5

search_settings:
  default_search_fields: []

retention:
  period: 5 days
  schedule: daily
```
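For anyone trying to reproduce this, a config like the one above can be applied through the REST API along these lines (the target host and file name are placeholders):

```
# Create the index from the YAML config above
curl -XPOST http://localhost:7280/api/v1/indexes \
     -H 'content-type: application/yaml' \
     --data-binary @index-config.yaml
```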