Open ManuelSchmitzberger opened 1 year ago
Hi, colleague of Manuel here. The specific error message we got was
ERROR 2023/11/24 15:43:43 Bulk response item: {"_index":"main.<col>","_id":"<id>","status":429,"error":{"type":"cluster_block_exception","reason":"index [main.<col>] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"}}
It was repeated 24 500 000 times in a duration of 10 minutes, totaling roughly 4 GiB of logs.
The steps to reproduce are (though we did not investigate yet whether these can be minimized):
Additionally here is a redacted copy of the config file with which we observed the issue:
mongo-url = "mongodb://monstache:<snip:url>"
elasticsearch-urls = ["http://<snip>:9200"]
direct-read-namespaces = ["main.<snip:col>"]
change-stream-namespaces = ["main.<snip:col>"]
workers = ["worker-0", "worker-1"]
gzip = false
stats = true
index-stats = true
elasticsearch-user = "monstache"
elasticsearch-password = "<snip>"
elasticsearch-max-conns = 4
elasticsearch-validate-pem-file = false
elasticsearch-healthcheck-timeout-startup = 200
elasticsearch-healthcheck-timeout = 200
dropped-collections = true
dropped-databases = true
replay = true
resume = true
resume-write-unsafe = false
resume-name = "default"
resume-strategy = 1
index-files = true
file-highlighting = true
file-namespaces = ["users.fs.files"]
verbose = false
cluster-name = 'elasticsearch'
exit-after-direct-reads = false
I'm curious and investigating possible causes in the source code right now. A brief look tells me that the ElasticSearch library just indiscriminately calls the error handler for everything thrown at it via Add(), so if the ingress side works / provides data we'll end up with one error per ingested item. It's unclear to me however at which point throttling should best take place.
Hi, pushed a new release to back off when indexing errors happen to mitigate the log flooding.
Problem: Monstache doesn't back off writing data to Elasticsearch even on http status code 429.
Details:
What Should Happen: Monstache should stop and wait before trying again when it gets a 429 error from Elasticsearch. This would help prevent too many logs and does not flood internal monitoring systems.