Monstache did not back off writing data when ElasticSearch disk was full (http code 429), causing log spam

ManuelSchmitzberger commented 1 year ago

Problem: Monstache doesn't back off writing data to Elasticsearch even on http status code 429.

Details:

What's happening: Monstache keeps sending data to Elasticsearch even when it's full. Elasticsearch says it's full by sending a 429 error, but Monstache ignores it and keeps trying. Monstache is flooding elasticsearch.
Problem caused: This causes a lot of logs to be made and fill up internal monitoring systems.

What Should Happen: Monstache should stop and wait before trying again when it gets a 429 error from Elasticsearch. This would help prevent too many logs and does not flood internal monitoring systems.

mologie commented 1 year ago

Hi, colleague of Manuel here. The specific error message we got was

ERROR 2023/11/24 15:43:43 Bulk response item: {"_index":"main.<col>","_id":"<id>","status":429,"error":{"type":"cluster_block_exception","reason":"index [main.<col>] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"}}

It was repeated 24 500 000 times in a duration of 10 minutes, totaling roughly 4 GiB of logs.

The steps to reproduce are (though we did not investigate yet whether these can be minimized):

Deny access to the monstache user, so that some data is queued up
Let Elasticsearch run almost full
Stop monstache
Restore access for monstache
Restart monstache
Let Elasticsearch run completely full (up to the flood-stage watermark)
Observe that monstache begins to rapidly generate log events (2+ million log entries per minute)

mologie commented 1 year ago

Additionally here is a redacted copy of the config file with which we observed the issue:

mongo-url = "mongodb://monstache:<snip:url>"
elasticsearch-urls = ["http://<snip>:9200"]
direct-read-namespaces = ["main.<snip:col>"]
change-stream-namespaces = ["main.<snip:col>"]
workers = ["worker-0", "worker-1"]
gzip = false
stats = true
index-stats = true
elasticsearch-user = "monstache"
elasticsearch-password = "<snip>"
elasticsearch-max-conns = 4
elasticsearch-validate-pem-file = false
elasticsearch-healthcheck-timeout-startup = 200
elasticsearch-healthcheck-timeout = 200
dropped-collections = true
dropped-databases = true
replay = true
resume = true
resume-write-unsafe = false
resume-name = "default"
resume-strategy = 1
index-files = true
file-highlighting = true
file-namespaces = ["users.fs.files"]
verbose = false
cluster-name = 'elasticsearch'
exit-after-direct-reads = false

I'm curious and investigating possible causes in the source code right now. A brief look tells me that the ElasticSearch library just indiscriminately calls the error handler for everything thrown at it via Add(), so if the ingress side works / provides data we'll end up with one error per ingested item. It's unclear to me however at which point throttling should best take place.

rwynn commented 11 months ago

Hi, pushed a new release to back off when indexing errors happen to mitigate the log flooding.

rwynn / monstache

Monstache did not back off writing data when ElasticSearch disk was full (http code 429), causing log spam #702