Closed: BrianChristie closed this issue 5 years ago
Thanks for the report @BrianChristie! This should not happen. Would you mind posting the last few tens of lines of promtail output before it exits? Do you know the exit code?
I failed to mention, this was with sending logs to logs-us-west1.grafana.net. I just gave it a try again and I'm not seeing the error now, presumably the backend capacity has been increased. Also I may have been mistaken about the process terminating.
Here are the prior logs:
@BrianChristie we've fixed a bunch of errors on the backend now, yeah - you shouldn't see as many 500s. I don't think the process was terminating either, or at least I've not been able to reproduce this.
It appears promtail is terminating (and the pod is restarting) when it receives a 500 error from the Loki server:
"Error sending batch: Error doing write: 500 - 500 Internal Server Error"
From a discussion on Slack, this occurs when the remote end is overloaded. Possibly this should be a more specific `503 Slow Down` error? Perhaps back-pressure from the remote end should be expected and handled by promtail, by retrying the request with a capped exponential backoff with jitter (see the sketch below).
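A minimal sketch of what capped exponential backoff with full jitter around the batch send could look like; the `sendBatch` callback and the retry/delay constants are hypothetical, not promtail's actual client code:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// sendWithBackoff retries a failing send with capped exponential backoff
// and full jitter. The constants here are illustrative placeholders.
func sendWithBackoff(sendBatch func() error) error {
	const (
		maxRetries = 5
		baseDelay  = 500 * time.Millisecond
		maxDelay   = 30 * time.Second
	)

	var err error
	for attempt := 0; attempt < maxRetries; attempt++ {
		if err = sendBatch(); err == nil {
			return nil
		}

		// Exponential backoff: baseDelay * 2^attempt, capped at maxDelay.
		delay := baseDelay << uint(attempt)
		if delay > maxDelay {
			delay = maxDelay
		}
		// Full jitter: sleep a random duration in [0, delay).
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxRetries, err)
}

func main() {
	// Hypothetical always-failing sender, mimicking the reported error.
	err := sendWithBackoff(func() error {
		return errors.New("Error doing write: 500 - 500 Internal Server Error")
	})
	fmt.Println(err)
}
```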
Additionally, promtail could expose a metric indicating its consumer lag, that is, the delta between the current head of the log file and what it has successfully processed and sent to the remote server. That could be used in Alertmanager to warn when there is a danger of losing logs (for example, in Kubernetes, nodes automatically rotate and delete log files as they grow). A sketch of how such a metric could be exposed follows.
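A minimal sketch, using the Prometheus Go client, of how such a lag gauge could be exposed; the metric name `promtail_consumer_lag_bytes`, the `path` label, and the `recordLag` helper are assumptions for illustration, not promtail's real instrumentation:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Gauge tracking, per file, how far behind the sender is relative to the
// current end of the log file (hypothetical metric name and label).
var logConsumerLag = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "promtail_consumer_lag_bytes",
		Help: "Bytes between the current end of the log file and the last offset successfully sent to the remote server.",
	},
	[]string{"path"},
)

func init() {
	prometheus.MustRegister(logConsumerLag)
}

// recordLag would be called after each tail/send cycle with the file's
// current size and the last offset acknowledged by the remote server.
func recordLag(path string, fileSize, sentOffset int64) {
	logConsumerLag.WithLabelValues(path).Set(float64(fileSize - sentOffset))
}

func main() {
	// Example update: the file has grown past what has been shipped.
	recordLag("/var/log/pods/example.log", 10_485_760, 10_240_000)

	// Expose the metric for Prometheus to scrape and Alertmanager to alert on.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9080", nil))
}
```

An Alertmanager rule could then fire when the lag stays above some fraction of the log rotation size for too long, warning before rotation deletes unshipped data.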