grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

promtail terminates when loki returns a 500 error from back-pressure #89

Closed · BrianChristie closed this issue 5 years ago

BrianChristie commented 5 years ago

It appears promtail terminates (and the pod restarts) when it receives a 500 error from the loki server: "Error sending batch: Error doing write: 500 - 500 Internal Server Error". According to a discussion on Slack, this occurs when the remote end is overloaded. Should this perhaps be a more specific 503 "slow down" error instead?

Perhaps back-pressure from the remote end should be expected and handled by promtail, by retrying the request with capped exponential backoff and jitter.
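
A minimal sketch of what that retry behaviour could look like in Go; `sendWithBackoff`, the `sendBatch` callback, and the limits are hypothetical stand-ins, not promtail's actual client code:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// sendWithBackoff retries a failing batch send with capped exponential
// backoff plus jitter instead of giving up (or crashing) on the first error.
func sendWithBackoff(ctx context.Context, sendBatch func() error) error {
	const (
		minBackoff = 100 * time.Millisecond
		maxBackoff = 10 * time.Second
		maxRetries = 10
	)

	backoff := minBackoff
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err := sendBatch(); err == nil {
			return nil
		}

		// Wait for the current backoff plus up to 100% random jitter,
		// or bail out early if the context is cancelled.
		jitter := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(backoff + jitter):
		case <-ctx.Done():
			return ctx.Err()
		}

		// Double the backoff for the next attempt, capped at maxBackoff.
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return fmt.Errorf("dropping batch after %d failed attempts", maxRetries)
}

func main() {
	attempts := 0
	err := sendWithBackoff(context.Background(), func() error {
		attempts++
		if attempts < 3 {
			// Simulate the back-pressure error from the report.
			return fmt.Errorf("Error doing write: 500 - 500 Internal Server Error")
		}
		return nil
	})
	fmt.Printf("finished after %d attempts, err=%v\n", attempts, err)
}
```

The cap keeps retries from backing off indefinitely, and the jitter avoids all promtail pods hammering the overloaded backend in lockstep.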

Additionally, promtail could expose a metric indicating its consumer lag, that is, the delta between the current head of the log file and what it has successfully sent to the remote server. That could be used in Alertmanager to warn when there is a danger of losing logs (for example, in Kubernetes, nodes automatically rotate and delete log files as they grow).
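
A minimal sketch of such a lag metric using client_golang; the metric name `promtail_consumer_lag_bytes`, the `recordLag` helper, and the port are hypothetical, not an existing promtail metric:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// consumerLagBytes tracks, per tailed file, the bytes between the end of the
// file and the last offset that was successfully sent to the remote server.
var consumerLagBytes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "promtail",
		Name:      "consumer_lag_bytes",
		Help:      "Bytes not yet successfully sent to the remote server, per file.",
	},
	[]string{"path"},
)

func init() {
	prometheus.MustRegister(consumerLagBytes)
}

// recordLag would be called after each successful send; fileSize and
// sentOffset are assumed to come from the tailer (e.g. os.Stat and the
// acknowledged position).
func recordLag(path string, fileSize, sentOffset int64) {
	consumerLagBytes.WithLabelValues(path).Set(float64(fileSize - sentOffset))
}

func main() {
	recordLag("/var/log/pods/example/0.log", 2500000, 2450983)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9080", nil))
}
```

An alerting rule could then fire when the gauge stays above a threshold for longer than the node's log rotation window.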

tomwilkie commented 5 years ago

Thanks for the report @BrianChristie! This should not happen. Would you mind posting the last few tens of lines from promtail before it exits? Do you know the exit code?

BrianChristie commented 5 years ago

I failed to mention that this was while sending logs to logs-us-west1.grafana.net. I just gave it another try and I'm not seeing the error now; presumably the backend capacity has been increased. I may also have been mistaken about the process terminating.

Here are the prior logs:

```
10:00:52.198 promtail level=info ts=2018-12-13T10:00:46.685374757Z caller=kubernetes.go:178 component=discovery discovery=k8s msg="Using pod service account via in-cluster config"
10:00:52.198 promtail level=info ts=2018-12-13T10:00:46.686164863Z caller=kubernetes.go:178 component=discovery discovery=k8s msg="Using pod service account via in-cluster config"
10:00:52.381 promtail level=info ts=2018-12-13T10:00:52.381542318Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"node-exporter\", controller_revision_hash=\"3272103897\", instance=\"node-exporter-2lkz4\", job=\"monitoring/node-exporter\", namespace=\"monitoring\", pod_template_generation=\"6\"}"
10:00:52.381 promtail time="2018-12-13T10:00:52Z" level=info msg="newTarget{app=\"node-exporter\", controller_revision_hash=\"3272103897\", instance=\"node-exporter-2lkz4\", job=\"monitoring/node-exporter\", namespace=\"monitoring\", pod_template_generation=\"6\"}"
10:00:52.484 promtail level=info ts=2018-12-13T10:00:52.484306778Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"cluster-fluentd-cloudwatch\", controller_revision_hash=\"811795448\", instance=\"cluster-fluentd-cloudwatch-t2mrv\", job=\"cluster-fluentd-cloudwatch/cluster-fluentd-cloudwatch\", namespace=\"cluster-fluentd-cloudwatch\", pod_template_generation=\"2\"}"
10:00:52.484 promtail time="2018-12-13T10:00:52Z" level=info msg="newTarget{app=\"cluster-fluentd-cloudwatch\", controller_revision_hash=\"811795448\", instance=\"cluster-fluentd-cloudwatch-t2mrv\", job=\"cluster-fluentd-cloudwatch/cluster-fluentd-cloudwatch\", namespace=\"cluster-fluentd-cloudwatch\", pod_template_generation=\"2\"}"
10:00:52.490 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/eeb66730-d92b-11e8-b4be-0ab5bd2076ce/cluster-fluentd-cloudwatch/0.log - &{Offset:2450983 Whence:0}
10:00:52.501 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/fda8ecf6-f3d4-11e8-aa30-02194dce8c18/node-exporter/0.log - &{Offset:5617 Whence:0}
10:00:52.501 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/fda8ecf6-f3d4-11e8-aa30-02194dce8c18/node-exporter/0.log"
10:00:52.586 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/eeb66730-d92b-11e8-b4be-0ab5bd2076ce/cluster-fluentd-cloudwatch/0.log"
10:00:52.587 promtail level=info ts=2018-12-13T10:00:52.587002987Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"kiam\", controller_revision_hash=\"2838842168\", instance=\"kiam-agent-7kp8m\", job=\"cluster-kiam/kiam\", namespace=\"cluster-kiam\", pod_template_generation=\"1\", role=\"agent\"}"
10:00:52.587 promtail time="2018-12-13T10:00:52Z" level=info msg="newTarget{app=\"kiam\", controller_revision_hash=\"2838842168\", instance=\"kiam-agent-7kp8m\", job=\"cluster-kiam/kiam\", namespace=\"cluster-kiam\", pod_template_generation=\"1\", role=\"agent\"}"
10:00:52.681 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/50127684-d604-11e8-b4be-0ab5bd2076ce/agent/0.log - &{Offset:5994548 Whence:0}
10:00:52.681 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/50127684-d604-11e8-b4be-0ab5bd2076ce/agent/0.log"
10:00:52.685 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/0520614c-a703-11e8-b4be-0ab5bd2076ce/kiam/1.log - &{Offset:597 Whence:0}
10:00:52.685 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/0520614c-a703-11e8-b4be-0ab5bd2076ce/kiam/1.log"
10:00:52.685 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/0520614c-a703-11e8-b4be-0ab5bd2076ce/kiam/2.log - &{Offset:2992790 Whence:0}
10:00:52.685 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/0520614c-a703-11e8-b4be-0ab5bd2076ce/kiam/2.log"
10:00:52.884 promtail level=info ts=2018-12-13T10:00:52.884543451Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"flannel\", controller_revision_hash=\"3331141618\", instance=\"kube-flannel-ds-x4plh\", job=\"kube-system/flannel\", namespace=\"kube-system\", pod_template_generation=\"2\", role_kubernetes_io_networking=\"1\", tier=\"node\"}"
10:00:52.884 promtail time="2018-12-13T10:00:52Z" level=info msg="newTarget{app=\"flannel\", controller_revision_hash=\"3331141618\", instance=\"kube-flannel-ds-x4plh\", job=\"kube-system/flannel\", namespace=\"kube-system\", pod_template_generation=\"2\", role_kubernetes_io_networking=\"1\", tier=\"node\"}"
10:00:53.083 promtail level=info ts=2018-12-13T10:00:53.082819354Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"scalyr-agent-2\", controller_revision_hash=\"154853095\", instance=\"scalyr-agent-2-8w5t4\", job=\"cluster-scalyr/scalyr-agent-2\", namespace=\"cluster-scalyr\", pod_template_generation=\"1\"}"
10:00:53.083 promtail time="2018-12-13T10:00:53Z" level=info msg="newTarget{app=\"scalyr-agent-2\", controller_revision_hash=\"154853095\", instance=\"scalyr-agent-2-8w5t4\", job=\"cluster-scalyr/scalyr-agent-2\", namespace=\"cluster-scalyr\", pod_template_generation=\"1\"}"
10:00:53.088 promtail level=info ts=2018-12-13T10:00:53.088129332Z caller=targetmanager.go:140 msg="Adding target" key="{instance=\"promtail-cm4fn\", job=\"cluster-loki/promtail\", namespace=\"cluster-loki\"}"
10:00:53.088 promtail time="2018-12-13T10:00:53Z" level=info msg="newTarget{instance=\"promtail-cm4fn\", job=\"cluster-loki/promtail\", namespace=\"cluster-loki\"}"
10:00:53.181 promtail 2018/12/13 10:00:53 Seeked /var/log/pods/7c5c1158-fa11-11e8-aa30-02194dce8c18/scalyr-agent/0.log - &{Offset:0 Whence:0}
10:00:53.181 promtail time="2018-12-13T10:00:53Z" level=info msg="Tailing file: /var/log/pods/7c5c1158-fa11-11e8-aa30-02194dce8c18/scalyr-agent/0.log"
10:00:53.284 promtail 2018/12/13 10:00:53 Seeked /var/log/pods/ef063e24-febd-11e8-aa30-02194dce8c18/promtail/0.log - &{Offset:0 Whence:0}
10:00:53.382 promtail time="2018-12-13T10:00:53Z" level=info msg="Tailing file: /var/log/pods/ef063e24-febd-11e8-aa30-02194dce8c18/promtail/0.log"
10:00:53.383 promtail 2018/12/13 10:00:53 Seeked /var/log/pods/0525c5e0-a703-11e8-b4be-0ab5bd2076ce/kube-flannel/0.log - &{Offset:4735 Whence:0}
10:00:53.383 promtail time="2018-12-13T10:00:53Z" level=info msg="Tailing file: /var/log/pods/0525c5e0-a703-11e8-b4be-0ab5bd2076ce/kube-flannel/0.log"
10:00:58.661 promtail time="2018-12-13T10:00:58Z" level=error msg="Error sending batch: Error doing write: 500 - 500 Internal Server Error"
```
tomwilkie commented 5 years ago

@BrianChristie we've fixed a bunch of errors on the backend now, yeah - you shouldn't see as many 500s. I don't think the process was terminating either, or at least I've not been able to reproduce this.