grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

loki distributors receiving uneven traffic #13113

Open madhu-reddy-peram opened 5 months ago

madhu-reddy-peram commented 5 months ago

Hi,

We are running a large Loki cluster that ingests 677k lines per second, about 295 MiB of logs per second.

Recently, after enabling autoscaling for the distributors, we started seeing strange behavior: more requests are routed to distributor pods that are newly created on a new node with more CPU and memory resources in our Kubernetes cluster. Because the new pods receive more traffic, Loki returns rate limit errors.

We expose the distributors through a regular (non-headless) Kubernetes service. The entry points to that service are an ingress controller and a gaoling app that validates ingestion traffic before routing it to the distributors. We checked the traffic on both the ingress controller pods and the gaoling app pods, and it is evenly distributed. Only the new distributor pods receive more traffic, as shown in the image below: new distributor pods came up at 10:50 UTC on two new nodes and started processing more requests.

Is this expected behavior for distributors? Do distributors process traffic more quickly when the node has more resource capacity? We don't observe this behavior on a node with more than three distributors; those always receive the same amount of traffic. Could you shed some light on this? Thanks.

image

madhu-reddy-peram commented 5 months ago

I raised this question in the Loki Slack but haven't received a response yet. I would appreciate it if someone could give some insight into this issue.

ftong2020 commented 5 months ago

Try fewer distributors. Our cluster is a similar size to yours (440k lines/s, 210 MB/s), and it uses only 4 distributors.

madhu-reddy-peram commented 5 months ago

@ftong2020 - Thanks for your response on this. I will consider this option for sure.

One more clarification, please: how many ingesters are you running in your setup? Do you maintain any ratio between distributors and ingesters? We have 45 ingesters (15 in each zone). We set the minimum number of distributor replicas to 15 and the maximum to double the total number of ingesters (i.e. 90), so we use the ingester count to derive the distributor min and max replicas.
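
As a rough sketch of that arithmetic: the snippet below derives the replica bounds from the ingester count and shows the average load per distributor if traffic were spread perfectly evenly. The helper names are made up for illustration and are not part of Loki or Kubernetes. The figures are the ones quoted in this thread.

```python
# Figures quoted in this thread. The helpers are hypothetical, for illustration only.
INGESTERS_PER_ZONE = 15
ZONES = 3


def distributor_bounds(total_ingesters: int, min_replicas: int = 15,
                       max_factor: int = 2) -> tuple[int, int]:
    """Return (min, max) replicas: a fixed minimum, and max_factor x the ingester count."""
    return min_replicas, max_factor * total_ingesters


def per_replica(total: float, replicas: int) -> float:
    """Average load per replica, assuming perfectly even distribution."""
    return total / replicas


total_ingesters = INGESTERS_PER_ZONE * ZONES
lo, hi = distributor_bounds(total_ingesters)
print(f"ingesters={total_ingesters} min_distributors={lo} max_distributors={hi}")

# Our cluster: 677k lines/s and 295 MiB/s at the autoscaling bounds.
for replicas in (lo, hi):
    lines = per_replica(677_000, replicas)
    mib = per_replica(295, replicas)
    print(f"{replicas} distributors: {lines:.0f} lines/s, {mib:.2f} MiB/s each")

# The cluster mentioned above: 440k lines/s on 4 distributors.
print(f"4 distributors: {per_replica(440_000, 4):.0f} lines/s each")
```

Even at our minimum of 15 replicas, each distributor would handle well under half the per-distributor line rate of the 4-distributor cluster, assuming traffic is balanced evenly.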