Open tim-sendible opened 3 years ago
The same thing. I think the reason is that the ring service doesn't route traffic to non-ready pods. I will try it later.
Nope, it didn't help. Any ideas?
Loki-distributed is currently very raw; I hope it will become more user-friendly.
Can't make it work stably, and I'm getting the same err="empty ring".
We are facing the same issues. Are you guys using a custom (vendor) CNI or the AWS bundled one?
I configured loki-distributed using the example setup in the repository on EKS last week, and it works nicely. It performs surprisingly well. It's running 1.19.6-eks with mostly default configuration from clicking "create cluster" in the AWS Console.
We are facing the same issues. Are you guys using a custom (vendor) CNI or the AWS bundled one?
We tried on GKE and AKS with the Calico addon.
For us it works with bare AWS EKS and the bundled CNI. It fails with the symptoms described here when using Cilium in overlay mode.
Loki-distributed is currently very raw; I hope it will become more user-friendly.
@Zeka13 Can you elaborate? The chart does work very well and is pretty full-featured.
The issue here is probably not related to the chart. I'd suggest reaching out in Grafana's community forum or on Grafana Slack to get help with EKS-specific issues.
Loki distributed does not work on GKE private clusters entirely. The gossip network will fail every time.
Did you ever figure out how to get this to run?
Adding the -memberlist.bind_addr=127.0.0.1 CLI flag to all systems allowed them to start up. Running on GKE.
Working further on this, I've found that the following allows full ring communication on GKE:
distributor:
  replicas: 2
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
You will want to set this on all Loki containers.
Hi guys, I am also facing a similar issue in my Loki microservices setup, deployed on an AWS EKS cluster (v1.20). Functionality-wise it is working fine; I'm not sure why we are getting this error in the Loki distributor logs.
Please suggest whether we can safely ignore it or need to look into it. I have even set resource limits and requests for the distributor container, but I still see the errors in the logs.
Please help!
For anyone stumbling into this problem, I suggest you take a look at this issue on the Thanos project, which explains what was happening in my case. I was deploying into a private k8s cluster on Azure and had incorrectly configured my IP address range, which was then being filtered out because it was not in this list.
If you are deploying Loki (or even Tempo) in distributed mode in a private cluster using memberlist, make sure that you are using valid private IP addresses on your subnets:
10.0.0.0/8
100.64.0.0/10
172.16.0.0/12
192.88.99.0/24
192.168.0.0/16
198.18.0.0/15
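As an illustrative aside (not from the thread): the filtering described above can be mirrored in a few lines of Python. This sketch assumes memberlist's auto-detection simply tests candidate addresses against the private ranges listed above, which matches the behaviour described in the Thanos issue.

```python
import ipaddress

# The private ranges listed above, which memberlist's automatic
# advertise-address detection accepts (via go-sockaddr).
PRIVATE_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "10.0.0.0/8",
    "100.64.0.0/10",
    "172.16.0.0/12",
    "192.88.99.0/24",
    "192.168.0.0/16",
    "198.18.0.0/15",
)]

def is_private_for_memberlist(ip: str) -> bool:
    """Return True if the address falls inside one of the accepted ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_RANGES)

print(is_private_for_memberlist("192.168.1.10"))  # True
print(is_private_for_memberlist("203.0.113.5"))   # False, public address space
```

If your pod IPs fail a check like this, memberlist will report "no private IP address found, and explicit IP not provided".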
Hi, please share if anyone has a solution for this.
@unguiculus yes, I can elaborate. As you can see, many people have problems even getting started with this chart, and this very ticket is still open.
Personally, I don't have the resources to maintain Loki charts right now, so I will not follow your advice about contacting the Grafana community; I simply will not use these broken charts.
We use this chart in production. It does work; you just need to tell Loki which addresses to bind to. The solutions are in this issue.
Shouldn't this be the default for all the components of Tempo, Loki, and Mimir, since they are very likely to suffer from the same issue? I was thinking it could be part of the template, maybe behind a flag so it can be disabled if required, with a reference to this issue (or the Thanos issue) and the list of valid CIDRs:
extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
extraArgs:
  - -memberlist.bind-addr=$(MY_POD_IP)
Thank you, this solution solved the problem.
Thanks! You helped me.
Helm chart loki-distributed version 0.74.6 doesn't seem to need this workaround; in fact, it fails to start, complaining that the address is already bound.
I think we might be able to close this issue now.
I am using Azure CNI Overlay and this worked for me
You may run into this issue if you try to deploy with this method using the latest charts (reference: https://github.com/grafana/loki/issues/10797). This needs to be updated in your values:
structuredConfig:
  memberlist:
    bind_addr: []
Had the same problem with an EKS cluster. It had to do with what @jmadureira said: the Service IPv4 range had to be changed.
It's bizarre that nobody actually posts the whole thing. Here is the values.yaml if you want to incorporate it fully for the loki-distributed chart:
loki:
  structuredConfig:
    memberlist:
      bind_addr: []
distributor:
  replicas: 1
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
querier:
  replicas: 1
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
queryFrontend:
  replicas: 1
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
ingester:
  replicas: 1
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
I'm running on EKS 1.18.
If I take the loki-distributed helm chart and apply it with the values.yml as written, I end up with the distributor, ingester and querier in a CrashLoopBackOff state complaining:
failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided
This seems to be a reasonably common error related to the memberlist. I understand that I should provide a private IP address; however, it's unclear what address I should be adding.
If I add 127.0.0.1 to the memberlist config, things seem to get a little further. All the containers at least go ready, but they eventually fail, and the ring never gets any members added to it (shown by navigating to the /ring URL of the distributor service).
127.0.0.1 is a complete guess based on trial and error, as I can't find any documentation explaining what IP address I should be applying here.
I have also tried 172.120.0.0/16, which is the CIDR range of IP addresses available to my pods. This time, I see the ingester being added to the ring. It is even temporarily healthy, before the state goes to 'unhealthy' and everything grinds to a halt again.
Here are some logs from the ingester that may or may not be useful. By this point, the state of the instance in the ring is 'unhealthy', even though it seems to be uploading the tables somewhere. Also during this time, both the querier and the distributor are reporting
err="empty ring"
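An editorial note worth adding here (not from the thread): 172.120.0.0/16 is not actually private address space. RFC 1918 only reserves 172.16.0.0/12, which covers 172.16.0.0 through 172.31.255.255, so an address in 172.120.0.0/16 would be rejected by memberlist's private-address detection, consistent with the "no private IP address found" error above. A quick check:

```python
import ipaddress

pod_cidr = ipaddress.ip_network("172.120.0.0/16")
rfc1918 = ipaddress.ip_network("172.16.0.0/12")

# subnet_of() is True only when every address in pod_cidr lies inside rfc1918.
print(pod_cidr.subnet_of(rfc1918))  # False: 172.120.x.x is public address space
print(rfc1918.broadcast_address)    # 172.31.255.255, the top of the /12
```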