grafana / helm-charts

Loki-distributed doesn't work out of the box #157

Open tim-sendible opened 3 years ago

tim-sendible commented 3 years ago

I'm running on EKS 1.18

If I take the loki-distributed Helm chart and apply it with the values.yaml as written, I end up with the distributor, ingester, and querier in a CrashLoopBackOff state, complaining: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided.

This seems to be a reasonably common error related to the memberlist. I understand that I should provide a private IP address; however, it's unclear which address I should be adding.

If I add:

    bind_addr:
        - 127.0.0.1

to the memberlist config, things seem to get a little further. All the containers at least go ready, but they eventually fail, and the ring never gets any members added to it (shown by navigating to the /ring URL of the distributor service).

127.0.0.1 is a complete guess based on trial and error as I can't find any documentation explaining what IP address I should be applying here.

I have also tried 172.120.0.0/16, which is the CIDR range of IP addresses available to my pods. This time, I see the ingester being added to the ring. It is even temporarily healthy, before the state goes to 'unhealthy' and everything grinds to a halt again.

Here are some logs from the ingester that may or may not be useful. By this point, the state of the instance in the ring is 'unhealthy', even though it seems to be uploading the tables somewhere. Also, during this time both the querier and the distributor are reporting err="empty ring".

level=info ts=2021-01-04T15:35:16.752555087Z caller=lifecycler.go:547 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2021-01-04T15:35:16.752736192Z caller=lifecycler.go:394 msg="auto-joining cluster after timeout" ring=ingester
level=info ts=2021-01-04T15:35:16.756522831Z caller=memberlist_client.go:461 msg="joined memberlist cluster" reached_nodes=2
ts=2021-01-04T15:35:17.753163266Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-querier-0-104bc685' from=[::]:7946"
ts=2021-01-04T15:35:18.254113836Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node loki-test-loki-distributed-querier-0-104bc685 from=127.0.0.1:54602"
ts=2021-01-04T15:35:18.254175417Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-querier-0-104bc685' from=[::]:7946"
ts=2021-01-04T15:35:18.254210598Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: EOF"
ts=2021-01-04T15:35:18.752862962Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-test-loki-distributed-querier-0-104bc685 has failed, no acks received"
ts=2021-01-04T15:35:18.753670742Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f' from=[::]:7946"
ts=2021-01-04T15:35:19.254242102Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f from=127.0.0.1:54638"
ts=2021-01-04T15:35:19.254377786Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: EOF"
ts=2021-01-04T15:35:20.75323704Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f has failed, no acks received"
ts=2021-01-04T15:35:21.753527161Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-querier-0-104bc685' from=[::]:7946"
ts=2021-01-04T15:35:22.254288357Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node loki-test-loki-distributed-querier-0-104bc685 from=127.0.0.1:54706"
ts=2021-01-04T15:35:22.254566363Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: EOF"
ts=2021-01-04T15:35:22.753134246Z caller=memberlist_logger.go:74 level=info msg="Marking loki-test-loki-distributed-querier-0-104bc685 as failed, suspect timeout reached (0 peer confirmations)"
ts=2021-01-04T15:35:24.752835308Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-test-loki-distributed-querier-0-104bc685 has failed, no acks received"
ts=2021-01-04T15:35:24.753416282Z caller=memberlist_logger.go:74 level=info msg="Marking loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f as failed, suspect timeout reached (0 peer confirmations)"
ts=2021-01-04T15:35:24.754445635Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f' from=[::]:7946"
ts=2021-01-04T15:35:25.255044246Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f from=127.0.0.1:54800"
ts=2021-01-04T15:35:25.255200831Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: EOF"
ts=2021-01-04T15:35:28.756425213Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f has failed, no acks received"
level=info ts=2021-01-04T15:36:16.751323445Z caller=table_manager.go:171 msg="uploading tables"

unittolabs commented 3 years ago

Same thing here. I think the reason is that the ring service doesn't route traffic to non-ready pods. I will try it later.

unittolabs commented 3 years ago

Same thing here. I think the reason is that the ring service doesn't route traffic to non-ready pods. I will try it later.

Nope, that didn't help. Any ideas?

mossad-zika commented 3 years ago

Loki-distributed is currently very raw; I hope it will become more user-friendly.

Can't make it work stably, and I'm getting the same err="empty ring".

twiechert commented 3 years ago

We are facing the same issues. Are you guys using a custom (vendor) CNI or the AWS bundled one?

jandragsbaek commented 3 years ago

I configured loki-distributed using the example setup in the repository on EKS last week and it works nicely. It performs surprisingly well.

It's running 1.19.6-eks with mostly the default configuration you get when you click "create cluster" in the AWS Console.

unittolabs commented 3 years ago

We are facing the same issues. Are you guys using a custom (vendor) CNI or the AWS bundled one?

We've tried on GKE and AKS with the Calico add-on.

twiechert commented 3 years ago

For us it works with plain AWS EKS and the bundled CNI. It fails with the symptoms described here when using Cilium in overlay mode.

unguiculus commented 3 years ago

Loki-distributed is currently very raw; I hope it will become more user-friendly.

@Zeka13 Can you elaborate? The chart does work very well and is pretty full-featured.

The issue here is probably not related to the chart. I'd suggest reaching out in Grafana's community forum or on Grafana Slack to get help with EKS-specific issues.

Kampe commented 3 years ago

Loki distributed does not work on GKE private clusters at all. The gossip network will fail every time.

gyoza commented 3 years ago

Loki distributed does not work on GKE private clusters at all. The gossip network will fail every time.

Did you ever figure out how to get this to run?

Adding the

-memberlist.bind-addr=127.0.0.1

CLI flag to all components allowed them to start up. Running on GKE.

Working on this further, I've found that the following allows full ring communication on GKE.

distributor:
  replicas: 2
  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)

You will want to set this on all loki containers.
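
For context on why this works (a rough sketch, not the chart's exact rendered output): status.podIP is the Kubernetes downward API field for the pod's own IP, and $(MY_POD_IP) references in container args are expanded by Kubernetes from the container's environment at start-up, so each replica binds memberlist to its own pod IP. In the rendered distributor pod spec, the override lands roughly like this:

env:
- name: MY_POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
args:
# ...the args the chart already renders for the component...
- -memberlist.bind-addr=$(MY_POD_IP)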

naveen210121 commented 2 years ago

Hi guys, I am also facing a similar issue in my Loki microservices setup, deployed on an AWS EKS cluster (v1.20). Functionality-wise it is working fine, though; I'm not sure why we are getting this error in the Loki distributor logs.

Please suggest whether we can safely ignore it or need to look into it. I have even set resource limits and requests for the distributor container, but I'm still seeing the errors in the logs.

Please help!!

jmadureira commented 2 years ago

For anyone stumbling into this problem, I suggest you take a look at this issue on the Thanos project, which explains what was happening in my case.

In my case I was deploying into a private k8s cluster on Azure and had incorrectly configured my IP address range, which was then being filtered out because it was not in this list.

If you are deploying Loki (or even Tempo) in distributed mode in a private cluster using memberlist, make sure that you are using valid private IP addresses on your subnets:

10.0.0.0/8
100.64.0.0/10
172.16.0.0/12
192.88.99.0/24
192.168.0.0/16
198.18.0.0/15

sreejithsoman-mc commented 2 years ago

Hi, please share if anyone has a solution for this.

mossad-zika commented 2 years ago

@unguiculus Yes, I can elaborate. As you can see, many people have problems even getting started with this chart, and this very ticket is still open.

Personally, I don't have the resources to maintain Loki charts right now, so I will not follow your advice about contacting the Grafana community; I simply will not use these broken charts.

jmell-slg commented 2 years ago

We use this chart in production. It does work; you just need to tell Loki what addresses to bind to. The solutions are in this issue.

carlosjgp commented 2 years ago

Shouldn't this be the default for all the components of Tempo, Loki, and Mimir, since they are very likely to suffer from the same issue? I was thinking of making it part of the templates, maybe behind a flag so it can be disabled if required, with a reference to this issue (or the Thanos issue) and the list of valid CIDRs:

  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
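
For illustration only, such a template-level default might look roughly like the fragment below, injected into each component's container spec; the value name memberlist.bindToPodIP is invented here and is not an existing chart option:

{{- if .Values.memberlist.bindToPodIP }}
# give the container its own pod IP via the downward API
env:
- name: MY_POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
# append the bind address to the component's existing args
args:
- -memberlist.bind-addr=$(MY_POD_IP)
{{- end }}
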
vincent927 commented 1 year ago

Loki distributed does not work on GKE private clusters at all. The gossip network will fail every time.

Did you ever figure out how to get this to run?

Adding the

-memberlist.bind-addr=127.0.0.1

CLI flag to all components allowed them to start up. Running on GKE.

Working on this further, I've found that the following allows full ring communication on GKE.

distributor:
  replicas: 2
  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)

You will want to set this on all loki containers.

Thank you, this solution solved the problem.

saitama-24 commented 1 year ago

Shouldn't this be the default for all the components of Tempo, Loki, and Mimir, since they are very likely to suffer from the same issue? I was thinking of making it part of the templates, maybe behind a flag so it can be disabled if required, with a reference to this issue (or the Thanos issue) and the list of valid CIDRs:

  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)

Thanks! You helped me.

carlosjgp commented 1 year ago

Helm chart loki-distributed version 0.74.6 doesn't seem to need this workaround; in fact, it fails to start, complaining that the address is already bound.

I think we might be able to close this issue now

jjayabal23 commented 10 months ago

Loki distributed does not work on GKE private clusters at all. The gossip network will fail every time.

Did you ever figure out how to get this to run?

Adding the

-memberlist.bind-addr=127.0.0.1

CLI flag to all components allowed them to start up. Running on GKE.

Working on this further, I've found that the following allows full ring communication on GKE.

distributor:
  replicas: 2
  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)

You will want to set this on all loki containers.

I am using Azure CNI Overlay and this worked for me

gyoza commented 9 months ago

You may run into this issue if you try to deploy with this method using the latest charts:

Reference: https://github.com/grafana/loki/issues/10797

This needs to be updated in your values:

loki:
  structuredConfig:
    memberlist:
      bind_addr: []

RachelNaane commented 7 months ago

Had the same problem with an EKS cluster. It had to do with what @jmadureira said; the Service IPv4 range had to be changed.

pulsedynamic commented 3 months ago

It's so berserk that nobody actually posts the whole thing. Here is the values.yaml if you want to incorporate it fully for the loki-distributed chart:

loki:
  structuredConfig:
    memberlist:
      bind_addr: []

distributor:
  replicas: 1
  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)

querier:
  replicas: 1
  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)

queryFrontend:
  replicas: 1
  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)

ingester:
  replicas: 1
  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
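
If you run other components that join the memberlist ring (for example the compactor or index gateway in newer Loki versions), the same block should apply there too, assuming their sections in the loki-distributed values expose extraEnv and extraArgs like the components above. A sketch for the compactor:

compactor:
  extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)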