grafana / helm-charts


[loki-distributed] not ready: number of queriers connected to query-frontend is 0 #2295

Open kladiv opened 1 year ago

kladiv commented 1 year ago

Hello, regarding issue https://github.com/grafana/helm-charts/issues/2028, I still get the error below when queryFrontend.replicas is 2:

level=info ts=2023-03-27T09:12:12.813260784Z caller=module_service.go:82 msg=initialising module=cache-generation-loader
level=info ts=2023-03-27T09:12:12.813268147Z caller=module_service.go:82 msg=initialising module=server
level=info ts=2023-03-27T09:12:12.813276593Z caller=module_service.go:82 msg=initialising module=usage-report
level=info ts=2023-03-27T09:12:12.813280821Z caller=module_service.go:82 msg=initialising module=runtime-config
level=info ts=2023-03-27T09:12:12.813430544Z caller=module_service.go:82 msg=initialising module=query-frontend-tripperware
level=info ts=2023-03-27T09:12:12.813443759Z caller=module_service.go:82 msg=initialising module=query-frontend
level=info ts=2023-03-27T09:12:12.813500386Z caller=loki.go:461 msg="Loki started"
level=info ts=2023-03-27T09:12:42.279652042Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"
level=error ts=2023-03-27T09:12:46.708369563Z caller=reporter.go:203 msg="failed to delete corrupted cluster seed file, deleting it" err="BadRequest: Invalid token.\n\tstatus code: 400, request id: txbc43b8e6a0384bc79bcdb-0064215e0e, host id: txbc43b8e6a0384bc79bcdb-0064215e0e"
level=info ts=2023-03-27T09:12:52.279587151Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"
level=info ts=2023-03-27T09:13:02.280118627Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"
level=info ts=2023-03-27T09:13:12.280012774Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"
level=info ts=2023-03-27T09:13:22.280004704Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"

I checked and the headless service seems present:

 $ kubectl -n logging-new get svc
NAME                                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
loki-new-loki-distributed-compactor                 ClusterIP   10.43.20.194    <none>        3100/TCP                     19m
loki-new-loki-distributed-distributor               ClusterIP   10.43.167.185   <none>        3100/TCP,9095/TCP            18m
loki-new-loki-distributed-gateway                   ClusterIP   10.43.123.201   <none>        80/TCP                       18m
loki-new-loki-distributed-ingester                  ClusterIP   10.43.53.24     <none>        3100/TCP,9095/TCP            19m
loki-new-loki-distributed-ingester-headless         ClusterIP   None            <none>        3100/TCP,9095/TCP            19m
loki-new-loki-distributed-memberlist                ClusterIP   None            <none>        7946/TCP                     19m
loki-new-loki-distributed-querier                   ClusterIP   10.43.61.180    <none>        3100/TCP,9095/TCP            18m
loki-new-loki-distributed-querier-headless          ClusterIP   None            <none>        3100/TCP,9095/TCP            19m
loki-new-loki-distributed-query-frontend            ClusterIP   10.43.175.74    <none>        3100/TCP,9095/TCP,9096/TCP   18m
loki-new-loki-distributed-query-frontend-headless   ClusterIP   None            <none>        3100/TCP,9095/TCP,9096/TCP   19m

Helm chart version is 0.69.9

Why am I still getting this?

Could it be caused by the setting below in values.yaml, which perhaps should point to the headless endpoint instead? https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml#L181
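
For context, that values.yaml line feeds the frontend_worker section of the rendered Loki config (as far as I can tell), so pointing it at the headless endpoint would amount to something like the sketch below. The service name and gRPC port are taken from the listing above and depend on the release name; this is an illustration, not a tested fix:

```yaml
# Rough sketch of the rendered Loki config: the querier's worker
# dials the query-frontend over gRPC. Pointing frontend_address at
# the headless Service resolves every query-frontend pod IP instead
# of a single ClusterIP.
frontend_worker:
  frontend_address: loki-new-loki-distributed-query-frontend-headless.logging-new.svc.cluster.local:9095
```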

Thank you

heesuk-ahn commented 1 year ago

In my case, the querier establishes a connection to the query frontend when it starts.

However, because the querier reaches the query frontend through its regular Service, a 1:1 mapping may not happen when there are multiple query-frontend replicas.

So if you increase the number of queriers, or decrease the number of query frontends, the queriers and query frontends do get connected.

e.g. query-frontend: "1", querier: "3"
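
In chart terms, that ratio would be set roughly like this in values.yaml (assuming the replica counts live under the chart's queryFrontend and querier sections, as the issue description suggests for queryFrontend.replicas):

```yaml
# values.yaml sketch: more queriers than query-frontends, so every
# frontend receives at least one worker connection.
# (queryFrontend.replicas / querier.replicas are assumed chart keys)
queryFrontend:
  replicas: 1
querier:
  replicas: 3
```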

kladiv commented 1 year ago

I don't think it's related to the replica ratio. I suspect it's related to this model: https://grafana.com/docs/loki/latest/configuration/query-frontend/#grpc-mode-pull-model

kworkbee commented 1 year ago

I'm also running into this. It looks like publishNotReadyAddresses is missing from the querier headless Service. Could that be the problem?
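
For reference, publishNotReadyAddresses sits on the Service spec. A minimal sketch of the querier headless Service with it enabled could look like this (name and ports are copied from the listing in the issue; the selector labels are assumptions, not the chart's actual labels):

```yaml
# Sketch of a headless Service that also publishes not-yet-ready pod IPs.
# Name/ports match the service listing above; selector labels are assumed.
apiVersion: v1
kind: Service
metadata:
  name: loki-new-loki-distributed-querier-headless
  namespace: logging-new
spec:
  clusterIP: None                  # headless: DNS returns the pod IPs
  publishNotReadyAddresses: true   # expose endpoints before readiness passes
  selector:
    app.kubernetes.io/component: querier
  ports:
    - name: http
      port: 3100
      targetPort: 3100
    - name: grpc
      port: 9095
      targetPort: 9095
```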

sjentzsch commented 1 year ago

Same here, with 2 querier and 2 query-frontend pods. As @kladiv mentioned, I changed https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml#L181 to point to the headless service and then it worked. I'm not sure whether that's the proper solution, though, or whether side effects should be expected.

kworkbee commented 1 year ago

In my case, the change to the headless service does not seem to work properly. The querier has four replicas and the query frontend has two, each with autoscaling enabled, and the result is a CrashLoopBackOff (distributor / ingester / querier / queryFrontend).

rotarur commented 1 year ago

I'm also facing this error.

If I disable the queryScheduler, it works fine.
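
If it helps anyone reproduce, disabling the scheduler would be roughly this in values.yaml (queryScheduler.enabled is an assumed chart key; with the scheduler off, querier workers dial the query-frontend directly):

```yaml
# values.yaml sketch (queryScheduler.enabled is an assumed chart key):
# without the scheduler, querier workers connect straight to the
# query-frontend rather than going through the query-scheduler.
queryScheduler:
  enabled: false
```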

diranged commented 1 year ago

We're seeing this as well.

LukaszRacon commented 1 year ago

Please check the latest release. The frontend address was adjusted in loki-distributed-0.69.13: https://github.com/grafana/helm-charts/commit/3829417e0d113d24ea82ff9f0c6c631d20f95822. I no longer see this issue with the helm-loki-5.2.0 chart.

9numbernine9 commented 1 year ago

We also deployed the 5.2.0 Helm chart to some of our environments today and the issue appears to be resolved. :+1:

dorkamotorka commented 1 year ago

I encountered the same issue with the mimir-distributed Helm chart and resolved it by configuring the frontend_worker.scheduler_address parameter. More info here: https://grafana.com/docs/mimir/latest/references/configuration-parameters/#frontend_worker
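
A rough sketch of that setting in the Mimir config (the scheduler hostname and port below are placeholders for whatever the chart actually deploys, not my real values):

```yaml
# Mimir config sketch: point the querier workers at the query-scheduler.
# Hostname/port are placeholders for the deployed scheduler Service.
frontend_worker:
  scheduler_address: mimir-query-scheduler-headless.mimir.svc.cluster.local:9095
```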

sojjan1337 commented 8 months ago

Using S3 all the way solved this for me. Using the filesystem store with loki-distributed caused some weird problems.
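
For comparison, an S3-backed setup is roughly the following in the Loki config (bucket, region, and credentials are placeholders, not actual values):

```yaml
# Loki storage sketch: object storage on S3 instead of the local filesystem.
# Bucket, region, and credentials below are placeholders.
common:
  storage:
    s3:
      region: us-east-1
      bucketnames: my-loki-chunks
      access_key_id: EXAMPLE_ACCESS_KEY
      secret_access_key: EXAMPLE_SECRET_KEY
      s3forcepathstyle: false
```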

snk-actian commented 8 months ago

I have the same error with the grafana/loki (simple scalable) deployment, Helm chart version 5.2.0. It deploys 3 loki-read pods and only one gives that error; the other two are happy.

Edit: After restarting that failing pod, it becomes healthy.

bilsch-nice commented 6 months ago

I am having this same issue. My environment runs on Istio with mutual TLS enabled. If I disable mutual TLS, everything works fine (see the sketch after the helm output below).

❯ helm ls -n loki
NAME    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
loki    loki            2               2024-03-21 15:57:58.90046894 -0400 EDT  deployed        loki-distributed-0.78.3 2.9.4
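
For anyone else on Istio: "disable mutual TLS" here means relaxing the mesh policy for the Loki namespace, for example with a PeerAuthentication like the sketch below (namespace and name are illustrative; PERMISSIVE allows plaintext alongside mTLS, while DISABLE turns mTLS off entirely):

```yaml
# Istio sketch: relax mTLS in the loki namespace so querier <-> query-frontend
# gRPC is not forced through mutual TLS. Namespace and name are illustrative.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: loki-mtls-relaxed
  namespace: loki
spec:
  mtls:
    mode: PERMISSIVE   # use DISABLE to turn mTLS off completely
```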