Grafana Connection Problems: "Cannot connect to Loki. 504. [object Object]"

slashr commented 1 year ago

Describe the bug

Unable to add a Loki datasource to Grafana. It fails with the following error message: "Loki: Cannot connect to Loki. 504. [object Object]"

To Reproduce

Steps to reproduce the behavior:

1. Started Loki 2.6.1
2. Started Promtail 2.6.1
3. Tried to add Loki Datasource to Grafana

Expected behavior

Datasource should be connected with the success message "Data source connected and labels found."

Environment:

Infrastructure: Kubernetes 1.23
Deployment tool: Terraform helm_provider
Loki Storage: GCP Cloud Storage Bucket
Distributed Loki Chart version loki-distributed-0.56.6

Screenshots, Promtail config, or terminal output

Output of Loki Querier:

ts=2022-09-22T12:54:29.011098085Z caller=spanlogger.go:80 table-name=loki_index_19257 org_id=tenant-prod.example.com level=info msg="downloaded index set at query time" duration=45.652432384s
level=error ts=2022-09-22T12:55:29.415887781Z caller=series_index_store.go:527 org_id=tenant-prod.example.com msg="error querying storage" err="context canceled"
ts=2022-09-22T12:55:29.415969441Z caller=spanlogger.go:80 user=tenant-prod.example.com method=SeriesStore.LabelNamesForMetricName level=error msg=lookupLabelNamesBySeries err="context canceled"
ts=2022-09-22T12:55:29.416036391Z caller=spanlogger.go:80 user=tenant-prod.example.com method=query.Label level=info org_id=tenant-prod.example.com latency=slow query_type=labels length=10m0.057090582s duration=30.163831159s status=499 label= throughput=0B total_bytes=0B total_entries=0
level=error ts=2022-09-22T12:56:55.284315681Z caller=series_index_store.go:527 org_id=tenant-prod.example.com msg="error querying storage" err="context canceled"

What's weird is that our staging tenant datasource gets added to Grafana without any problems:

ts=2022-09-22T13:10:18.204311488Z caller=spanlogger.go:80 table-name=loki_index_19257 user-id=tenant-staging.example.com org_id=tenant-staging.example.com level=info msg="downloaded index set at query time" duration=49.414743ms
ts=2022-09-22T13:10:42.168925718Z caller=spanlogger.go:80 user=tenant-staging.example.com method=query.Label level=info org_id=tenant-staging.example.com latency=slow query_type=labels length=10m0.076577508s duration=24.037121869s status=200 label= throughput=0B total_bytes=0B total_entries

Chronology of events:

We had been using Loki in non-distributed mode since February this year, and it worked without any problems.
Around 3 weeks ago, we started having problems connecting to the Loki datasource.
We fixed it by increasing the request_timeout of GCP Storage to 5s.
This fixed the issue temporarily, but after 2-3 days the datasource was unreachable again.
We switched to the distributed Loki setup. Still no connection.
We switched to multi-tenancy, and now we have 3 tenant directories in the bucket instead of the default "fake" directory.
Now the staging and dev tenant datasources work fine, but prod still fails.

The Promtail configuration YAML is identical except for the tenant ID.

chaudum commented 1 year ago

The log line

ts=2022-09-22T12:54:29.011098085Z caller=spanlogger.go:80 table-name=loki_index_19257 org_id=tenant-prod.example.com level=info msg="downloaded index set at query time" duration=45.652432384s

indicates that it took 45s to download the index from object storage. There is definitely a problem with access to GCS. Can you verify with gsutil that you can access the index file?
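
For anyone comparing tenants the same way, here is a minimal Python sketch that scans querier log lines for the "downloaded index set at query time" message and reports the ones whose duration exceeds a threshold. The function names and the 1-second default threshold are assumptions made for illustration, not part of Loki; the field names (msg, duration, org_id, table-name) are taken from the log lines quoted in this thread.

```python
import re

# Go-style duration units expressed in seconds.
_UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9,
}
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
_DURATION_FULL = re.compile(r"(?:\d+(?:\.\d+)?(?:ns|us|µs|ms|s|m|h))+")

# key=value pairs; values may be double-quoted and contain spaces.
_LOGFMT_PAIR = re.compile(r'([\w.-]+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_go_duration(text):
    """Convert a Go duration string such as '10m0.05s' or '49.4ms' to seconds."""
    if not _DURATION_FULL.fullmatch(text):
        raise ValueError(f"not a Go duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit]
               for value, unit in _DURATION_PART.findall(text))


def parse_logfmt(line):
    """Parse one logfmt line into a dict, stripping quotes around values."""
    fields = {}
    for key, value in _LOGFMT_PAIR.findall(line):
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def slow_index_downloads(lines, threshold_seconds=1.0):
    """Return (org_id, table_name, seconds) for slow index downloads.

    threshold_seconds is an arbitrary illustrative cutoff, not a Loki default.
    """
    slow = []
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") != "downloaded index set at query time":
            continue
        duration = fields.get("duration")
        if duration is None:
            continue
        seconds = parse_go_duration(duration)
        if seconds > threshold_seconds:
            slow.append((fields.get("org_id"), fields.get("table-name"), seconds))
    return slow
```

Fed the two "downloaded index set" lines from this issue, it would flag only the prod tenant (about 45.65s) and skip staging (about 49ms), which makes the per-tenant difference easy to spot in larger log dumps.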

slashr commented 1 year ago

The problem was not with Loki but with Promtail. We were adding some high-cardinality labels to our log lines using Promtail. This resulted in heavily indexed logs in storage, and Loki timed out when trying to fetch them.

Removing the high-cardinality labels from the Promtail config fixed the issue for us.
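
To illustrate the kind of check that can surface this, here is a rough Python sketch that counts distinct values per label across a set of stream label sets and lists the labels above a limit. In Loki, each unique combination of label values becomes its own stream with its own index entries, so a label with many distinct values inflates the index. The thread does not say which labels were the culprits; the `request_id` label in the usage note, the function names, and the limit of 10 are all hypothetical.

```python
from collections import defaultdict


def label_cardinality(streams):
    """Count distinct values per label name.

    streams is an iterable of dicts mapping label name to label value,
    one dict per log stream.
    """
    values = defaultdict(set)
    for labels in streams:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def high_cardinality_labels(streams, limit=10):
    """Return label names with more than `limit` distinct values, highest first.

    The limit is an arbitrary illustrative cutoff, not a Loki setting.
    """
    counts = label_cardinality(streams)
    offenders = [name for name, count in counts.items() if count > limit]
    return sorted(offenders, key=lambda name: -counts[name])
```

For example, with 50 streams that share `app="api"` but each carry a different hypothetical `request_id`, `high_cardinality_labels` would return only `request_id`, which is the kind of label that is usually better left in the log line than promoted to a label.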

grafana / loki

Grafana Connection Problems: "Cannot connect to Loki. 504. [object Object]" #7225