grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

msg="Error in upgrading websocket" err="badResponseLoggingWriter: can't cast underlying response writer to Hijacker" #3499

Closed: arcivanov closed this issue 3 years ago

arcivanov commented 3 years ago

Running a microservices deployment. Grafana runs queries perfectly and promtail pushes logs. However, live tailing does not work due to a WebSocket upgrade failure. We have a standalone Grafana in the same cluster and WebSocket works there.

$ kubectl -n loki-test logs my-loki-distributed-querier-1  loki -f
level=info ts=2021-03-16T18:27:01.633676411Z caller=main.go:130 msg="Starting Loki" version="(version=2.2.0, branch=HEAD, revision=a27c79913)"
level=info ts=2021-03-16T18:27:01.634076185Z caller=server.go:229 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2021-03-16T18:27:01.63612292Z caller=shipper_index_client.go:100 msg="starting boltdb shipper in 1 mode"
level=info ts=2021-03-16T18:27:01.63871254Z caller=worker.go:109 msg="Starting querier worker connected to query-frontend" frontend=my-loki-distributed-query-frontend:9095
level=info ts=2021-03-16T18:27:01.640873955Z caller=module_service.go:59 msg=initialising module=memberlist-kv
level=info ts=2021-03-16T18:27:01.640873985Z caller=module_service.go:59 msg=initialising module=server
level=info ts=2021-03-16T18:27:01.64125056Z caller=module_service.go:59 msg=initialising module=ring
level=info ts=2021-03-16T18:27:01.641424912Z caller=module_service.go:59 msg=initialising module=ingester-querier
level=info ts=2021-03-16T18:27:01.641481163Z caller=module_service.go:59 msg=initialising module=store
level=info ts=2021-03-16T18:27:01.641521613Z caller=module_service.go:59 msg=initialising module=querier
level=info ts=2021-03-16T18:27:01.641733185Z caller=loki.go:248 msg="Loki started"

A failure occurs immediately when trying to get the live stream of the logs:

level=error ts=2021-03-17T02:24:02.016278236Z caller=http.go:217 org_id=dev traceID=9361ef54076a59e msg="Error in upgrading websocket" err="badResponseLoggingWriter: can't cast underlying response writer to Hijacker"
level=warn ts=2021-03-17T02:24:02.016350506Z caller=logging.go:71 traceID=9361ef54076a59e msg="GET /loki/api/v1/tail?query=%7Bapp%3D%22service-api%22%7D (500) 144.481µs Response: \"Internal Server Error\\n\" ws: true; Accept-Encoding: gzip, deflate, br; Accept-Language: en-US,en;q=0.9,ru;q=0.8; Cache-Control: no-cache; Connection: Upgrade; Content-Length: 0; Pragma: no-cache; Sec-Websocket-Extensions: permessage-deflate; client_max_window_bits; Sec-Websocket-Key: <snip>; Sec-Websocket-Version: 13; Upgrade: websocket; User-Agent: Grafana/7.2.1; X-B3-Parentspanid: 76abc8935fba5af9; X-B3-Sampled: 0; X-B3-Spanid: 9adc70d9c6c5f0a1; X-B3-Traceid: 63a9c1b3e6a72b6e15cfbb1f0bfc85fc; X-Envoy-Attempt-Count: 1; X-Envoy-External-Address: 100.116.218.195; X-Envoy-Original-Path: /grafana/api/datasources/proxy/3/loki/api/v1/tail?query=%7Bapp%3D%22service-api%22%7D; X-Forwarded-Client-Cert: By=spiffe://cluster.local/ns/loki-test/sa/my-loki-distributed;Hash=6568e502f254078a8c56b33d51a60b0db004de8b483025143a44c962098319af;Subject=\"\";URI=spiffe://cluster.local/ns/monitoring/sa/grafana; X-Forwarded-For: <snip>, 127.0.0.1, 127.0.0.1,100.116.218.195,100.116.218.195; X-Forwarded-Proto: http; X-Request-Id: 12fbbd67-0bde-471a-90ca-f6d27de30057; X-Scope-Orgid: dev; uber-trace-id: 09361ef54076a59e:09361ef54076a59e:0000000000000000:0; "
arcivanov commented 3 years ago

I presume that the failure is here:

https://github.com/grafana/cortex-tools/blob/16b83f14cc7202f61cf4bd7a0d69ed73e29caa12/vendor/github.com/gorilla/websocket/server.go#L178

but because the failure obscures the true cause, it's pretty hard to figure out what's going on.
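
For context, gorilla/websocket's Upgrade hijacks the underlying TCP connection, so every http.ResponseWriter in the middleware chain has to support http.Hijacker. A minimal Go sketch of that failure mode (purely illustrative, not Loki's actual code; the wrapper name is borrowed from the error message):

package main

import (
	"bufio"
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
)

// loggingWriter stands in for a badResponseLoggingWriter-style wrapper that
// inspects responses while delegating to an inner http.ResponseWriter.
type loggingWriter struct {
	http.ResponseWriter
}

// Hijack can only succeed if the wrapped writer itself supports hijacking.
func (w loggingWriter) Hijack() (net.Conn, *bufio.ReadWriter, error) {
	h, ok := w.ResponseWriter.(http.Hijacker)
	if !ok {
		return nil, nil, errors.New("can't cast underlying response writer to Hijacker")
	}
	return h.Hijack()
}

func main() {
	// httptest.ResponseRecorder does not implement http.Hijacker, so any
	// websocket upgrade routed through this wrapper fails with the same error
	// as in the log line above.
	w := loggingWriter{ResponseWriter: httptest.NewRecorder()}
	if _, _, err := w.Hijack(); err != nil {
		fmt.Println("upgrade would fail:", err)
	}
}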

arcivanov commented 3 years ago

The issue is localized to the query-frontend: the same query that returns a 500 when sent through the query-frontend succeeds when sent directly to the querier.

$ kubectl -n loki-test port-forward loki-distributed-querier-0 3100
Forwarding from 127.0.0.1:3100 -> 3100
Forwarding from [::1]:3100 -> 3100
Handling connection for 3100
$ curl -vvvv -H"X-Scope-OrgID: dev" -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Extensions: permessage-deflate' -H 'Sec-WebSocket-Key: v4vMUSLqpDDrrvhrCqfE+Q=='  -H 'Connection: keep-alive, Upgrade' -H 'Upgrade: websocket' -H 'X-Hello: world' 'http://localhost:3100/loki/api/v1/tail?query=%7Bapp%3D%22loki-distributed%22%7D%20%7C%3D%22websocket%22'
*   Trying ::1:3100...
* Connected to localhost (::1) port 3100 (#0)
> GET /loki/api/v1/tail?query=%7Bapp%3D%22loki-distributed%22%7D%20%7C%3D%22websocket%22 HTTP/1.1
> Host: localhost:3100
> User-Agent: curl/7.71.1
> Accept: */*
> X-Scope-OrgID: dev
> Sec-WebSocket-Version: 13
> Sec-WebSocket-Extensions: permessage-deflate
> Sec-WebSocket-Key: v4vMUSLqpDDrrvhrCqfE+Q==
> Connection: keep-alive, Upgrade
> Upgrade: websocket
> X-Hello: world
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 101 Switching Protocols
< Upgrade: websocket
< Connection: Upgrade
< Sec-WebSocket-Accept: ofNzHVvj/ErTqrfW0S1pvqEyhpw=
< 
�~��{"streams":[{"stream":{"job":"loki-test/loki-distributed","namespace":"loki-test","node_name":"ip-10-0-150-250.ec2.internal","pod":"loki-distributed-query-frontend-7466f56c8f-sjnhj","app":"loki-distributed","component":"query-frontend","container":"loki","filename"
$ kubectl -n loki-test port-forward loki-distributed-query-frontend-7466f56c8f-qblhv 3100
Forwarding from 127.0.0.1:3100 -> 3100
Forwarding from [::1]:3100 -> 3100
Handling connection for 3100
$ curl -vvvv -H"X-Scope-OrgID: dev" -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Extensions: permessage-deflate' -H 'Sec-WebSocket-Key: v4vMUSLqpDDrrvhrCqfE+Q=='  -H 'Connection: keep-alive, Upgrade' -H 'Upgrade: websocket' -H 'X-Hello: world' 'http://localhost:3100/loki/api/v1/tail?query=%7Bapp%3D%22loki-distributed%22%7D%20%7C%3D%22websocket%22'
*   Trying ::1:3100...
* Connected to localhost (::1) port 3100 (#0)
> GET /loki/api/v1/tail?query=%7Bapp%3D%22loki-distributed%22%7D%20%7C%3D%22websocket%22 HTTP/1.1
> Host: localhost:3100
> User-Agent: curl/7.71.1
> Accept: */*
> X-Scope-OrgID: dev
> Sec-WebSocket-Version: 13
> Sec-WebSocket-Extensions: permessage-deflate
> Sec-WebSocket-Key: v4vMUSLqpDDrrvhrCqfE+Q==
> Connection: keep-alive, Upgrade
> Upgrade: websocket
> X-Hello: world
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< Sec-Websocket-Version: 13
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< Date: Fri, 19 Mar 2021 03:45:25 GMT
< Content-Length: 22
< 
Internal Server Error
* Connection #0 to host localhost left intact
sherifkayad commented 3 years ago

@arcivanov are there any updates? Did you manage to solve it? I can confirm your observation above: directly querying the querier works, but querying the query-frontend gives an internal server error.

BTW, I am using the latest Loki version, 2.2.1, and deploying the distributed mode with Helm (chart v0.31.2).

sherifkayad commented 3 years ago

@arcivanov After days of debugging I finally found the problem. You have to explicitly configure the query frontend for tailing. The property shown below defaults to an empty string (""), which causes the 500 error on the query-frontend side.

## In the Loki config, query-frontend section:
frontend:
  ..
  tail_proxy_url: http://YOUR_CLUSTER_INTERNAL_LOKI_QUERIER_SVC:3100

Of course, YOUR_CLUSTER_INTERNAL_LOKI_QUERIER_SVC should be replaced with the service name of the querier.
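
For example, in this thread's setup the querier pods are named loki-distributed-querier-*, so the filled-in block might look like the sketch below (the service name and cluster domain are assumptions; check kubectl -n <namespace> get svc for the actual querier Service name):

frontend:
  ..
  tail_proxy_url: http://loki-distributed-querier.loki-test.svc.cluster.local:3100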

Source: https://grafana.com/docs/loki/latest/configuration/#query_frontend_config

I will open a PR to the Helm Charts repo and will post the link here so that people using Helm don't have to suffer like I did 😄

sherifkayad commented 3 years ago

Here's my PR to solve this on the Helm Chart level https://github.com/grafana/helm-charts/pull/456

arcivanov commented 3 years ago

@arcivanov are there any updates? Did you manage to solve it?

We downgraded to the previous working version.

arcivanov commented 3 years ago

Here's my PR to solve this on the Helm Chart level grafana/helm-charts#456

That's wonderful, thank you so much for this!!!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

avinashs433 commented 3 years ago

Hi, we are also experiencing this issue on the following versions: Grafana 8.1.5, Loki 2.3.0.

sherifkayad commented 3 years ago

@avinashs433 It should be a config issue that you need to adjust. Check my comment: https://github.com/grafana/loki/issues/3499#issuecomment-849487447