grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0
23.28k stars 3.37k forks source link

Missing logs on Loki and receiving 500 “context canceled” response on /loki/api/v1/push #4221

Open sdomnin opened 3 years ago

sdomnin commented 3 years ago

Describe the bug Logs are partially missing and Loki logged "msg="POST /loki/api/v1/push (500) 1.045966ms Response: \"context canceled\\n\""

To Reproduce Steps to reproduce the behavior:

  1. Started Loki (latest version)
  2. Using Java Logback appender for Grafana Loki
  3. Query: {} term

Expected behavior All the logs to be available on Loki side

Environment:

Screenshots, Promtail config, or terminal output Loki config:

Screenshot 2021-08-26 at 19 49 04

Loki distributor logs: ts=2021-08-26T15:14:23.50772613Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-dev-loki-distributed-querier-0-c0f1bba2 has failed, no acks received" level=warn ts=2021-08-26T15:14:25.043936738Z caller=logging.go:71 traceID=39822c11f0d6f598 msg="POST /loki/api/v1/push (500) 940.119µs Response: \"context canceled\\n\" ws: false; Connection: close; Content-Length: 4394; Content-Type: application/x-protobuf; Traceparent: 00-22e7da0ec95c335ef6213b36cd75806f-18317a0327006465-01; X-Forwarded-For: 192.168.118.225; X-Forwarded-Host: test-dev-loki.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Forwarded-Scheme: https; X-Original-Uri: /loki/api/v1/push; X-Real-Ip: 192.168.118.225; X-Request-Id: 2c0a704594e60e95b0421a174b12d29e; X-Scheme: https; " level=warn ts=2021-08-26T15:14:47.946706075Z caller=logging.go:71 traceID=ffe66da4c4be9d5 msg="POST /loki/api/v1/push (500) 1.645579ms Response: \"context canceled\\n\" ws: false; Connection: close; Content-Length: 23983; Content-Type: application/x-protobuf; Traceparent: 00-21c318103a1df1c95d6a78a091ce3834-3d522777a20075be-01; X-Forwarded-For: 192.168.118.225; X-Forwarded-Host: test-dev-loki.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Forwarded-Scheme: https; X-Original-Uri: /loki/api/v1/push; X-Real-Ip: 192.168.118.225; X-Request-Id: 46a85f096e9355f0cf71df3d14f00e3e; X-Scheme: https; " ts=2021-08-26T15:16:02.507542394Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to loki-dev-loki-distributed-querier-0-c0f1bba2 but other probes failed, network may be misconfigured" ts=2021-08-26T15:16:05.507623495Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: read tcp 192.168.110.93:57560->192.168.87.222:7946: i/o timeout" ts=2021-08-26T15:16:05.507839879Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-dev-loki-distributed-querier-0-c0f1bba2 has failed, no acks received" ts=2021-08-26T15:16:07.507627484Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: read tcp 192.168.110.93:57636->192.168.87.222:7946: i/o timeout" ts=2021-08-26T15:16:07.507720092Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-dev-loki-distributed-querier-0-c0f1bba2 has failed, no acks received" level=warn ts=2021-08-26T15:16:44.109985853Z caller=logging.go:71 traceID=177340f20f2ccb0a msg="POST /loki/api/v1/push (500) 1.045966ms Response: \"context canceled\\n\" ws: false; Connection: close; Content-Length: 6314; Content-Type: application/x-protobuf; Traceparent: 00-ad1a1f8772e005fe24bd3b52ed996223-c66f97adc9041fe7-01; X-Forwarded-For: 192.168.118.225; X-Forwarded-Host: test-dev-loki.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Forwarded-Scheme: https; X-Original-Uri: /loki/api/v1/push; X-Real-Ip: 192.168.118.225; X-Request-Id: da73474c76c25d011e907995a088e8c3; X-Scheme: https; " level=warn ts=2021-08-26T15:16:55.805296566Z caller=logging.go:71 traceID=5023646bd2964fa8 msg="POST /loki/api/v1/push (500) 354.854µs Response: \"context canceled\\n\" ws: false; Connection: close; Content-Length: 2021; Content-Type: application/x-protobuf; Traceparent: 00-a03aba91b53987591c620306a64dd57d-5124651fa44295fa-01; X-Forwarded-For: 192.168.118.225; X-Forwarded-Host: test-dev-loki.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Forwarded-Scheme: https; X-Original-Uri: /loki/api/v1/push; X-Real-Ip: 192.168.118.225; X-Request-Id: e866d6eb9274460cb1e0cff6474f645f; X-Scheme: https; " level=warn ts=2021-08-26T15:17:35.744580609Z caller=logging.go:71 traceID=6e19ffd316688fd9 msg="POST /loki/api/v1/push (500) 707.415µs Response: \"context canceled\\n\" ws: false; Connection: close; Content-Length: 9902; Content-Type: application/x-protobuf; Traceparent: 00-2ff65bbad608465a0ab7321dc51d7186-447d85ccbd1e27c7-01; X-Forwarded-For: 192.168.118.225; X-Forwarded-Host: test-dev-loki.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Forwarded-Scheme: https; X-Original-Uri: /loki/api/v1/push; X-Real-Ip: 192.168.118.225; X-Request-Id: 099cca6e657fa8a40aecdea071f9a126; X-Scheme: https; " level=warn ts=2021-08-26T15:17:53.627853857Z caller=logging.go:71 traceID=6de611da02a822f msg="POST /loki/api/v1/push (500) 1.173128ms Response: \"context canceled\\n\" ws: false; Connection: close; Content-Length: 22224; Content-Type: application/x-protobuf; Traceparent: 00-3d5640a1cf1dc98232e94ffd02c4c85e-2bb9fcddf642b614-01; X-Forwarded-For: 192.168.118.225; X-Forwarded-Host: test-dev-loki.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Forwarded-Scheme: https; X-Original-Uri: /loki/api/v1/push; X-Real-Ip: 192.168.118.225; X-Request-Id: cf1c9837f00c8759772bbc8c32cbdd8e; X-Scheme: https; " ts=2021-08-26T15:18:07.111403777Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: loki-dev-loki-distributed-querier-0-c0f1bba2)"

liguozhong commented 3 years ago

https://github.com/grafana/loki/pull/4238 and https://github.com/grafana/loki/issues/4260

JoeyEamigh commented 2 years ago

was anyone able to find a concrete fix on this?

marcusteixeira commented 2 years ago

Hello guys! I believe this may be correlated with the issue #5177. Where a Pull Request has already been merged to resolve the issue: #5179

This commit has not yet been applied to a release, but I believe this can be validated via the main branch.

step-baby commented 2 years ago

2.5.0 also has this problem

level=warn ts=2022-04-13T03:53:53.820649479Z caller=logging.go:72 traceID=3ddb4bd64dc8aea9 orgID=fake msg="POST /loki/api/v1/push (500) 1.000373991s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 3272; Content-Type: application/x-protobuf; User-Agent: promtail/2.5.0; "
irizzant commented 2 years ago

I'm hitting the same issue here with 2.5.0 Thanks to this reply:

This commit has not yet been applied to a release, but I believe this can be validated via the main branch.

I tried using the main version and I still receive 500 errors:

2022-05-10 11:01:03 +0200 [warn]: #0 failed to write post to http://gateway.loki.svc.cluster.local/loki/api/v1/push (500 Internal Server Error rpc error: code = DeadlineExceeded desc = context deadline exceeded

but it also looks like retries manage to send chunks later:

2022-05-10 11:01:04 +0200 [warn]: #0 retry succeeded. chunk_id="5dea4906279e1b67bd8cd8edb759684c"

I can also see these 500 errors occur in distributors pods after the POST to loki/api/v1/push waits for 2s , so it looks like this is a default timeout.

I can also confirm there are less 500 errors than in version 2.5.0

UPDATE: Adding the below image showing that in my case Consul is showing the highest latency, especially for requests coming from Compactor and Distributor: image

irizzant commented 1 year ago

I'm now in version 2.6.1 and I still see unbelievable latency between Consul and requests coming from Compactor and Distributor: image

I also attach the Distributor latency metrics: image

the Ingester's: image

and Consul's: image