grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

[Loki] Incorrect server block configuration does not result in application failure #6528

Open bschurig opened 2 years ago

bschurig commented 2 years ago

Describe the bug

Setting http_server_write_timeout or http_server_read_timeout to a bare number without a duration unit does not result in an error. When running on Kubernetes, the pod starts but never becomes healthy because the readiness probes fail. At log level info, the application logs nothing to indicate that the server block is misconfigured.

To Reproduce

Incorrect Config: Set the value of the timeout as an int instead of a string with a duration:

  config: |
    auth_enabled: false
    server:
      log_level: info
      http_listen_port: 3100
      http_server_read_timeout: 910 # values should have a duration
      http_server_write_timeout: 910
      http_server_idle_timeout: 180

Valid Config: Set the value of the timeout with a duration:

  config: |
    auth_enabled: false
    server:
      log_level: info
      http_listen_port: 3100
      http_server_read_timeout: 910s
      http_server_write_timeout: 910s
      http_server_idle_timeout: 180s

1. Started Loki (SHA or version): v2.5.0, using the distributed Helm chart.

Expected behavior

When the server block is misconfigured, the application should log an error describing the problem and exit, so that the misconfiguration can be found and fixed.
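As an illustration of the kind of fail-fast check described above, here is a minimal Python sketch. It is not Loki's validation code: the function name `find_unitless_timeouts` and the list of keys are hypothetical, taken from the config in this report, and the regex assumes Go-style duration strings such as `910s` or `1m30s`.

```python
import re
from typing import List

# Assumption: these are the duration-typed keys to check, taken from
# the server block shown in this report.
DURATION_KEYS = (
    "http_server_read_timeout",
    "http_server_write_timeout",
    "http_server_idle_timeout",
)

# One or more "<number><unit>" groups, e.g. "910s" or "1m30s".
_DURATION_RE = re.compile(r"^(\d+(\.\d+)?(ns|us|µs|ms|s|m|h))+$")


def find_unitless_timeouts(server_cfg: dict) -> List[str]:
    """Return the duration keys whose values lack a time unit."""
    problems = []
    for key in DURATION_KEYS:
        if key not in server_cfg:
            continue
        # YAML parses `910` as an int, so normalise to text first.
        if not _DURATION_RE.match(str(server_cfg[key])):
            problems.append(key)
    return problems


if __name__ == "__main__":
    incorrect = {
        "http_listen_port": 3100,
        "http_server_read_timeout": 910,
        "http_server_write_timeout": 910,
        "http_server_idle_timeout": 180,
    }
    for key in find_unitless_timeouts(incorrect):
        print(f"server.{key}: value has no duration unit")
```

A check like this could run before deployment and exit non-zero when the returned list is not empty.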

Environment:

Screenshots, Promtail config, or terminal output

When looking at the logs of two different replicas, it's impossible to tell which one has the incorrect configuration. In the extended logs of the misconfigured replica, the only difference is that there are log lines from failed scrapes of /metrics.

Pod logs from a failing distributor replica: image

Pod logs from a distributor replica after the configuration was corrected (the warning from the other screenshot had been addressed in this example): image

DylanGuedes commented 2 years ago

Hey, thanks for reporting this. It does work: when you don't specify a unit for the duration, it is inferred to be in nanoseconds. Changing that would be a breaking change, but it would be good to be more verbose when a configured value does not look appropriate.
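To make the nanosecond interpretation concrete, here is a small Python sketch of the arithmetic. It does not call Loki; the helper name `bare_int_as_seconds` is hypothetical, and it simply assumes, per the comment above, that a unit-less integer is read as a count of nanoseconds.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def bare_int_as_seconds(value: int) -> float:
    """Treat a unit-less integer as nanoseconds and convert it to seconds."""
    return value / NANOSECONDS_PER_SECOND


if __name__ == "__main__":
    # Values from the incorrect config in this report.
    timeouts = {
        "http_server_read_timeout": 910,
        "http_server_write_timeout": 910,
        "http_server_idle_timeout": 180,
    }
    for name, raw in timeouts.items():
        print(f"{name}: {raw} -> {bare_int_as_seconds(raw):.9f} s")
```

Under that assumption, a value of 910 is a timeout of less than one microsecond rather than 910 seconds, which is consistent with the failing readiness probes and /metrics scrapes in the report.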

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

We are doing our best to respond to, organize, and prioritize all issues, but it can be a challenging task. Our sincere apologies if you find yourself at the mercy of the stalebot.