grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Help request to troubleshoot MimirContinuousTestFailed alert #3931

Open wilfriedroset opened 1 year ago

wilfriedroset commented 1 year ago

Describe the bug

Hello, we are using continuous-test with the mixin to monitor our Mimir clusters. I'm unsure how to troubleshoot the MimirContinuousTestFailed alert for the write-read-series check. We have had a couple of cases over the last few days (see log messages below).

The alerts come from different clusters, each monitored by its own continuous-test instance, and apart from those there is nothing else in the continuous-test logs nor in Mimir's logs. If I had to guess, I would say this could be related to samples not yet being visible to the queriers, something like https://github.com/grafana/mimir/issues/3764

At this point I'm not sure whether this is a bug, so I would like to request guidance on how to troubleshoot such an issue.

To Reproduce

At this time I have not succeeded in reproducing the error. Steps to reproduce the behavior:

  1. Start Mimir 2.5.0
  2. Monitor it with continuous-test

Expected behavior

I would expect to see more messages in continuous-test or Mimir to help with troubleshooting.

Environment

Additional Context

ts=2023-01-06T21:25:42.87954582Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673015460000 end=1673030740000 step=20s results_cache=false msg="Range query result check failed" err="sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)"

ts=2023-01-06T21:25:42.853913842Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673015460000 end=1673030740000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)"

ts=2023-01-08T08:37:17.203808569Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673163420000 end=1673167020000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673167020000 (2023-01-08 08:37:00 +0000 UTC) has value -9505.809885 while was expecting -9510.565167"

ts=2023-01-09T07:15:32.350447487Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673244920000 end=1673248520000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673248520000 (2023-01-09 07:15:20 +0000 UTC) has value -2076.414045 while was expecting -2079.116897"
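
For anyone troubleshooting a similar failure, one way to dig in is to replay the exact range query from the log against the query-frontend and inspect the returned samples by hand. Below is a minimal sketch, assuming the standard Prometheus HTTP API exposed under Mimir's /prometheus prefix; the query-frontend URL and the X-Scope-OrgID tenant are placeholders for your environment, not values from this issue.

// Minimal sketch: replay the failing range query from the first log line above.
// The base URL and tenant ID are placeholders for your own environment.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	params := url.Values{}
	params.Set("query", `sum(max_over_time(mimir_continuous_test_sine_wave[1s]))`)
	// start/end/step taken from the failing log line (converted from milliseconds to seconds).
	params.Set("start", "1673015460")
	params.Set("end", "1673030740")
	params.Set("step", "20")

	req, err := http.NewRequest(http.MethodGet,
		"http://query-frontend.example:8080/prometheus/api/v1/query_range?"+params.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Scope-OrgID", "your-tenant") // tenant used by mimir-continuous-test

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}

Comparing the samples returned around the failing timestamp with the expected sine wave, and re-running the same query a few minutes later, can help tell a transient visibility issue from persistently wrong or missing data.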

pracucci commented 1 year ago

I see two different issues in the example logs.

sample at timestamp 1673248520000 (2023-01-09 07:15:20 +0000 UTC) has value -2076.414045 while was expecting -2079.116897

Both logs of this type fail the sample comparison on the last sample (the timestamp of the failed sample matches the query end parameter), but we run the comparison starting from the last sample and going backwards, so we don't know how many samples are actually different.

You suggested it could be something similar to https://github.com/grafana/mimir/issues/3764, but Mimir guarantees read-after-write. Since the mimir-continuous-test tool first writes and then reads back, read-after-write should be guaranteed (barring bugs).

Does it always happen for range queries (runRangeQueryAndVerifyResult) or instant queries (runInstantQueryAndVerifyResult) as well?

sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)

This log means there's 1 minute of gap in the output samples. Again, this shouldn't happen.
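
(Reading the timestamps in that log: with a 20s step the result should contain samples at 16:47:20, 16:47:40, 16:48:00 and 16:48:20, but it jumps from 16:47:20 straight to 16:48:20, so two consecutive samples are missing.)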

Are you running the query-frontend results cache with any custom config?

wilfriedroset commented 1 year ago

Does it always happen for range queries (runRangeQueryAndVerifyResult) or instant queries (runInstantQueryAndVerifyResult) as well?

We also have occurrences with runInstantQueryAndVerifyResult, but not in the time range listed above.

ts=2023-01-22T09:30:07.834706559Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runInstantQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) ts=1674217920000 results_cache=false msg="Instant query result check failed" err="expected 1 series in the result but got 0"

Are you running the query-frontend results cache with any custom config?

frontend:
  align_queries_with_step: true
  cache_results: true
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 419430400
    tls_ca_path: /var/lib/puppet/ssl/certs/ca.pem
    tls_cert_path: [redacted]
    tls_enabled: true
    tls_insecure_skip_verify: true
    tls_key_path: [redacted]
    tls_min_version: VersionTLS13
  log_queries_longer_than: 5s
  parallelize_shardable_queries: true
  results_cache:
    backend: memcached
    compression: snappy
    memcached:
      addresses: dns+mimir-index.memcached.service.consul:11211
      max_get_multi_batch_size: 4096
      max_get_multi_concurrency: 10000
      max_item_size: 0
      timeout: 1s
  scheduler_address: query-scheduler.mimir-grpc.service.consul:9095
frontend_worker:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 419430400
    tls_ca_path: /var/lib/puppet/ssl/certs/ca.pem
    tls_cert_path: [redacted]
    tls_enabled: true
    tls_insecure_skip_verify: true
    tls_key_path: [redacted]
    tls_min_version: VersionTLS13
  scheduler_address: query-scheduler.mimir-grpc.service.consul:9095

henry-kuehl commented 1 year ago

Commenting to note that we have been facing the same issue for some time now in our prod and dev environments. Curious where this goes.

henry-kuehl commented 1 year ago

Any updates on this one? Still failing for us intermittently.

pracucci commented 1 year ago

I currently don't have the bandwidth to follow up on this. To further debug it, I would try to add more detail to the log when the query check fails, to better understand for how many samples the comparison fails (e.g. is it always the last one or not?).
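
To make the suggestion concrete, here is a rough sketch of the kind of extra detail meant; it is not Mimir's actual comparison code, just an illustration of counting all mismatching samples instead of stopping at the first one. The sample type, tolerance and expected-value function are assumptions for the example.

// Sketch only: count every mismatching sample so the log can report how many
// samples differ and where the first difference is, not just the last one.
package main

import (
	"fmt"
	"math"
)

type sample struct {
	ts    int64 // timestamp in milliseconds
	value float64
}

// countMismatches compares the returned samples against the expected value for
// their timestamp and returns how many differ beyond the tolerance, plus the
// timestamp of the first mismatching sample (-1 if none).
func countMismatches(actual []sample, expected func(tsMillis int64) float64, tolerance float64) (int, int64) {
	mismatches := 0
	firstTS := int64(-1)
	for _, s := range actual {
		if math.Abs(s.value-expected(s.ts)) > tolerance {
			mismatches++
			if firstTS < 0 {
				firstTS = s.ts
			}
		}
	}
	return mismatches, firstTS
}

func main() {
	// Toy data: three samples, the last one is off.
	expected := func(int64) float64 { return 0 }
	actual := []sample{{1000, 0}, {2000, 0}, {3000, 42}}
	n, first := countMismatches(actual, expected, 1e-9)
	fmt.Printf("mismatching samples: %d, first at ts=%d\n", n, first)
}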

henry-kuehl commented 1 year ago

@pracucci did you have a chance to increase the logging output for this? Or shall I just increase the log level on our end to see what gets printed for the failure cases?

wilfriedroset commented 1 year ago

@henry-kuehl what is the value of align_queries_with_step in your case? For us it is true, and we have found that Grafana Cloud runs with false. We are testing with false now, but as the issue is quite sporadic for us, I'm not sure I will have feedback soon.

See: https://github.com/grafana/mimir/commit/7f0354c413

henry-kuehl commented 1 year ago

@wilfriedroset we upgraded to Mimir 2.7.2 two weeks back and now also run with the default value for align_queries_with_step. I still see the alert firing intermittently and could not observe an immediate change in behaviour. How did you come to the conclusion that this setting might have an effect?

wilfriedroset commented 1 year ago

We have been trying many things around continuous-test without any luck. The reasoning behind looking at align_queries_with_step was that the timestamps used by mimir-continuous-test might be slightly modified by Mimir if they are not properly aligned to the step. Reading the code, we can see that mimir-continuous-test already aligns them itself, so we did not expect any change from this setting; we just wanted to rule out the hypothesis.
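
For context, this is the alignment in question, sketched under the assumption that aligning simply snaps a timestamp down to a multiple of the step; the helper name is made up for the illustration and is not Mimir's function.

// Illustrative only: snap a timestamp down to the nearest multiple of the step.
// With align_queries_with_step enabled, the query-frontend applies this kind of
// rounding to the range query start/end; mimir-continuous-test already sends
// aligned timestamps, which is why the setting was not expected to matter.
package main

import "fmt"

func alignDown(tsMillis, stepMillis int64) int64 {
	return tsMillis - tsMillis%stepMillis
}

func main() {
	step := int64(20_000)            // 20s step, as in the failing queries above
	ts := int64(1_673_015_467_000)   // an arbitrary unaligned timestamp
	fmt.Println(alignDown(ts, step)) // prints 1673015460000, an aligned boundary like the start in the logs
}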

henry-kuehl commented 1 year ago

Any update from anyone on this one? Even with the current Mimir version we still see it failing intermittently.