grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Help request to troubleshoot MimirContinuousTestFailed alert #3931

Open wilfriedroset opened 1 year ago

wilfriedroset commented 1 year ago

Describe the bug

Hello, we are using continuous-test with the mixin to monitor our Mimir clusters. I'm unsure how to troubleshoot the MimirContinuousTestFailed alert for the write-read-series check. We have had a couple of cases over the last few days (see log messages below).

The alerts come from different clusters, each monitored by its own continuous-test instance, and apart from those there is nothing else in the continuous-test logs nor in Mimir's logs. If I had to guess, I would say this could be related to samples not yet being visible to the queriers, something like https://github.com/grafana/mimir/issues/3764

At this point I'm not sure whether this is a bug, so I would like to request guidance on how to troubleshoot such an issue.

To Reproduce

At this time I have not succeeded in reproducing the error. Steps to reproduce the behavior:

  1. Start Mimir 2.5.0
  2. Monitor it with continuous-test

Expected behavior

I would expect to see more messages in continuous-test or Mimir to help with troubleshooting.

Environment

Additional Context

ts=2023-01-06T21:25:42.87954582Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673015460000 end=1673030740000 step=20s results_cache=false msg="Range query result check failed" err="sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)"

ts=2023-01-06T21:25:42.853913842Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673015460000 end=1673030740000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)"

ts=2023-01-08T08:37:17.203808569Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673163420000 end=1673167020000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673167020000 (2023-01-08 08:37:00 +0000 UTC) has value -9505.809885 while was expecting -9510.565167"

ts=2023-01-09T07:15:32.350447487Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673244920000 end=1673248520000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673248520000 (2023-01-09 07:15:20 +0000 UTC) has value -2076.414045 while was expecting -2079.116897"
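
For anyone troubleshooting a similar failure, one way to dig in is to replay the exact range query from the log against the query-frontend and inspect the returned samples by hand. Below is a minimal sketch, assuming the standard Prometheus HTTP API exposed under Mimir's /prometheus prefix; the query-frontend URL and the X-Scope-OrgID tenant are placeholders for your environment, not values from this issue.

// Minimal sketch: replay the failing range query from the first log line above.
// The base URL and tenant ID are placeholders for your own environment.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	params := url.Values{}
	params.Set("query", `sum(max_over_time(mimir_continuous_test_sine_wave[1s]))`)
	// start/end/step taken from the failing log line (converted from milliseconds to seconds).
	params.Set("start", "1673015460")
	params.Set("end", "1673030740")
	params.Set("step", "20")

	req, err := http.NewRequest(http.MethodGet,
		"http://query-frontend.example:8080/prometheus/api/v1/query_range?"+params.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Scope-OrgID", "your-tenant") // tenant used by mimir-continuous-test

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}

Comparing the samples returned around the failing timestamp with the expected sine wave, and re-running the same query a few minutes later, can help tell a transient visibility issue from persistently wrong or missing data.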

pracucci commented 1 year ago

I see two different issues in the example logs.

sample at timestamp 1673248520000 (2023-01-09 07:15:20 +0000 UTC) has value -2076.414045 while was expecting -2079.116897

Both logs of this type fail the sample comparison on the last sample (the timestamp of the failed sample matches the query end parameter), but we run the comparison starting from the last sample and going backwards, so we don't know how many samples are actually different.

You suggested it could be something similar to https://github.com/grafana/mimir/issues/3764, but Mimir guarantees read-after-write. Since the mimir-continuous-test tool first writes and then reads back, read-after-write should be guaranteed (barring bugs).

Does it always happen for range queries (runRangeQueryAndVerifyResult) or instant queries (runInstantQueryAndVerifyResult) as well?

sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)

This log means there's 1 minute of gap in the output samples. Again, this shouldn't happen.
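
(Reading the timestamps in that log: with a 20s step the result should contain samples at 16:47:20, 16:47:40, 16:48:00 and 16:48:20, but it jumps from 16:47:20 straight to 16:48:20, so two consecutive samples are missing.)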

Are you running the query-frontend results cache with any custom config?

wilfriedroset commented 1 year ago

Does it always happen for range queries (runRangeQueryAndVerifyResult) or instant queries (runInstantQueryAndVerifyResult) as well?

We also have occurrences with runInstantQueryAndVerifyResult, but not in the time range listed above.

ts=2023-01-22T09:30:07.834706559Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runInstantQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) ts=1674217920000 results_cache=false msg="Instant query result check failed" err="expected 1 series in the result but got 0"

Are you running the query-frontend results cache with any custom config?

frontend:
  align_queries_with_step: true
  cache_results: true
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 419430400
    tls_ca_path: /var/lib/puppet/ssl/certs/ca.pem
    tls_cert_path: [redacted]
    tls_enabled: true
    tls_insecure_skip_verify: true
    tls_key_path: [redacted]
    tls_min_version: VersionTLS13
  log_queries_longer_than: 5s
  parallelize_shardable_queries: true
  results_cache:
    backend: memcached
    compression: snappy
    memcached:
      addresses: dns+mimir-index.memcached.service.consul:11211
      max_get_multi_batch_size: 4096
      max_get_multi_concurrency: 10000
      max_item_size: 0
      timeout: 1s
  scheduler_address: query-scheduler.mimir-grpc.service.consul:9095
frontend_worker:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 419430400
    tls_ca_path: /var/lib/puppet/ssl/certs/ca.pem
    tls_cert_path: [redacted]
    tls_enabled: true
    tls_insecure_skip_verify: true
    tls_key_path: [redacted]
    tls_min_version: VersionTLS13
  scheduler_address: query-scheduler.mimir-grpc.service.consul:9095

henry-kuehl commented 1 year ago

Commenting to note that we have been facing the same issue for some time now in our prod and dev environments. Curious where this goes.

henry-kuehl commented 1 year ago

Any updates on this one? Still failing for us intermittently.

pracucci commented 1 year ago

I currently don't have the bandwidth to follow up on this. To further debug it, I would try to add more detail to the log when the query check fails, to better understand for how many samples the comparison fails (e.g. is it always the last one or not?).
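
To make the suggestion concrete, here is a rough sketch of the kind of extra detail meant; it is not Mimir's actual comparison code, just an illustration of counting all mismatching samples instead of stopping at the first one. The sample type, tolerance and expected-value function are assumptions for the example.

// Sketch only: count every mismatching sample so the log can report how many
// samples differ and where the first difference is, not just the last one.
package main

import (
	"fmt"
	"math"
)

type sample struct {
	ts    int64 // timestamp in milliseconds
	value float64
}

// countMismatches compares the returned samples against the expected value for
// their timestamp and returns how many differ beyond the tolerance, plus the
// timestamp of the first mismatching sample (-1 if none).
func countMismatches(actual []sample, expected func(tsMillis int64) float64, tolerance float64) (int, int64) {
	mismatches := 0
	firstTS := int64(-1)
	for _, s := range actual {
		if math.Abs(s.value-expected(s.ts)) > tolerance {
			mismatches++
			if firstTS < 0 {
				firstTS = s.ts
			}
		}
	}
	return mismatches, firstTS
}

func main() {
	// Toy data: three samples, the last one is off.
	expected := func(int64) float64 { return 0 }
	actual := []sample{{1000, 0}, {2000, 0}, {3000, 42}}
	n, first := countMismatches(actual, expected, 1e-9)
	fmt.Printf("mismatching samples: %d, first at ts=%d\n", n, first)
}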

henry-kuehl commented 1 year ago

@pracucci did you have a chance to increase the logging output for this? Or shall I just increase the log level on our end to see what gets printed for the failure cases?

wilfriedroset commented 1 year ago

@henry-kuehl what is the value of align_queries_with_step in your case? For us it is true, and we have found that Grafana Cloud runs with false. We are testing with false now, but as the issue is quite sporadic for us, I'm not sure I will have feedback soon.

See: https://github.com/grafana/mimir/commit/7f0354c413

henry-kuehl commented 1 year ago

@wilfriedroset we upgraded to Mimir 2.7.2 two weeks back and now also run with the default value for align_queries_with_step. I still see the alert firing intermittently and could not observe an immediate change in behaviour. How did you come to the conclusion that this setting might have an effect?

wilfriedroset commented 1 year ago

We have been trying many things around continuous-test without any luck. The reasoning behind looking at align_queries_with_step was that the timestamps used by mimir-continuous-test might be slightly modified by Mimir if they are not properly aligned to the step. Reading the code, we can see that mimir-continuous-test already aligns them itself, so we did not expect any change from this setting; we just wanted to rule out the hypothesis.
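
For context, this is the alignment in question, sketched under the assumption that aligning simply snaps a timestamp down to a multiple of the step; the helper name is made up for the illustration and is not Mimir's function.

// Illustrative only: snap a timestamp down to the nearest multiple of the step.
// With align_queries_with_step enabled, the query-frontend applies this kind of
// rounding to the range query start/end; mimir-continuous-test already sends
// aligned timestamps, which is why the setting was not expected to matter.
package main

import "fmt"

func alignDown(tsMillis, stepMillis int64) int64 {
	return tsMillis - tsMillis%stepMillis
}

func main() {
	step := int64(20_000)            // 20s step, as in the failing queries above
	ts := int64(1_673_015_467_000)   // an arbitrary unaligned timestamp
	fmt.Println(alignDown(ts, step)) // prints 1673015460000, an aligned boundary like the start in the logs
}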

henry-kuehl commented 1 year ago

Any update from anyone on this one? Even with the current Mimir version we still see it failing intermittently.