wilfriedroset opened this issue 1 year ago
I see two different issues in the example logs.
sample at timestamp 1673248520000 (2023-01-09 07:15:20 +0000 UTC) has value -2076.414045 while was expecting -2079.116897
Both logs of this type fail the sample comparison on the last sample (the timestamp of the failed sample matches the query end parameter), but since we run the comparison from the last sample backwards and stop at the first mismatch, we don't know how many samples are actually different.
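For illustration, here is a minimal sketch of that kind of check (hypothetical names, not the actual mimir-continuous-test code): it compares from the newest sample backwards and returns on the first mismatch, which is why the log only ever describes the newest failing sample.

```go
package main

import (
	"fmt"
	"math"
)

// sample is a minimal (timestamp, value) pair used only for this sketch.
type sample struct {
	ts    int64 // milliseconds since epoch
	value float64
}

// verifyBackwards walks from the newest sample backwards and returns on the
// first mismatch, so the resulting error only ever describes the newest
// failing sample and we cannot tell how many samples actually differ.
func verifyBackwards(actual, expected []sample) error {
	if len(actual) != len(expected) {
		return fmt.Errorf("expected %d samples but got %d", len(expected), len(actual))
	}
	for i := len(actual) - 1; i >= 0; i-- {
		if actual[i].ts != expected[i].ts {
			return fmt.Errorf("sample at timestamp %d was expected to have timestamp %d", actual[i].ts, expected[i].ts)
		}
		if math.Abs(actual[i].value-expected[i].value) > 1e-9 {
			return fmt.Errorf("sample at timestamp %d has value %f while was expecting %f", actual[i].ts, actual[i].value, expected[i].value)
		}
	}
	return nil
}

func main() {
	actual := []sample{{1673248480000, -2070.0}, {1673248520000, -2076.414045}}
	expected := []sample{{1673248480000, -2071.0}, {1673248520000, -2079.116897}}
	// Only the newest mismatch (timestamp 1673248520000) is reported,
	// even though the older sample differs as well.
	fmt.Println(verifyBackwards(actual, expected))
}
```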
You suggested it could be something similar to https://github.com/grafana/mimir/issues/3764, but Mimir guarantees read-after-write consistency. Since the mimir-continuous-test tool first writes and then reads back, read-after-write should be guaranteed (barring bugs).
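As a rough sketch of that ordering (all names here are placeholders, not the tool's real API), the check conceptually writes a batch of samples first and only then queries them back:

```go
package main

import (
	"fmt"
	"time"
)

// writeSamples stands in for the remote-write step of the check.
func writeSamples(t time.Time) error {
	fmt.Println("writing samples up to", t.UTC())
	return nil
}

// readBackAndVerify stands in for the query + comparison step of the check.
func readBackAndVerify(from, to time.Time) error {
	fmt.Println("querying and verifying range", from.UTC(), "-", to.UTC())
	return nil
}

func main() {
	const writeInterval = 20 * time.Second
	start := time.Now().Truncate(writeInterval)
	for i := 0; i < 3; i++ {
		now := start.Add(time.Duration(i) * writeInterval)
		// Write first…
		if err := writeSamples(now); err != nil {
			continue
		}
		// …then read back: with read-after-write consistency, everything
		// written above must be visible to this query.
		if err := readBackAndVerify(start, now); err != nil {
			fmt.Println("verification failed:", err)
		}
	}
}
```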
Does it always happen for range queries (runRangeQueryAndVerifyResult) or for instant queries (runInstantQueryAndVerifyResult) as well?
sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)
This log means there's a 1-minute gap in the output samples. Again, this shouldn't happen.
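For reference, a hypothetical sketch of the gap check implied by that error message (not the actual implementation): every returned sample is expected to sit exactly one step before the sample that follows it, so with a 20s step a 60s gap means two samples are missing.

```go
package main

import (
	"fmt"
	"time"
)

// checkNoGaps verifies that each timestamp is exactly one step before the
// sample that follows it, scanning backwards like the real check does.
func checkNoGaps(timestamps []int64, step time.Duration) error {
	stepMs := step.Milliseconds()
	for i := len(timestamps) - 2; i >= 0; i-- {
		next := timestamps[i+1]
		expected := next - stepMs
		if timestamps[i] != expected {
			return fmt.Errorf(
				"sample at timestamp %d was expected to have timestamp %d because next sample has timestamp %d",
				timestamps[i], expected, next)
		}
	}
	return nil
}

func main() {
	// 20s step with a 1-minute hole between the 2nd and 3rd samples.
	ts := []int64{1673023620000, 1673023640000, 1673023700000, 1673023720000}
	fmt.Println(checkNoGaps(ts, 20*time.Second))
}
```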
Are you running the query-frontend results cache with any custom config?
Does it always happen for range queries (runRangeQueryAndVerifyResult) or instant queries (runInstantQueryAndVerifyResult) as well?
We also have occurrences with runInstantQueryAndVerifyResult, but not in the time range listed above.
ts=2023-01-22T09:30:07.834706559Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runInstantQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) ts=1674217920000 results_cache=false msg="Instant query result check failed" err="expected 1 series in the result but got 0"
Are you running the query-frontend results cache with any custom config?
frontend:
  align_queries_with_step: true
  cache_results: true
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 419430400
    tls_ca_path: /var/lib/puppet/ssl/certs/ca.pem
    tls_cert_path: [redacted]
    tls_enabled: true
    tls_insecure_skip_verify: true
    tls_key_path: [redacted]
    tls_min_version: VersionTLS13
  log_queries_longer_than: 5s
  parallelize_shardable_queries: true
  results_cache:
    backend: memcached
    compression: snappy
    memcached:
      addresses: dns+mimir-index.memcached.service.consul:11211
      max_get_multi_batch_size: 4096
      max_get_multi_concurrency: 10000
      max_item_size: 0
      timeout: 1s
  scheduler_address: query-scheduler.mimir-grpc.service.consul:9095
frontend_worker:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 209715200
    max_send_msg_size: 419430400
    tls_ca_path: /var/lib/puppet/ssl/certs/ca.pem
    tls_cert_path: [redacted]
    tls_enabled: true
    tls_insecure_skip_verify: true
    tls_key_path: [redacted]
    tls_min_version: VersionTLS13
  scheduler_address: query-scheduler.mimir-grpc.service.consul:9095
Commenting to say that we have been facing the same issue for some time now on our prod and dev environments. Curious where this goes.
Any updates on this one? Still failing for us intermittently.
I currently don't have the bandwidth to follow up on this. To further debug it, I would try to add more details to the log when the query fails, to better understand for how many samples the comparison fails (e.g. is it always the last one or not?).
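One possible shape for that extra detail (purely a suggestion, not an existing option of mimir-continuous-test): instead of returning on the first mismatch, count every mismatching sample and log the total together with whether the newest sample is among them.

```go
package main

import (
	"fmt"
	"math"
)

// sample is a minimal (timestamp, value) pair used only for this sketch.
type sample struct {
	ts    int64
	value float64
}

// countMismatches compares every sample instead of stopping at the first
// mismatch, returning how many samples differ and whether the newest one does.
func countMismatches(actual, expected []sample) (total int, lastMismatches bool) {
	n := len(actual)
	if len(expected) < n {
		n = len(expected)
	}
	for i := 0; i < n; i++ {
		if actual[i].ts != expected[i].ts || math.Abs(actual[i].value-expected[i].value) > 1e-9 {
			total++
			if i == n-1 {
				lastMismatches = true
			}
		}
	}
	return total, lastMismatches
}

func main() {
	actual := []sample{{1673248480000, 1.0}, {1673248500000, 2.0}, {1673248520000, -2076.414045}}
	expected := []sample{{1673248480000, 1.0}, {1673248500000, 2.0}, {1673248520000, -2079.116897}}
	total, last := countMismatches(actual, expected)
	fmt.Printf("mismatching samples: %d (newest sample mismatching: %v)\n", total, last)
}
```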
@pracucci did you have a chance to increase the logging output for this? Or shall I just increase the log level on our end to see what gets printed for the failure cases?
@henry-kuehl what is the value of align_queries_with_step in your case?
For us it is true, and we have found that Grafana Cloud is running with false. We are testing it now with false, but as the issue is quite sporadic for us, I'm not sure we'll have feedback soon.
@wilfriedroset we upgraded to Mimir 2.7.2 two weeks back and now also run with the default value for align_queries_with_step. I still see the alert firing intermittently and could not observe an immediate change in behaviour. How did you even come to the conclusion that this might have an effect?
We have been trying many things around continuous test without any luck.
The reasoning behind looking at align_queries_with_step was that the timestamps used by mimir-continuous-test might be slightly modified by Mimir if they were not properly aligned to the step. By reading the code we can see that mimir-continuous-test aligns its timestamps itself. We did not expect any change from this setting; we just wanted to rule out that hypothesis.
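For anyone following along, this is roughly the alignment in question (illustrative only): with align_queries_with_step enabled the query-frontend floors the range start/end to a multiple of the step, which is a no-op when the client already sends step-aligned timestamps.

```go
package main

import (
	"fmt"
	"time"
)

// alignToStep floors a millisecond timestamp to a multiple of the step.
func alignToStep(tsMs int64, step time.Duration) int64 {
	stepMs := step.Milliseconds()
	return (tsMs / stepMs) * stepMs
}

func main() {
	step := 20 * time.Second
	unaligned := int64(1673015467000)         // 7s past a 20s boundary
	fmt.Println(alignToStep(unaligned, step)) // 1673015460000
	// mimir-continuous-test already issues step-aligned timestamps,
	// so this rounding should be a no-op for its queries.
	aligned := int64(1673015460000)
	fmt.Println(alignToStep(aligned, step)) // unchanged: 1673015460000
}
```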
Any update from any parties on this one? Even on the current Mimir version we see it failing intermittently.
Describe the bug
Hello, we are using continuous-test with the mixin to monitor our Mimir clusters. I'm unsure how to troubleshoot the MimirContinuousTestFailed alert for the write-read-series check. We have had a couple of cases during the last few days (see log messages below).
The alerts come from different clusters monitored by different continuous-test instances, and apart from those there is nothing else in the continuous-test logs nor in Mimir's logs. If I had to guess, I would say this could be related to samples not yet being visible to the queriers, something like https://github.com/grafana/mimir/issues/3764
At this point I'm not sure this is a bug, and I would like to request guidance on how to troubleshoot such an issue.
To Reproduce
At this time I have not succeeded in reproducing the error.
Expected behavior
I would expect more messages from continuous-test or Mimir to help with troubleshooting.
Environment
Additional Context
ts=2023-01-06T21:25:42.87954582Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673015460000 end=1673030740000 step=20s results_cache=false msg="Range query result check failed" err="sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)"
ts=2023-01-06T21:25:42.853913842Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673015460000 end=1673030740000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673023640000 (2023-01-06 16:47:20 +0000 UTC) was expected to have timestamp 1673023680000 (2023-01-06 16:48:00 +0000 UTC) because next sample has timestamp 1673023700000 (2023-01-06 16:48:20 +0000 UTC)"
ts=2023-01-08T08:37:17.203808569Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673163420000 end=1673167020000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673167020000 (2023-01-08 08:37:00 +0000 UTC) has value -9505.809885 while was expecting -9510.565167"
ts=2023-01-09T07:15:32.350447487Z caller=spanlogger.go:80 test=write-read-series method=WriteReadSeriesTest.runRangeQueryAndVerifyResult level=warn query=sum(max_over_time(mimir_continuous_test_sine_wave[1s])) start=1673244920000 end=1673248520000 step=20s results_cache=true msg="Range query result check failed" err="sample at timestamp 1673248520000 (2023-01-09 07:15:20 +0000 UTC) has value -2076.414045 while was expecting -2079.116897"
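For context on the value-mismatch errors above, a heavily hedged sketch of the kind of comparison being done: the tool writes a deterministic sine wave, so it can recompute the expected value for any timestamp and compare it with what the query returned. The period, amplitude and tolerance below are placeholders, not the real constants used by mimir-continuous-test.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	sineWavePeriod    = 10 * time.Minute // placeholder period
	sineWaveAmplitude = 100.0            // placeholder amplitude
	maxComparisonErr  = 1e-9             // placeholder tolerance
)

// expectedSineWaveValue recomputes the deterministic value written at tsMs.
func expectedSineWaveValue(tsMs int64) float64 {
	radians := 2 * math.Pi * float64(tsMs) / float64(sineWavePeriod.Milliseconds())
	return sineWaveAmplitude * math.Sin(radians)
}

func main() {
	ts := int64(1673248520000)
	expected := expectedSineWaveValue(ts)
	actual := expected + 0.5 // simulate a returned value that drifted from the expectation
	if math.Abs(actual-expected) > maxComparisonErr {
		fmt.Printf("sample at timestamp %d has value %f while was expecting %f\n", ts, actual, expected)
	}
}
```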