Prometheus remote write: timestamp too old

JasmineCA commented 3 months ago

Brief summary

Hello folks of k6,

I've been using k6 for some months and it's a great tool, even if we still lack useful features (but I'm sure they will come soon). We are using prometheus-rw feature to push metrics to a Grafana Mimir self-hosted instance. From time to time (actually very often), k6 tests returns this error: ERRO[0007] Failed to send the time series data to the endpoint error="got status code: 400 instead expected a 2xx successful status code" output="Prometheus remote write". When we check on the Mimir side, we can see logs like this: ts=2024-06-03T12:05:33.265292141Z caller=push.go:130 level=error user=***** msg="push error" err="rpc error: code = Code(400) desc = failed pushing to ingester: user=******: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 1970-01-01T00:00:00Z and is from series {__name__=\"k6_http_req_connecting_seconds\", expected_response=\"true\", method=\"POST\", name=\"*******", proto=\"HTTP/1.1\", scenario=\"create_merchant_with_store\", status=\"201\", test_suite_id=\"create-merchant-then-store-loadtest-it-2024-06-03-11:49:03\", testid=\"create_merchant_with_store-2024-06-03-11:49:07\", url=\"******}" (sanitized log)

It seems that the timestamp is not set when the metric is sent? You can see with the test_suite_id attribute that the test has been run this morning, so the timestamp Mimir got is indeed incorrect.

Do you have any input to avoid these kind of bugs?

Regards,

k6 version

0.50.0

OS

Windows under WSL

Docker version and image (if applicable)

No response

Steps to reproduce the problem

Deploys a self hosted Mimir (or maybe skip this part if it works with any prometheus server)
Runs a test with the experimental-prometheus-rw flag
Waits until you got unlucky and have errors pushing metrics

Expected behaviour

Metrics are sent 100% of the time

Actual behaviour

Metrics are not pushed randomly because of a timestamp error

Rbillon59 commented 3 months ago

Hello,

I faced the same issue, it was because I had multiple k6 instance running in parallel and the generated prometheus metrics had the same label (same cardinalities).

The solution I found was to add a tag scenario (could be any tag, it just need to be unique among each k6 instance running), so the generated prometheus metric can.

Like :

k6 run --log-format json --no-summary --quiet "my-scenario.js" -o experimental-prometheus-rw --tag scenario=my-scenario

JasmineCA commented 3 months ago

Hello,

Thank you for your comment. It might become useful for us once we will upgrade to multiple k6 instances running the same scenario. Unfortunately, I face this issue with only one k6 instance running at the same time. Even if they were multiple instances, I have a tag built from a timestamp so the cardinality would be different. But it's a strange issue you faced, because I though prometheus metrics would be a way to reunite metrics of separate k6 instances running the same scenario.

codebien commented 3 months ago

Hey @JasmineCA, are you running k6 in a configuration with multiple instances as @Rbillon59 mentioned? For example, running k6-operator?

codebien commented 3 months ago

prometheus metrics would be a way to reunite metrics of separate k6 instances running the same scenario.

Yes, but it generally does by applying an instance label and this is what we suggest also for k6 with multiple instances configuration.

The affected sample has timestamp 1970-01-01T00:00:00Z

Btw, I see only now that the wrong timestamp is 1970-... which means time zero for Go. There is a high chance that this is a bug on our side.

Hopefully, I will find the time to fix it at the begin of the next week.

JasmineCA commented 2 months ago

Hello @codebien ,

Did you have the time to check where in the code this bug could be?

grafana / k6