Closed — imiric closed this issue 2 years ago
Some more information that might be useful: when I ran a test locally and pushed the output to the cloud, you can see that the timestamp is also wrong in this view.
I can reproduce the issue locally. There are problems with the NetworkManager request/response metric emissions and with the way CDP handles timestamps. Using the test script @mdcruz shared, I see the request/response metrics are four days behind on my machine.
```
onResponseReceived == 2022-09-16 19:56:13.236125 +0300 +03
response.timestamp =2022-09-16 19:56:13.236125 +0300 +03
emitResponseMetrics 2022-09-16 19:56:13.236125 +0300 +03
emitResponseMetrics 2 2022-09-16 19:56:13.236125 +0300 +03
handleRequestRedirect == 2022-09-16 19:56:15.219125 +0300 +03
response.timestamp =2022-09-16 19:56:15.219125 +0300 +03
emitResponseMetrics 2022-09-16 19:56:15.219125 +0300 +03
emitResponseMetrics 2 2022-09-16 19:56:15.219125 +0300 +03
onResponseReceived == 2022-09-16 19:56:15.359375 +0300 +03
response.timestamp =2022-09-16 19:56:15.359375 +0300 +03
emitResponseMetrics 2022-09-16 19:56:15.359375 +0300 +03
emitResponseMetrics 2 2022-09-16 19:56:15.359375 +0300 +03
```
**Why are CDP request/response timestamps not correct?**
The cdp package (and the CDP protocol) uses monotonic time when calculating timestamps, and the cdp package anchors that monotonic reading at the local machine's last boot time. That's why there are time skews.
```
onRequest       : 2022-09-17 00:30:56.545 +0300 +03       -> CDP.Timestamp
monotonic epoch : 2022-09-15 14:54:29 +0300               -> BOOT TIME
timestamp       : 2022-09-17 00:27:10.467875 +0300        -> CDP TIMESTAMP
walltime        : 2022-09-20 16:38:21.063684096 +0300     -> CDP WALLTIME
time.Now        : 2022-09-20 16:38:21.063926 +0300
```
PS: Some of them now say 09-17 because time flies (meaning I've been working on this for the whole day).
**Solution**
Since the cdp package uses monotonic time in timestamp fields, we need to adjust the timestamp by the monotonic difference:
```go
// walltime returns the machine's current time when the request is first received
off := event.walltime - event.timestamp
event.timestamp += off
```
Note: See this to understand why the above calculation works.
With this change, the timestamps become accurate. However, there is not much difference between calculating the offset using the above method and simply getting it from the wall time (see the "Why are CDP request/response timestamps not correct?" output above). The only difference is that the wall time is not monotonic, so it's sensitive to machine clock changes (or maybe not 🤷, because Chromium calculates the current time using its monotonic clock, see this — that clock is the one used by the CDP protocol, including for the `walltime` field).
I'll continue working on this tomorrow.
@imiric @ankur22 What do you think about calculating the offset vs. simply using the walltime field?
> @imiric @ankur22 What do you think about calculating the offset vs. simply using the walltime field?
Either is as good as the other, since both solutions use the `walltime` and so both are affected by any machine clock changes, AFAICT. Maybe the delta solution is "better". When using `walltime` to calculate the delta, would it be calculated once (I assume at the start of the test) and then reused throughout the test?
@ankur22 That's why we're calculating the offset using the monotonic time: to adjust the timestamp for machine clock changes. Since `walltime` and `timestamp` are per-request, they are consistent relative to each other.
> When using walltime to calculate the delta, would it be calculated once (I assume at the start of the test) and then reused throughout the test?
The walltime offset calculation is done per request, so it's specific to a single request (and its response). That's how it prevents time skews.
How frequently will the local time on a cloud instance change? I guess it will be super infrequent. So we can simply use the `walltime` field (without calculating the offset), and I think that will be enough for 99% of cases.
WDYT?
> How frequently will the local time on a cloud instance change? I guess it will be super infrequent. So we can simply use the walltime field (without calculating the offset), and I think that will be enough for 99% of cases.
I'm happy with this approach. Let's see what @imiric thinks.
Nice work digging into this @inancgumus :clap: I wasn't aware CDP events had both a walltime and monotonic timestamp.
> What do you think about calculating the offset vs. simply using the walltime field?
I'm confused by the need to calculate using the offset to arrive at the correct time. Is there a benefit to doing that vs. just using the `walltime` field directly?
It seems to me like you'd always get the same result, i.e.:

```
c = a - b
b += c
b == a
```
So we might just use `event.walltime` without doing this, AFAICS needless, calculation.
From what I understand, monotonic time is useful for calculating intervals between samples with greater precision. So if we need to calculate the elapsed time between two events, we should use the monotonic timestamp. Because the metric samples we send to the backend require a Unix timestamp, we have no choice but to use the wall time, which depends on the system clock. I'm not sure about all the ways the backend takes this into account, but what was messing things up here is that the timestamps were inconsistent within a single test run. So as long as that's fixed, we should be fine with using `event.walltime` directly.
BTW,
> How frequently will the local time on a cloud instance change?
`k6 run --output cloud` doesn't use a Cloud instance like `k6 cloud` does. It runs the test on your local machine, and uses the `cloud` output to send the metrics to the Cloud. I doubt that the backend does any time transformations on the received timestamps, or that the time on the ingest machines would affect this. I think they just take the received timestamps as is and extract things like test creation and start time based on what they receive. We can always confirm with the backend team if needed.
Thanks, Ivan 😄
> I'm confused by the need to calculate using the offset to arrive at the correct time. Is there a benefit to doing that vs. just using the walltime field directly?
We will calculate `offset` once in `NetworkManager.onRequest`, and then we'll add it to `resp.timestamp` in `emitResponseMetrics`. So `event.timestamp` in the above calculation is not the response's timestamp.
```go
off := onRequestEvent.walltime - onRequestEvent.timestamp
// later in the request/response chain
response.timestamp += off
```
> ...monotonic time is useful for calculating intervals between samples with greater precision.
Monotonic time is for preventing time skew when measuring the elapsed time between two instants. And the `timestamp` fields in `Request`/`Response` are monotonic (measured since the machine's boot time). Only `Request.Walltime` is not monotonic.
> k6 run --output cloud doesn't use a Cloud instance like k6 cloud does. It runs the test on your local machine, and uses the cloud output to send the metrics to the Cloud.
That's a good point. So if we directly use `walltime`, time skew can happen, and we need to take the `offset` into account.
Thinking aloud: Go also uses monotonic time, and time is accurately calculated even if a time skew happens. That means we can create a `Time` value in the `onRequest` handler using either the `Walltime` field or `time.Now()`, calculate the `offset` there, and then add the `offset` in the `Response` metric emission.
Anyway, I'll look more into this and come up with a solution. Thanks for your input 🙇
I ran some tests on staging, and the test ID of the last one is 141146. The results look promising.
On current `main` (cf033cb), running the following script (`script.js`):

```javascript
import { check } from 'k6';
import { chromium } from 'k6/x/browser';

export const options = {
  vus: 1,
  duration: '60s',
}

export default function () {
  const browser = chromium.launch({
    headless: true,
  });
  const context = browser.newContext();
  const page = context.newPage();

  // Goto front page, find login link and click it
  page.goto('https://test.k6.io/', { waitUntil: 'networkidle' });
  Promise.all([
    page.waitForNavigation(),
    page.locator('a[href="/my_messages.php"]').click(),
  ]).then(() => {
    // Enter login credentials and login
    page.locator('input[name="login"]').type('admin');
    page.locator('input[name="password"]').type('123');

    // We expect the form submission to trigger a navigation, so to prevent a
    // race condition, setup a waiter concurrently while waiting for the click
    // to resolve.
    return Promise.all([
      page.waitForNavigation(),
      page.locator('input[type="submit"]').click(),
    ]);
  }).then(() => {
    check(page, {
      'header': page.locator('h2').textContent() == 'Welcome, admin!',
    });
  }).finally(() => {
    page.close();
    browser.close();
  });
}
```

... and sending the metrics to the Cloud with:
Results in the test run ending with a `Timed out` status, not showing any metric results (the metrics do appear while the test is running, but disappear when it finishes), and with a strange "Started" timestamp that is ~11 days in the past in this example.
See test run 140998 on staging for details.
After further troubleshooting, it seems that some metric samples are being sent with this wrong timestamp, which confuses the backend. I added some debug print statements to `k6.output.cloud.MetricsClient.PushMetric()`:

```diff
diff --git a/output/cloud/metrics_client.go b/output/cloud/metrics_client.go
index 6aef2076a..4faa66c4b 100644
--- a/output/cloud/metrics_client.go
+++ b/output/cloud/metrics_client.go
@@ -48,6 +48,18 @@ func (mc *MetricsClient) PushMetric(referenceID string, s []*Sample) error {
 	start := time.Now()
 	url := fmt.Sprintf("%s/v1/metrics/%s", mc.host, referenceID)
 
+	for _, sample := range s {
+		var ts int64
+		switch d := sample.Data.(type) {
+		case *SampleDataSingle:
+			ts = d.Time
+		case *SampleDataMap:
+			ts = d.Time
+		}
+		fmt.Printf("pushing %q sample to cloud with timestamp %s\n",
+			sample.Metric, time.Unix(0, int64(ts*1000)))
+
+	}
 	jsonStart := time.Now()
 	b, err := easyjson.Marshal(samples(s))
 	if err != nil {
```

Which shows some of the wrong timestamps:
Notice that `vus` and `vus_max` are correct, but the `http_req*` ones have a timestamp in the past. This is likely some wrong int conversion for these metric emissions.