grafana / xk6-browser

The browser module adds support for browser automation and end-to-end web testing via the Chrome Devtools Protocol to k6.
https://grafana.com/docs/k6/latest/javascript-api/k6-browser/
GNU Affero General Public License v3.0

Push metrics to Cloud with correct timestamps #533

Closed imiric closed 2 years ago

imiric commented 2 years ago

On current main (cf033cb), running the following script:

script.js

```javascript
import { check } from 'k6';
import { chromium } from 'k6/x/browser';

export const options = {
  vus: 1,
  duration: '60s',
}

export default function () {
  const browser = chromium.launch({
    headless: true,
  });
  const context = browser.newContext();
  const page = context.newPage();

  // Goto front page, find login link and click it
  page.goto('https://test.k6.io/', { waitUntil: 'networkidle' });
  Promise.all([
    page.waitForNavigation(),
    page.locator('a[href="/my_messages.php"]').click(),
  ]).then(() => {
    // Enter login credentials and login
    page.locator('input[name="login"]').type('admin');
    page.locator('input[name="password"]').type('123');
    // We expect the form submission to trigger a navigation, so to prevent a
    // race condition, setup a waiter concurrently while waiting for the click
    // to resolve.
    return Promise.all([
      page.waitForNavigation(),
      page.locator('input[type="submit"]').click(),
    ]);
  }).then(() => {
    check(page, {
      'header': page.locator('h2').textContent() == 'Welcome, admin!',
    });
  }).finally(() => {
    page.close();
    browser.close();
  });
}
```

... and sending the metrics to the Cloud with:

xk6-browser run --output cloud script.js

Results in the test run ending with a Timed out status, not showing any metric results (the metrics do appear while the test is running, but disappear when it finishes):

(screenshot: cloud test run showing the Timed out status and no metric results)

And a strange "Started" timestamp that is ~11 days in the past in this example:

(screenshot: test run details showing the wrong "Started" timestamp)

See test run 140998 on staging for details.

After further troubleshooting it seems that some metric samples are being sent with this wrong timestamp, which confuses the backend. I added some debug print statements to k6.output.cloud.MetricsClient.PushMetric():

patch

```diff
diff --git a/output/cloud/metrics_client.go b/output/cloud/metrics_client.go
index 6aef2076a..4faa66c4b 100644
--- a/output/cloud/metrics_client.go
+++ b/output/cloud/metrics_client.go
@@ -48,6 +48,18 @@ func (mc *MetricsClient) PushMetric(referenceID string, s []*Sample) error {
 	start := time.Now()
 	url := fmt.Sprintf("%s/v1/metrics/%s", mc.host, referenceID)
 
+	for _, sample := range s {
+		var ts int64
+		switch d := sample.Data.(type) {
+		case *SampleDataSingle:
+			ts = d.Time
+		case *SampleDataMap:
+			ts = d.Time
+		}
+		fmt.Printf("pushing %q sample to cloud with timestamp %s\n",
+			sample.Metric, time.Unix(0, int64(ts*1000)))
+
+	}
 	jsonStart := time.Now()
 	b, err := easyjson.Marshal(samples(s))
 	if err != nil {
```

Which shows some of the wrong timestamps:

pushing "vus" sample to cloud with timestamp 2022-09-14 15:14:47.180698 +0200 CEST
pushing "vus_max" sample to cloud with timestamp 2022-09-14 15:14:47.180698 +0200 CEST
pushing "data_sent" sample to cloud with timestamp 2022-09-03 06:10:11.780496 +0200 CEST
pushing "http_reqs" sample to cloud with timestamp 2022-09-03 06:10:11.899911 +0200 CEST
pushing "http_req_duration" sample to cloud with timestamp 2022-09-03 06:10:11.899911 +0200 CEST
pushing "data_received" sample to cloud with timestamp 2022-09-03 06:10:11.899911 +0200 CEST
pushing "http_req_connecting" sample to cloud with timestamp 2022-09-03 06:10:11.899911 +0200 CEST
pushing "http_req_tls_handshaking" sample to cloud with timestamp 2022-09-03 06:10:11.899911 +0200 CEST
pushing "http_req_sending" sample to cloud with timestamp 2022-09-03 06:10:11.899911 +0200 CEST
pushing "http_req_receiving" sample to cloud with timestamp 2022-09-03 06:10:11.899911 +0200 CEST
pushing "data_sent" sample to cloud with timestamp 2022-09-03 06:10:11.780496 +0200 CEST
pushing "data_sent" sample to cloud with timestamp 2022-09-03 06:10:11.899911 +0200 CEST
pushing "browser_first_meaningful_paint" sample to cloud with timestamp 2022-09-14 15:14:47.493954 +0200 CEST
pushing "browser_loaded" sample to cloud with timestamp 2022-09-14 15:14:47.512931 +0200 CEST

Notice that vus and vus_max are correct, but http_req* ones have a timestamp in the past.

This is likely due to a wrong integer conversion in these metric emissions.
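For reference, the debug print in the patch above treats `d.Time` as microseconds since the Unix epoch, hence the `*1000` scaling to nanoseconds. A minimal sketch of that conversion, fed an illustrative value matching one of the correct timestamps in the output:

```go
package main

import (
	"fmt"
	"time"
)

// microsToTime mirrors the conversion used in the debug patch: a
// microsecond-precision Unix timestamp becomes a time.Time by scaling
// it to nanoseconds.
func microsToTime(ts int64) time.Time {
	return time.Unix(0, ts*1000)
}

func main() {
	// 2022-09-14 15:14:47.180698 CEST expressed in microseconds (UTC+2).
	ts := int64(1663161287180698)
	fmt.Println(microsToTime(ts).UTC())
	// → 2022-09-14 13:14:47.180698 +0000 UTC
}
```

A sample whose `Time` field holds a monotonic (since-boot) value instead of a real epoch value would come out days or weeks in the past under this same conversion, which matches the symptom above.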

mdcruz commented 2 years ago
(screenshot: local cloud-output test run with a wrong timestamp)

Some more information that might be useful: when I ran a test locally and pushed the output to the cloud, the timestamp was also wrong in this view.

inancgumus commented 2 years ago

I can reproduce the issue locally. There are problems with NetworkManager request/response metric emissions and the way CDP handles timestamps. Using the test script @mdcruz shared, I see the request/response metrics are four days behind on my machine.

onResponseReceived == 2022-09-16 19:56:13.236125 +0300 +03
response.timestamp   =2022-09-16 19:56:13.236125 +0300 +03
emitResponseMetrics   2022-09-16 19:56:13.236125 +0300 +03
emitResponseMetrics 2 2022-09-16 19:56:13.236125 +0300 +03

handleRequestRedirect == 2022-09-16 19:56:15.219125 +0300 +03
   response.timestamp   =2022-09-16 19:56:15.219125 +0300 +03
   emitResponseMetrics   2022-09-16 19:56:15.219125 +0300 +03
   emitResponseMetrics 2 2022-09-16 19:56:15.219125 +0300 +03

onResponseReceived == 2022-09-16 19:56:15.359375 +0300 +03
response.timestamp   =2022-09-16 19:56:15.359375 +0300 +03
emitResponseMetrics   2022-09-16 19:56:15.359375 +0300 +03
emitResponseMetrics 2 2022-09-16 19:56:15.359375 +0300 +03

Why are CDP request/response timestamps not correct?

The cdp package (and the CDP protocol) uses monotonic time when calculating timestamps, deriving its epoch from the local machine's last boot time. That's why the time skews appear.

onRequest            : 2022-09-17 00:30:56.545 +0300 +03    -> CDP.Timestamp
monotonic epoch      : 2022-09-15 14:54:29 +0300            -> BOOT TIME
timestamp            : 2022-09-17 00:27:10.467875 +0300     -> CDP TIMESTAMP
walltime             : 2022-09-20 16:38:21.063684096 +0300  -> CDP WALLTIME
time.Now             : 2022-09-20 16:38:21.063926 +0300

PS: Some of them are now 09-17 because the time flies (meaning I've been working on this for the whole day)

Solution

Since the cdp package uses monotonic time in timestamp fields, we need to adjust the timestamp with the monotonic difference.

// walltime returns the machine's current time when the request is first received
off := event.walltime - event.timestamp
event.timestamp += off

Note: See this to understand why the above calculation works.

With this change, the timestamps become accurate. However, there is not much difference between calculating the offset using the above method and simply getting it from the wall time (see the "Why are CDP request/response timestamps not correct?" output above).

The only difference is that the wall time is not monotonic, so it's sensitive to machine clock changes (maybe not 🤷, since Chromium calculates the current time using its monotonic clock, see this; that clock is what the cdp protocol uses, including for the walltime field).

I'll continue working on this tomorrow.

@imiric @ankur22 What do you think about calculating the offset vs. simply using the walltime field?

ankur22 commented 2 years ago

@imiric @ankur22 What do you think about calculating the offset vs. simply using the walltime field?

Either one is as good as the other, since both solutions use the walltime and so both are affected by any machine clock changes, AFAICT.

Maybe the delta solution is "better". When using walltime to calculate the delta, would it be calculated once (I assume at the start of the test) and then reused throughout the test?

inancgumus commented 2 years ago

@ankur22 That's why we're calculating the offset using the monotonic time: to adjust the timestamp for machine clock changes. Since walltime and timestamp are per-request, they are consistent relative to each other.

When using walltime to calculate the delta, would it be calculated once (i assume at the start of the test) and it then reused throughout the test?

The walltime offset calculation is done per request so that it's specific to a single request (and its response). That's how it will prevent time skews.

How frequently will the local time on a cloud instance change? I guess super infrequently. So we could simply use the walltime field (without calculating the offset), and I think it would be enough for 99% of cases.

WDYT?

ankur22 commented 2 years ago

How frequently a local time on a cloud instance will change? I guess it will be super infrequent. So we can simply use the walltime field (without calculating the offset), and I think it will be enough for 99% of cases.

I'm happy with this approach. See what @imiric thinks.

imiric commented 2 years ago

Nice work digging into this @inancgumus :clap: I wasn't aware CDP events had both a walltime and monotonic timestamp.

What do you think about calculating the offset vs. simply using the walltime field?

I'm confused by the need to calculate using the offset to arrive at the correct time. Is there a benefit of doing that vs. just using the walltime field directly?

It seems to me like you'd always get the same result. I.e.:

c = a - b
b += c
b == a

So we might just use event.walltime without doing this (AFAICS needless) calculation.

From what I understand, monotonic time is useful for calculating intervals between samples with greater precision. So if we need to calculate the elapsed time between two events, we should use the monotonic timestamp. Because the metric samples we send to the backend require a Unix timestamp, we have no choice but to use the wall time, which depends on the system clock. I'm not sure about all the ways the backend takes this into account, but what was messing things up here is that the timestamps were inconsistent within a single test run. So as long as that's fixed, we should be fine with using event.walltime directly.

BTW,

How frequently a local time on a cloud instance will change?

k6 run --output cloud doesn't use a Cloud instance like k6 cloud does. It runs the test on your local machine, and uses the cloud output to send the metrics to the Cloud. I doubt that the backend does any time transformations on the received timestamps, or that the time on the ingest machines would affect this. I think they just take the received timestamps as is and extract things like test creation and start time based on what they receive. We can always confirm with the backend team if needed.

inancgumus commented 2 years ago

Thanks, Ivan 😄

I'm confused by the need to calculate using the offset to arrive at the correct time. Is there a benefit of doing that vs. just using the walltime field directly?

We will calculate the offset once in NetworkManager.onRequest, and then we'll add it to resp.timestamp in emitResponseMetrics. So event.timestamp in the above calculation is not the response's timestamp.

off := onRequestEvent.walltime - onRequestEvent.timestamp
// later in the request/response chain
response.timestamp += off

...monotonic time is useful for calculating intervals between samples with greater precision.

Monotonic time is for preventing time skew when measuring the elapsed time between two points in time. And the timestamp fields in Request/Response are monotonic (measured since the boot time of the machine). Only Request.Walltime is not monotonic.

k6 run --output cloud doesn't use a Cloud instance like k6 cloud does. It runs the test on your local machine, and uses the cloud output to send the metrics to the Cloud.

That's a good point. So if we use walltime directly, time skew can happen, and we need to take the offset into account.

Thinking aloud: Go also uses monotonic time, and time is accurately calculated even if a time skew happens. That means we can create a Time value in the onRequest handler using either the Walltime field or time.Now(), calculate the offset there, and then add the offset in the Response metric emission.

Anyway, I'll look more into this and come up with a solution. Thanks for your input 🙇

inancgumus commented 2 years ago

I ran some tests on staging, and the test ID of the last one is 141146. The results look promising.