grafana / xk6-browser

The browser module adds support for browser automation and end-to-end web testing via the Chrome Devtools Protocol to k6.
https://grafana.com/docs/k6/latest/javascript-api/k6-browser/
GNU Affero General Public License v3.0
343 stars 41 forks source link

Event loop state corruption(?) #510

Closed sniku closed 2 years ago

sniku commented 2 years ago

I'm struggling to come up with a good name for the issue because the underlying bug is not clear to me.

Version

Build using the latest main branch 3a1e01b.

➜  ~ xk6-browser4 version
k6 v0.39.0 ((devel), go1.18.5, linux/amd64)

Script

import { check } from 'k6';
import { Counter } from 'k6/metrics';
import { chromium } from 'k6/x/browser';

const checkpoint1 = new Counter('checkpoint1');
const checkpoint2 = new Counter('checkpoint2');
const checkpoint3 = new Counter('checkpoint3');
const checkpoint4 = new Counter('checkpoint4');
const checkpoint5 = new Counter('checkpoint5');
const checkpoint6 = new Counter('checkpoint6');
const checkpoint7 = new Counter('checkpoint7');

export let options = {
  vus: 1,
  duration: '50m'
}

export default function () {
  const browser = chromium.launch({
    headless: false,
    // executablePath: '/usr/bin/google-chrome-stable',
    executablePath: '/usr/bin/chromium',
  });
  const context = browser.newContext();
  const page = context.newPage();

  checkpoint1.add(1);

  page.goto('https://test.k6.io/', { waitUntil: 'networkidle' });

  checkpoint2.add(1);

  Promise.all([
    page.waitForNavigation(),
    page.locator('a[href="/my_messages.php"]').click(),
  ]).then(() => {
    checkpoint3.add(1)

    page.locator('input[name="login"]').type('admin');
    page.locator('input[name="password"]').type('123');

    checkpoint4.add(1)

    return Promise.all([
      page.waitForNavigation(),
      page.locator('input[type="submit"]').click(),
    ]);
  }).then(() => {
    checkpoint5.add(1)

    check(page, {
      'header': page.locator('h2').textContent() == 'Welcome, admin!',
    });
  }).finally(() => {
    checkpoint6.add(1)
    page.close();
    browser.close();
    checkpoint7.add(1)
  });
}

Result

On a local machine, the test usually runs correctly for a long time. In this case, it was running for about 40 min. In the cloud, the failure happens much quicker - usually within 1-2 min.

The browser behavior is surprising.

  1. The test keeps running. New browser windows are being opened as usual.
  2. The navigation to the first page page.goto('https://test.k6.io/', { waitUntil: 'networkidle' }); works fine.
  3. The browser closes immediately after that. The entire Promise.all([ block seems to be skipped.
  4. The following lines are printed.
    ERRO[2588] communicating with browser: websocket: close 1006 (abnormal closure): unexpected EOF  category=cdp elapsed="0 ms" goroutine=352118
    ERRO[2590] Uncaught (in promise) disposing browser context: disposing browser context ID 4440CA8EF7EB8FA34EC94204F4C82584: websocket: close 1006 (abnormal closure): unexpected EOF  executor=constant-vus scenario=default
    ERRO[2593] Uncaught (in promise) disposing browser context: disposing browser context ID 4440CA8EF7EB8FA34EC94204F4C82584: websocket: close 1006 (abnormal closure): unexpected EOF  executor=constant-vus scenario=default

From that point, none of the code in the Promise.all([ block is executed, but the test keeps running and repeatedly printing the ERRO lines shown above.

image ... image

mdcruz commented 2 years ago

I was also able to reproduce this with the following options:

export const options = {
  stages: [
    { duration: '10m', target: 30 },
    { duration: '10m', target: 30 },
    { duration: '10m', target: 5 }
  ],
  thresholds: {
    'http_req_duration{load_zone:amazon:us:ashburn}': ['p(95)<500'],
    'http_req_duration{load_zone:amazon:ie:dublin}': ['p(95)<800'],
  },
  ext: {
    loadimpact: {
      projectID: 3026,
      distribution: {
        distributionLabel1: { loadZone: 'amazon:us:ashburn', percent: 50 },
        distributionLabel2: { loadZone: 'amazon:ie:dublin', percent: 50 }
      },
      note: 'Marie test'
    }
  }
}
imiric commented 2 years ago

This is now fixed thanks to @mstoykov in https://github.com/grafana/k6/pull/2693. I tested with a 1h/10VU test run on staging, and Mihail did the same on production, and we didn't run into this issue anymore.

There's still an outstanding issue of fixing what triggers the Promise rejection to begin with, but the event loop will no longer be "corrupted" if it's unhandled, which is what this issue was about. We can open a new lower priority issue for the trigger when we run into it.