grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0
25.04k stars 1.24k forks source link

Data race in event loop #2612

Closed inancgumus closed 2 years ago

inancgumus commented 2 years ago

Brief summary

  1. CI failed with a data race issue related to the EventLoop type.
  2. We can't run promises concurrently.

k6 version

0.39.0

OS

macOS 12.3.1

Docker version and image (if applicable)

No response

Steps to reproduce the problem

A similar problem can be reproduced like the following.

We want to wait for a navigation event using WaitForFrameNavigation. And simultaneously, we want to issue a Click. But, it doesn't work in this test 😒 We couldn't get both Click, and WaitForNavigation concurrently run.

We use a helper for promises. Here, we launch a goroutine to resolve/reject a promise and return the promise like this. Then we use it, for example, in Click, like this. Or, in WaitForNavigation, like this.

:+1: Surprisingly, it works in this test script.

Expected behaviour

We expect no data race issues when we want to run promises concurrently.

While trying to solve the issue, I noticed something in the event loop code. Shouldn't this e.queue be protected by the e.lock mutex as follows?

func (e *EventLoop) Start(firstCallback func() error) error {
    e.lock.Lock()
    e.queue = []func() error{firstCallback}
    e.lock.Unlock()
    ...
}     

Actual behaviour

Data race issues.

na-- commented 2 years ago

Can you provide an example that fails with a data race, but isn't encumbered by all of the xk6-browser complexity? I am completely out of the loop on what your architecture and concurrency models are, so it will take me ages to trace what's going on, sorry :disappointed:

Starting from the other direction though, EventLoop.Start(), is fairly straightforward: https://github.com/grafana/k6/blob/2cb1315424681c33c54158c31b52d68354aaefcf/js/eventloop/eventloop.go#L101-L107

I am almost (but not 100%) certain the problem is not in EventLoop.Start() and the fact that it doesn't have locking is actually a benefit in this case, since the data race there has likely surfaced another problem in your code :thinking: Consider how EventLoop.Start() is literally used in only 2 places in the whole k6 codebase, here in the VU init code and here: https://github.com/grafana/k6/blob/2cb1315424681c33c54158c31b52d68354aaefcf/js/runner.go#L793-L796

As you can see, firstCallback is literally the script iteration function, i.e. the default function or whatever you have specified in the scenario.exec property. The fact that your code races with this means that you probably somehow touched the EventLoop.queue in between iterations? Which in k6 is a big no-no, for now :sweat_smile: The task queue is supposed to be completely empty between iterations, or rather, an iteration ends when the queue is empty and no more tasks can be added to it: https://github.com/grafana/k6/blob/2cb1315424681c33c54158c31b52d68354aaefcf/js/eventloop/eventloop.go#L15-L21

While there might be bugs in this code, it's pretty fresh after all, I don't think the lack of locking around the queue initialization with firstCallback is one of them. I think it's more likely that you're doing something wrong, since it's very tricky to actually use the current low-level EventLoop primitives correctly in complicated scenarios :disappointed: We intend to improve this in the near future, see https://github.com/grafana/k6/issues/2544 for details. You can try using @mstoykov's prototype of the proposed solution in that issue as a normal Go lib in xk6-browser even now though, it's much harder to misuse than the current k6 low-level API and might simplify your life: https://github.com/MStoykov/k6-taskqueue-lib

na-- commented 2 years ago

I'll close this, since it was most likely triggered by a misuse of the current k6 event loop APIs, which I hope to have addressed in some Slack discussions with @inancgumus and https://github.com/grafana/k6/pull/2615. @imiric, @inancgumus, @ankur22, feel free to reopen it if you still encounter the issue even after changing how you handle Promises