libp2p / js-libp2p

The JavaScript implementation of the libp2p networking stack.
https://libp2p.io

Race Condition on opening a Stream over a DataChannel #2443

Open justin0mcateer opened 4 months ago

justin0mcateer commented 4 months ago

Severity:

High

Description:

DataChannels open but Streams time out and fail. The main symptom is that we end up with 'one-way' Streams, where the incoming Stream was opened successfully but the outgoing Stream aborts.

Steps to reproduce the error:

The timing window for the race condition is extremely small (~4 ms), so the error only occurs when the time required to open a DataChannel falls within a very narrow range. We typically see it when trying to establish connections in a LAN environment, and the exact timing varies slightly with the operating system and browser. Changing browsers may fix the problem, and enabling a VPN almost always does (due to the added latency). The problem seems to occur with much greater frequency with Chrome on MacOS.

We only began seeing this problem when upgrading from 0.46 to 1.1 (now 1.2). The problem seems to have been introduced in the refactoring done here: @achingbrain https://github.com/libp2p/js-libp2p/pull/2237

What we are seeing is a very small delay between checking the current state of the DataChannel and pEvent actually registering the event handler to capture the 'open' event. If the timing of the DataChannel open event is just right, pEvent never receives the event and throws on timeout. https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L228-L230

Catching the timeout error, checking again, and proceeding if the DataChannel is in fact open works around the issue.

Debugging log points have been added; the attached log output screenshot illustrates the issue.

If pEvent had caught the 'open' event, this log message would be printed, but it is not: https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L231

It seems that, due to the way pEvent is implemented, there is a tiny window of time between when the pEvent function is called and when 'addEventListener' is actually called. A custom function that periodically polls, in addition to capturing the event, may be required, since the timing is so tight in this case.
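A minimal sketch of such a poll-plus-listen helper (hypothetical code, not part of js-libp2p; the `waitForOpen` name and its options are made up for illustration):

```javascript
// Hypothetical helper illustrating the suggestion above: resolve when the
// channel reports 'open', whether that is observed via the 'open' event or
// via periodic polling of readyState. Not part of js-libp2p.
function waitForOpen (channel, { timeout = 5000, pollInterval = 50 } = {}) {
  return new Promise((resolve, reject) => {
    if (channel.readyState === 'open') {
      resolve()
      return
    }

    let timer
    let poller

    const cleanup = () => {
      clearTimeout(timer)
      clearInterval(poller)
      channel.removeEventListener('open', onOpen)
    }

    const onOpen = () => {
      cleanup()
      resolve()
    }

    // the listener is registered synchronously, before any await
    channel.addEventListener('open', onOpen)

    // poll readyState as a safety net in case the event is somehow missed
    poller = setInterval(() => {
      if (channel.readyState === 'open') {
        onOpen()
      }
    }, pollInterval)

    timer = setTimeout(() => {
      cleanup()
      reject(new Error(`timed out waiting for "open", state is "${channel.readyState}"`))
    }, timeout)
  })
}
```

Even if the event is never delivered, the poller observes the state change, so the stream no longer depends solely on catching the event.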

justin0mcateer commented 4 months ago

Upon reviewing the pEvent code in more detail and thinking about it a little more, I believe this issue can only occur with events emitted from native code, such as the DataChannel event the WebRTC transport library is trying to await. Otherwise the pEvent code is effectively synchronous, so this would not happen with an event emitted from JavaScript code.

justin0mcateer commented 1 month ago

Just to follow up on this: we added a small patch to address it. The problem was happening fairly regularly, and we haven't seen it any more since applying the patch.

Basically, we shortened the timeout, and if pEvent times out we simply check the channel state again. This second check covers the case where pEvent misses the 'open' event from the DataChannel.

I think a better long-term solution here might be to poll the state periodically in addition to listening for the event.

```diff
+++ b/dist/src/stream.js
@@ -38,7 +38,7 @@ export const FIN_ACK_TIMEOUT = 5000;
  * When sending data messages, if the channel is not in the "open" state, wait
  * this long for the "open" event to fire.
  */
-export const OPEN_TIMEOUT = 5000;
+export const OPEN_TIMEOUT = 500;
 export class WebRTCStream extends AbstractStream {
     /**
      * The data channel used to send and receive data
@@ -177,7 +177,15 @@ export class WebRTCStream extends AbstractStream {
         }
         if (this.channel.readyState !== 'open') {
             this.log('channel state is "%s" and not "open", waiting for "open" event before sending data', this.channel.readyState);
-            await pEvent(this.channel, 'open', { timeout: this.openTimeout });
+            try {
+                await pEvent(this.channel, 'open', { timeout: this.openTimeout });
+            } catch (err) {
+                // the event may have been missed - check whether the channel is open anyway
+                if (this.channel.readyState !== 'open') {
+                    this.log.error('channel state is still "%s" after waiting for "open" event', this.channel.readyState);
+                    throw err;
+                }
+            }
             this.log('channel state is now "%s", sending data', this.channel.readyState);
         }
         // send message without copying data
```
justin0mcateer commented 1 month ago

Would using 'Promise.all' across pWaitFor and pEvent be an acceptable solution? If so, we would be happy to submit a PR.
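For what it's worth, if the goal is to proceed on whichever signal arrives first, the combinator would be `Promise.any` or `Promise.race` rather than `Promise.all` (which waits for both). A self-contained sketch of the idea, with minimal inline stand-ins for `p-wait-for` and `p-event` (the real npm packages offer more options, and the `openedChannel` wrapper is hypothetical):

```javascript
// Minimal stand-ins for p-wait-for and p-event, inlined so the sketch is
// self-contained; the real npm packages offer richer options.
const pWaitFor = (condition, { interval = 20 } = {}) =>
  new Promise(resolve => {
    const poll = setInterval(() => {
      if (condition()) {
        clearInterval(poll)
        resolve()
      }
    }, interval)
  })

const pEvent = (target, name) =>
  new Promise(resolve => {
    target.addEventListener(name, resolve, { once: true })
  })

// Proceed as soon as EITHER the 'open' event fires OR polling observes the
// open state; a timeout is raced in separately so a stuck channel still errors.
async function openedChannel (channel, timeout = 5000) {
  let timer
  const timedOut = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timed out waiting for "open"')), timeout)
  })

  try {
    await Promise.race([
      Promise.any([
        pEvent(channel, 'open'),
        pWaitFor(() => channel.readyState === 'open')
      ]),
      timedOut
    ])
  } finally {
    clearTimeout(timer)
  }
}
```

`Promise.any` takes the first fulfillment, so a missed 'open' event no longer wedges the stream as long as the polling branch can observe the state.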

achingbrain commented 1 month ago

Sorry for the delay in getting to this, it does look like very strange behaviour.

Looking through the p-event code, it adds the event listeners synchronously, as you've found. Any callbacks invoked from native code would be added to the callback queue; they can't interrupt the current event loop tick, since JS is run-to-completion.

But you're seeing some behaviour that's hard to explain, so maybe something is causing the current tick to end.

I'd like to dig into this further - do you have a way to replicate the problem?

justin0mcateer commented 1 month ago

It is a race condition that happens only if the open is neither too fast nor too slow, probably in a window of about 4 to 8 milliseconds based on my testing.

We aren't doing anything special, we are using Circuit-Relay and the WebRTCTransport is managing opening the connections.

The only variable that affects the occurrence is the latency of the 'open' event. However, because creating the DataChannel is handled entirely inside libp2p, I don't see an easy way to inject a delay to trigger the race without either modifying the WebRTCTransport implementation or doing it at the network level.

We have some tests that use 'tc' on Linux to inject packet latency and packet loss between two nodes, but they require a fair amount of setup code building virtual interfaces and bridges to allow controlling the network behavior.

achingbrain commented 1 month ago

4-8ms is an eternity for synchronous code so there's definitely something weird happening.

I've tried to create a repro from your description here https://github.com/achingbrain/libp2p-webrtc-ping

I'm having problems replicating with Chrome*. Can you please take a look and tell me what I need to change or add to be able to replicate the issue?

It shouldn't be necessary to modify libp2p internals to replicate the issue - we should be able to do it with external changes only. Since we might want to modify libp2p to fix the issue, we can't introduce new code there that creates the bug, because to "fix" it we'd just delete the code we just added 😅.


* = For Firefox I came across this, which is not great, but we'll see what the devs say.

justin0mcateer commented 1 month ago

I agree, code should not need to be added to reproduce the issue. However, since the condition only manifests within a small range of open-event latency, reliably triggering it is not possible without being able to control that latency.

When you run two nodes on localhost, the latency will be so low that the DataChannel is already in the open state by the time the first check happens, so pEvent will never be called.

You'll have to run the listener and the sender on two different machines in the same LAN. What we found was that certain network and browser combinations would never see this issue, for example connecting over the Internet or a VPN, while other configurations would see it a non-trivial percentage of the time.

We found that the problem seemed to occur more frequently when one of the peers was running Chrome on MacOS; the open latency seems to be slightly higher on MacOS.

A more deterministic option would be to create two 'veth' interfaces, one for the listener and one for the sender, and use 'tc' to add latency to the network path.
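A sketch of such a rig on Linux (all names and delay values here are illustrative, not from our actual test setup, and every command needs root):

```shell
# Two network namespaces joined by a veth pair: one side plays the
# listener, the other the sender. Names are illustrative.
ip netns add listener
ip netns add sender
ip link add veth-l type veth peer name veth-s
ip link set veth-l netns listener
ip link set veth-s netns sender

ip netns exec listener ip addr add 10.0.0.1/24 dev veth-l
ip netns exec sender   ip addr add 10.0.0.2/24 dev veth-s
ip netns exec listener ip link set veth-l up
ip netns exec sender   ip link set veth-s up

# Inject a fixed delay (plus a little jitter) on the sender side so the
# DataChannel open latency can be swept through the suspected 4-8 ms window.
ip netns exec sender tc qdisc add dev veth-s root netem delay 4ms 2ms

# Later, adjust the delay without tearing down the rig.
ip netns exec sender tc qdisc change dev veth-s root netem delay 8ms
```

Sweeping the netem delay in small steps should let a test walk the open latency through the window where the race is suspected to occur.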

I appreciate your interest in this. It is not an easy one, I think. I would be happy to discuss this in real time if it is helpful. You can reach me directly at domain ryzyliant dot com, user justin dot mcateer.

achingbrain commented 4 weeks ago

I've tried running https://github.com/achingbrain/libp2p-webrtc-ping on two MacOS machines on a LAN with both running Chrome but I still can't replicate the issue.

If the problem is that the open event is firing before the listener is added, introducing network latency should decrease the chance of this happening, since it would take longer for the data channel to become open and for the event to fire, so this may be a red herring.

If anything we'd want to reduce latency to expose the issue since the time between the check and the adding of the event listener will be significantly less than 1ms as the code is all synchronous.

Because the code is synchronous it cannot be interrupted. Assuming it all happens in the current tick, there is no way the event can fire before the listener is added: the callback from native code that causes the event to fire is only invoked when the callback queue is processed, which happens after the current tick has ended (i.e. the call stack is empty), whereas the listener is added synchronously, within the current tick.
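The run-to-completion argument can be demonstrated with a small sketch, using `setTimeout` to stand in for the queued native callback:

```javascript
// Demonstrates the run-to-completion argument: a callback queued by native
// code (simulated here with setTimeout) can only run after the current tick
// ends, so a listener added synchronously in that tick cannot miss the event.
const target = new EventTarget()

// queue the "native" open notification right now, before the listener exists
setTimeout(() => target.dispatchEvent(new Event('open')), 0)

let caught = false

// the readyState-style check and the listener registration happen in the same
// tick, so the queued dispatch cannot interleave between them
target.addEventListener('open', () => { caught = true }, { once: true })

setTimeout(() => {
  // prints true: the dispatch ran only after the listener was in place
  console.log('listener saw the event:', caught)
}, 10)
```

Under these semantics the reported behaviour should be impossible, which is why a reliable reproduction matters: it would show where the assumption breaks down.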

If the native callback is executed in a way that interrupts the JS call stack, the bug is in the browser since that's not how JavaScript works as pointed out.

I need more assistance in replicating this issue as it doesn't seem possible as described.

justin0mcateer commented 4 weeks ago

Thanks again for taking the time to dig into this. We will work on an automated method of reproducing this issue, possibly based on the test repo you put together.

We will try to get to this as quickly as possible, but it may take a couple of weeks.

achingbrain commented 4 weeks ago

Awesome, thanks - I'll check back here in a couple of weeks.