raiden-network / light-client

The Raiden Light Client
https://raiden.network/
MIT License

cli: fix wrtc bugs biting us #2227

Closed andrevmatos closed 3 years ago

andrevmatos commented 3 years ago

wrtc is a NodeJS native module providing WebRTC in this environment. It's very fast (written in C++) and exposes the same interface as the browser's WebRTC objects, allowing us to have a single codebase supporting both environments. There are, however, some bugs in this module which can become a bigger issue if raiden-cli is intended to be a major client on servers, and we need them fixed.
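
For illustration only (not the actual SDK wiring, and assuming DOM typings are available), the kind of environment switch that this shared interface allows could look like the sketch below:

```typescript
// Minimal sketch: pick the WebRTC implementation per environment.
// wrtc mirrors the browser's W3C WebRTC API, so downstream code
// doesn't need to know which implementation backs it.
const PeerConnectionCtor: typeof RTCPeerConnection =
  typeof RTCPeerConnection !== 'undefined'
    ? RTCPeerConnection // browser: native implementation
    : // NodeJS: wrtc's C++ implementation with the same interface
      // eslint-disable-next-line @typescript-eslint/no-var-requires
      require('wrtc').RTCPeerConnection;

// Environment-agnostic usage:
const connection = new PeerConnectionCtor({ iceServers: [] });
const dataChannel = connection.createDataChannel('messages');
dataChannel.onopen = () => dataChannel.send('hello');
```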

Most of the time, WebRTC data channels can be established and stay up for a long time, but on cli + wrtc they sometimes disconnect and can't get connected again. This also seems to take down all other WebRTC connections on that node, and the process starts consuming 100% of one CPU core.

Fortunately, we managed to detect this state and discard the channel (connectionState becomes failed), falling back to Matrix messages: the client and protocol remain usable (although wasting one core of CPU/power) and the client stays responsive, but it's a pity to lose WebRTC randomly for the rest of a session.
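
The detection described above amounts to something like this sketch; `fallbackToMatrix` and `scheduleRtcRetry` are hypothetical placeholders here, not actual SDK functions:

```typescript
// Hypothetical placeholders for the fallback and retry logic described above:
declare function fallbackToMatrix(): void;
declare function scheduleRtcRetry(): void;

// Watch the peer connection and discard it once it reaches the 'failed' state,
// so the protocol keeps working over Matrix messages instead.
function watchConnectionState(connection: RTCPeerConnection): void {
  connection.onconnectionstatechange = () => {
    if (connection.connectionState === 'failed') {
      fallbackToMatrix(); // client stays responsive, just slower/costlier
      scheduleRtcRetry(); // keep trying to re-establish WebRTC (see retry loop below)
    }
  };
}
```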

The entry point (and possibly a cause, or at least a lead to the cause) is a segfault if we try to close the connection. Maybe the 100% CPU issue happens because the data channel gets closed but the peer connection doesn't.
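
If that suspicion is right, the problematic teardown is roughly asymmetric, as in this illustrative sketch (not the actual SDK code):

```typescript
// Sketch of the suspected asymmetric teardown: closing only the data channel
// may leave the peer connection alive and wrtc spinning at 100% CPU.
function teardown(connection: RTCPeerConnection, channel: RTCDataChannel): void {
  channel.close();    // this alone seems to leave the connection dangling
  connection.close(); // this is the call that currently segfaults in wrtc,
                      // which is why it had to be commented out (see steps below)
}
```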

This seems to be triggered more commonly in high-concurrency scenarios, like BF6, or longer-running ones, like BF1. When you see a node process start consuming 100% of one core, you can check the logs and see an RTC: connection failed entry and the WebRTC retry loop kicking in (looping every 5-60s, failing and retrying).
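
That retry loop amounts to a backoff between roughly 5s and 60s; a simplified, hypothetical sketch (assuming an exponential backoff capped at 60s, which may differ from the actual SDK logic):

```typescript
// Hypothetical sketch of a 5-60s backoff retry loop; `tryEstablishRtc` is a
// placeholder for whatever re-creates the peer connection and data channel.
declare function tryEstablishRtc(): Promise<boolean>;

async function rtcRetryLoop(): Promise<void> {
  let delayMs = 5_000; // start at 5s
  for (;;) {
    if (await tryEstablishRtc()) return; // connected, stop retrying
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    delayMs = Math.min(delayMs * 2, 60_000); // cap the backoff at 60s
  }
}
```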

We should try to fix this bug ourselves upstream, pay for someone to fix it, or put a bounty out to get it fixed.

Some possibly related upstream bugs:

Steps to Reproduce

  1. Run BF1 and/or BF6 scenarios, check logs when/if one node process starts consuming 100% CPU.
  2. Possibly related to the same bug, or a different one: uncomment that connection.close() line, get a client connected via WebRTC with another (e.g. by having 2 raiden-cli clients with a Raiden channel between them, online at the same time), terminate one of them, and see the other segfault when closing the connection. Usually the terminated one also segfaults on connection.close() while shutting down. (A simplified sketch of the underlying wrtc calls follows this list.)
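
For reference, a heavily simplified, hypothetical in-process sketch of the wrtc calls exercised by step 2; the real reproduction uses two separate raiden-cli processes with a Raiden channel between them:

```typescript
// Connect two in-process wrtc peers over a data channel, drop one side,
// then close the surviving connection, which is where the segfault shows up.
// This only illustrates the wrtc API surface involved.
const { RTCPeerConnection } = require('wrtc');

async function connectPair() {
  const a = new RTCPeerConnection();
  const b = new RTCPeerConnection();

  // exchange ICE candidates directly between the two in-process peers
  a.onicecandidate = ({ candidate }) => candidate && b.addIceCandidate(candidate);
  b.onicecandidate = ({ candidate }) => candidate && a.addIceCandidate(candidate);

  const channel = a.createDataChannel('raiden');
  await a.setLocalDescription(await a.createOffer());
  await b.setRemoteDescription(a.localDescription);
  await b.setLocalDescription(await b.createAnswer());
  await a.setRemoteDescription(b.localDescription);

  await new Promise<void>((resolve) => {
    channel.onopen = () => resolve();
  });
  return { a, b };
}

connectPair().then(({ a, b }) => {
  b.close(); // simulate the terminated peer
  // the close() below is the kind of call that segfaults:
  setTimeout(() => a.close(), 1_000);
});
```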

Expected Result

Actual Result

Remark

andrevmatos commented 3 years ago

A (possibly more complex) alternative is to replace wrtc. There aren't many alternatives, but maybe the Go webrtc + wasm could provide a transparent replacement.

andrevmatos commented 3 years ago

Someone reported the situation may be better on Node14. We should test.

christianbrb commented 3 years ago

Is this resolved or accidentally closed?

palango commented 3 years ago

Closed by accident

andrevmatos commented 3 years ago

I've tested the CLI with NodeJS v14, which is the new active LTS, and also re-enabled the connection.close() call in the channel's finalize block. I could not reproduce the segfault, and the CLI instance indeed exited gracefully ~1s after being shut down, instead of the 10s force-exit we get without that connection.close call. This indicates two things: 1. it was the RTC connection that kept the event loop from exiting on normal shutdown; and 2. wrtc seems able to close connections safely, without segfaulting, on Node v14 (the earlier crash was possibly due to a build mismatch between the pre-built downloaded lib and Node 12). We need to test a bit more, especially to try to trigger that infamous infinite loop which prevented wrtc from working for a given session, but hopefully we'll be able to close this issue once we move our infra and minimum requirements to Node v14 LTS.
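
For context, a hedged sketch of what re-enabling that call in the finalize/teardown path looks like; the actual epic code differs:

```typescript
// Illustrative teardown only: with connection.close() re-enabled (safe again
// on Node v14), the RTC connection no longer keeps the event loop alive and
// the process can exit ~1s after shutdown instead of hitting the 10s force-exit.
function finalizeRtcChannel(connection: RTCPeerConnection, channel: RTCDataChannel): void {
  try {
    channel.close();
  } finally {
    connection.close(); // previously commented out because it segfaulted on Node 12
  }
}
```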

andrevmatos commented 3 years ago

Ok, further testing on the high-load BF6 scenario has shown that Node v14 indeed allows connections to be closed without crashing the node, which helps with proper teardown of pending connections and sockets. However, I reproduced the high-CPU/infinite-loop bug inside wrtc on this scenario, if (and it looks like only if) I disable the toDevice messages. Not sure what one thing has to do with the other; maybe the slower room messages are the trigger. Needs further investigation, and possibly getting our hands dirty with C++ code in order to fix this for good.

taleldayekh commented 3 years ago

Closing, our "workaround" makes this not an issue atm.