johanhelsing / matchbox

Painless peer-to-peer WebRTC networking for rust wasm (and native!)
Apache License 2.0

Connection error when using matchbox_server deployed in GCP #183

Closed tdzienniak closed 1 year ago

tdzienniak commented 1 year ago

I tried to deploy the signaling server via Docker on GCP, but I get weird panics in Chromium-based browsers when trying to establish a new connection using this server. When I run the same Docker image locally, everything works. It also works in Firefox, even with the remote signaling server. The error says:

panicked at 'failed to forward signal to handshaker: TrySendError { kind: Disconnected }', /home/dummy/matchbox/matchbox_socket/src/webrtc_socket/mod.rs:189:53

You can reproduce this problem with the simple example by just changing the signaling server URL to wss://signaling.borygo.club/test_room (it's the server I deployed). The Docker image is here.

Does anyone have any ideas what could be the problem?

johanhelsing commented 1 year ago

Sounds like it could be related to #153? We haven't put much effort into handling the case where the signaling server closes the connection; perhaps that's what's happening somehow?

johanhelsing commented 1 year ago

I don't suppose you get any tracing errors or warnings?

simbleau commented 1 year ago

Also sounds to me like this error is caused when the signaler panics; we removed the unwrap for that a week or two ago. Is your matchbox_server up to date? I'm not sure I would burn the calories on this either; probably better to use the new signaling API, since we're preparing for a new release.

tdzienniak commented 1 year ago

> I don't suppose you get any tracing errors or warnings?

@johanhelsing I can send a complete stack trace in the evening, but it's mostly filled with cryptic WASM references, so I don't know how helpful it is.

> Also sounds to me like this error is caused when the signaler panics, which we removed the unwrap for a week or two ago. Is your matchbox_server up to date?

@simbleau I'm using the version from the main branch.

The weird thing is that when I run the server locally (binding to 0.0.0.0), it works. The connection can't be established between peers using a Chromium-based browser only when I deploy the server on a VM in GCP. The WS connection works fine, so maybe there is something in the generated offer itself that is causing the problem. I'm also not very familiar with what this WebRTC offer should contain, so it's hard for me to debug.

simbleau commented 1 year ago

Hm. Yeah the only thing I can think of is to generate more debug information, so it's easier for us to fix.

Side question: Can you try other browsers?

Something tells me that binding to 0.0.0.0 isn't friendly in GCP, or the GCP VM has a symmetric NAT or something weird.

I do plan on crossing this road pretty soon too, since I'll be deploying some services to AWS. For now I'm limited on bandwidth, developing other parts of Matchbox. Hoping Johan can be more helpful in the interim, but I think this will solve itself when I get to this hurdle in the next week or two.

johanhelsing commented 1 year ago

I was thinking about logs from `warn!` and `error!`

johanhelsing commented 1 year ago

So it looks like the new docker image is broken:

[screenshot]

It exits with code 0 no matter what arguments you pass it.

0.5:

[screenshot]

johanhelsing commented 1 year ago

I'd like to fix this ASAP, as the 0.6 crates are already out. Anyone got an idea what could be wrong? @simbleau @garryod ?

johanhelsing commented 1 year ago

I tried running it from within the builder image; same thing there.

tdzienniak commented 1 year ago

@johanhelsing I actually used a different Docker image I built myself - not the official one. The matchbox_server inside it was running just fine.

johanhelsing commented 1 year ago

oh, ok. sorry for hijacking your issue then. Continuing my issue in #197

tdzienniak commented 1 year ago

@johanhelsing @simbleau I'm attaching two screenshots from sessions using the simple example from the repo.

This is the first one, which is successful. In this case I used matchbox_server running locally:

[screenshot: success]

This is the second one; you can see that the first client panicked after the connection was established. Here I was using matchbox_server deployed to GCP:

[screenshot: failure]

As you can see in these examples, the order of the messages in the console is different. Maybe that is the clue: the signaling loop in the webrtc_socket module is somehow sensitive to signal order...

simbleau commented 1 year ago

> @johanhelsing @simbleau I'm attaching two screenshots from sessions using the simple example from the repo.
>
> This is the first one, which is successful. In this case I used matchbox_server running locally:
>
> [screenshot: success]
>
> This is the second one; you can see that the first client panicked after the connection was established. Here I was using matchbox_server deployed to GCP:
>
> [screenshot: failure]
>
> As you can see in these examples, the order of the messages in the console is different. Maybe that is the clue: the signaling loop in the webrtc_socket module is somehow sensitive to signal order...

Right, I pretty much experience this TrySendError every time the signaling panics.

That's my guess. Surely you have a way to get the logs on GCP? I can't help if I don't have the signaling server's logs.

tdzienniak commented 1 year ago

@simbleau, the signaling server doesn't panic.

But I managed to find the root cause. As you can see in the screenshot I posted, the IceCandidate signal is received after the handshake is completed. At that point, the UnboundedReceiver connected to the sender in the handshake_signals hash map has already been dropped, so the UnboundedSender is closed. That's why the unbounded_send fails and panics. I'm referring to the signaling loop in this file (and specifically to this line). I managed to fix this by checking whether the sender is closed and only sending a signal if it is not: look here.

One thing that I don't like about the fix is that it doesn't solve the root cause. I'll try to come up with a better solution, but I'd like to ask you if I'm thinking in the right direction.

simbleau commented 1 year ago

Ah yes. I believe there are unwraps like this all over the place.

If a client panics, all of its channels get dropped, and unfortunately we don't have guards in the places that need them to prevent the attached peers from also panicking.

This is an ongoing issue... #91

As for this issue, I think your solution is great: we should eliminate that unwrap and print a warning that the signal will be ignored.

If there's a bug that reproduces this, we should fix it, but the change should still be applied, since the root cause could be any reason the channel gets dropped (e.g. a client exits their program mid-communication).

johanhelsing commented 1 year ago

This is a regression from #128; it was not a panic on wasm in 0.5. We should probably make a 0.6.1 release.