Sounds like it could be related to #153? We haven't put much effort into handling the case where the signaling server closes the connection, so perhaps that's what's happening somehow?
I don't suppose you get any tracing errors or warnings?
Also sounds to me like this error is caused when the signaler panics; we removed the unwrap for that a week or two ago. Is your matchbox_server up to date? I'm not sure I would burn the calories on this either; it's probably better to use the new signaling API since we're preparing for a new release.
I don't suppose you get any tracing errors or warnings?
@johanhelsing I can send a complete stack trace in the evening, but it's mostly filled with cryptic WASM references, so I don't know how helpful it is.
Also sounds to me like this error is caused when the signaler panics; we removed the unwrap for that a week or two ago. Is your matchbox_server up to date? I'm not sure I would burn the calories on this either; it's probably better to use the new signaling API since we're preparing for a new release.
@simbleau I'm using the version from the main branch.
The weird thing is that when I run the server locally (binding to 0.0.0.0), it works. The connection between peers fails only in Chromium-based browsers, and only when I deploy the server on a VM in GCP. The WS connection is working fine, so maybe there is something in the generated offer itself that is causing the problem. I'm also not very familiar with what this WebRTC offer should contain, so it's hard for me to debug.
Hm. Yeah the only thing I can think of is to generate more debug information, so it's easier for us to fix.
Side question: Can you try other browsers?
Something tells me that binding to 0.0.0.0 isn't friendly in GCP, or the GCP VM has a symmetric NAT or something weird.
I do plan on crossing this road pretty soon too, since I'll be deploying some services to AWS. For now I'm limited on bandwidth, developing other parts of Matchbox. Hoping Johan can be more helpful in the interim, but I think this will solve itself when I get to this hurdle in the next week or two.
I was thinking about logs from warn! and error!
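For reference, a minimal way to surface those in the browser console for a wasm build looks roughly like this. I'm assuming the log facade plus the console_log and console_error_panic_hook crates here; the simple example may wire logging up differently (e.g. via tracing):

```rust
use log::Level;

fn init_wasm_logging() {
    // Route log macros (warn!, error!, ...) to the browser console.
    console_log::init_with_level(Level::Warn).expect("failed to init console_log");
    // Show panic messages in the console instead of a bare
    // "RuntimeError: unreachable executed".
    console_error_panic_hook::set_once();
}
```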
So it looks like the new Docker image is broken:
It exits with code 0 no matter what arguments you pass it.
0.5:
I'd like to fix this ASAP, as the 0.6 crates are already out. Anyone got an idea what could be wrong? @simbleau @garryod?
I tried running it from within the builder image; same thing there.
@johanhelsing I actually used a different Docker image I built myself - not the official one. The matchbox_server inside it was running just fine.
Oh, ok. Sorry for hijacking your issue then. Continuing my issue in #197.
@johanhelsing @simbleau I'm attaching two screenshots from sessions using the simple example from the repo.
This is the first one, which is successful. In this case I used matchbox_server running locally:
This is the second one; you can see that the first client panicked after the connection was established. Here I was using matchbox_server deployed to GCP:
As you can see in these examples, the order of the messages in the console is different. Maybe that is the clue: the signaling loop in the webrtc_socket module is somehow sensitive to signal order...
Right, I pretty much experience this TrySendError every time the signaling panics.
That's my guess. Surely you have a way to get the logs on GCP? I can't help if I don't have the signaling server's logs.
@simbleau, the signaling server doesn't panic.
But I managed to find the root cause. As you can see in the screenshot I posted, the IceCandidate signal is received after the handshake is completed. At that point, the UnboundedReceiver connected to the sender in the handshake_signals hash map has already been dropped, so the UnboundedSender is closed. That's why the unbounded_send fails and panics. I'm referring to the signaling loop in this file (and specifically to this line). I managed to fix this by checking whether the sender is closed and only sending a signal if it is not: look here.
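Roughly, the guard looks like this. This is a simplified sketch using the futures unbounded-channel types, not the exact code from the repo, and the helper name and usage are mine:

```rust
use futures_channel::mpsc::{unbounded, UnboundedSender};
use log::warn;

/// Forward a signal to the per-peer handshake task, skipping it if the
/// receiving end was already dropped (i.e. the handshake has completed).
fn forward_signal<T: std::fmt::Debug>(sender: &UnboundedSender<T>, signal: T) {
    if sender.is_closed() {
        // Previously this path did `sender.unbounded_send(signal).unwrap()`,
        // which panics with a TrySendError once the receiver is gone.
        warn!("ignoring late signal for a completed handshake: {signal:?}");
        return;
    }
    if sender.unbounded_send(signal).is_err() {
        // There is still a small race between the check and the send.
        warn!("failed to forward signal: receiver was dropped");
    }
}

fn main() {
    let (tx, rx) = unbounded::<&str>();
    drop(rx); // simulate the handshake future finishing and dropping its receiver
    forward_signal(&tx, "IceCandidate"); // warned about and ignored instead of panicking
}
```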
One thing that I don't like about the fix is that it doesn't solve the root cause. I'll try to come up with a better solution, but I'd like to ask you if I'm thinking in the right direction.
Ah yes. I believe there are unwraps like this all over the place.
If a client panics, all of the channels get dropped, and unfortunately we don't have guards in the places that need them to prevent the attached peers from also panicking.
This is an ongoing issue... #91
As for this issue, I think your solution is great: we should eliminate that unwrap and print a warning that the signal will be ignored.
If there's a bug that reproduces this, we should fix it, but the change should still be applied, since the root cause could be any reason the channel gets dropped (e.g. a client exiting their program mid-communication).
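Concretely, the call site would look something like this after removing the unwrap (illustrative only; signal_tx, the signal type, and the function are placeholders, not the actual names in the repo):

```rust
use futures_channel::mpsc::UnboundedSender;
use log::warn;

/// Hypothetical shape of the call site after removing the unwrap: a closed
/// channel just produces a warning and the late signal is dropped.
fn send_or_warn(signal_tx: &UnboundedSender<String>, signal: String) {
    // Before: signal_tx.unbounded_send(signal).unwrap(); // panics if the receiver is gone
    if let Err(err) = signal_tx.unbounded_send(signal) {
        warn!("discarding signal, channel is closed: {}", err.into_send_error());
    }
}
```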
This is a regression introduced in #128. It was not a panic on wasm in 0.5. We should probably make a 0.6.1 release.
I tried to deploy the signaling server via Docker on GCP, but I get weird panics in Chromium-based browsers when trying to establish a new connection using this server. When I run the same Docker image locally, everything works. It also works in Firefox, even with the remote signaling server. The error says:
You can reproduce this problem using the simple example, just changing the signaling server URL to wss://signaling.borygo.club/test_room (it's the server I deployed). The Docker image is here. Does anyone have any ideas what could be the problem?