signalapp / Signal-Desktop

A private messenger for Windows, macOS, and Linux.
https://signal.org/download
GNU Affero General Public License v3.0

Make Signal Work at the South Pole #6899

Closed · g1a55er closed this 3 weeks ago

g1a55er commented 3 weeks ago

Description

I was reading a post on the popular brr.fyi blog when I came across a familiar-looking "unnamed chat app 1" that reportedly did not work at the South Pole. I was quite disappointed to see that WhatsApp... errrr... "unnamed chat app 2" works fine in those conditions, so I decided to treat the post as a bug report.

The anonymous author was kind enough to provide the following debug log from the field:

There are a few things that come to mind when analyzing the problem of improving performance on low-throughput, high-latency connections:

1. The WebSocket establishment handshake takes many round trips (SYN, SYN-ACK, ACK, ClientHello, ServerHello, TLS negotiation, WebSocket upgrade). I would imagine "competitor app 2" is probably using QUIC or something similar by now to consolidate these down to the bare minimum. In my opinion, this is the main technical root cause of the performance differential. However, I completely understand that organizational and resource constraints probably make this a high-investment, longer-term item to fix.

2. The WebSocket abstraction provided by the TypeScript library gives us no window into whether the handshake is making progress. Ideally, I'd like to bail out once we stop advancing through the stages of the bootstrap handshake, but keep going (resetting the timer) as long as we are making good progress. Unfortunately, we have no visibility into whether we are still making progress or the attempt has already failed; we only get a single connection event once the connection is fully established.

3. So we're left with a trade-off between improving success rates on low-end connections and increasing wasted wait time in the disconnected failure state while doomed connection attempts slowly time out. Ideally, we'd have data to drive this decision: How many failures are we seeing? How long is the P99 success case? How many false positives would a given connection timeout value produce? I don't have access to any data like that for the Signal user base, so I just chose a simple <10s, 20s, 30s, 30s, ...> progression (sketched below). Hopefully this improves success rates without hurting reconnect speed for most users.
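
To make point 3 concrete, here is a minimal sketch of the kind of timeout progression I have in mind. This is not the actual code in this PR; the `ws` package, `connectWithTimeout`, and `TIMEOUT_PROGRESSION_MS` are illustrative names, and the point is just that each successive attempt gets a longer deadline for the whole bootstrap, capped at 30 seconds:

```ts
// Illustrative sketch only, not Signal-Desktop's real connection code.
import WebSocket from 'ws';

// Hypothetical progression: 10s for the first attempt, 20s for the second,
// then 30s for every attempt after that.
const TIMEOUT_PROGRESSION_MS = [10_000, 20_000, 30_000];

function timeoutForAttempt(attempt: number): number {
  const index = Math.min(attempt, TIMEOUT_PROGRESSION_MS.length - 1);
  return TIMEOUT_PROGRESSION_MS[index];
}

// Open a WebSocket, but reject if the full bootstrap (TCP + TLS + upgrade)
// has not finished within the attempt's deadline. Because the library only
// fires 'open' once everything is done, we cannot tell "still making
// progress" apart from "stalled" here -- hence the coarse per-attempt timer.
function connectWithTimeout(url: string, attempt: number): Promise<WebSocket> {
  return new Promise((resolve, reject) => {
    const deadlineMs = timeoutForAttempt(attempt);
    const socket = new WebSocket(url);

    const timer = setTimeout(() => {
      socket.terminate();
      reject(new Error(`attempt ${attempt} timed out after ${deadlineMs}ms`));
    }, deadlineMs);

    socket.once('open', () => {
      clearTimeout(timer);
      resolve(socket);
    });
    socket.once('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}
```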

I'd like to take a deeper dive into these issues if I get some more time. 🤞

Test Strategy

[x] Ran `yarn ready` and got a pass
[x] Ran `yarn start` and nothing seemed obviously more broken than staging normally is.

g1a55er commented 3 weeks ago

I saw this needed some work when I did some testing earlier this morning using the macOS link conditioner under a custom "South Pole" profile.

[Screenshot: macOS link conditioner custom "South Pole" profile settings]

I based these settings on the conditions reported in the blog post.

With these changes, I can now connect to Signal with the link conditioner set to those settings if I am very, very patient. Diving into how the app behaves, there are a few other areas for improvement I spotted that I'm noting down for my own reference:

jamiebuilds-signal commented 3 weeks ago

Thank you for the pull request. We're actually working on a longer-term project to move a lot of this into a Rust library, and we're discussing how we could improve this there. However, just increasing the timeouts pushes the problem downstream to a lot of places in the app and causes other issues. So I think for now we'll defer doing this.