signalapp / Signal-Android

A private messenger for Android.
https://signal.org
GNU Affero General Public License v3.0

Signal not working on Mobile across a number of devices and connections #11839

Closed quaqo closed 2 years ago

quaqo commented 2 years ago

Bug description

For the past few days, Signal has stopped working for me, my wife, and a number of friends and family members on different mobile networks in Italy. As soon as a device connects to WiFi, messages get routed; otherwise there is no communication in or out.

A friend even tried uninstalling and reinstalling Signal, and it wouldn't let him re-register, saying "no network".

This happens on a range of devices from different vendors (Samsung, Xiaomi, Huawei), Android versions (8, 9, 10 and 11), and different data and WiFi providers. The only things in common are the location (Italy) and the app version (5.27.13).

Steps to reproduce

Actual result: Spinning icon until connected to WiFi. No messages delivered in or out.
Expected result: Messages should be delivered.

Device info

Device: Samsung, Xiaomi, Huawei (various)
Android version: 8, 9, 10 and 11
Signal version: 5.27.13

Link to debug log

I'm attaching the debug log from one of the devices; the errors are the same across all devices (I think the SSL handshake fails).

https://debuglogs.org/b8d5664acd597b3447f7c36a53fd91161b7420c611a149c8b2a86090220c85fb

quaqo commented 2 years ago

Another debug log with more data:

https://debuglogs.org/1621f1fc967624e2d1cc84bbbd346a54bdfc9aa068a9c4f247c4b9684a00ae4b

quaqo commented 2 years ago

I verified independently that, on 3 different data providers, connections to:

https://textsecure-service.whispersystems.org/

fail (also outside the Signal app). They time out.

Connections to: https://chat.signal.org are OK.
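A minimal sketch of how this kind of reachability check could be scripted so that non-technical testers get a comparable result, assuming curl is installed. The function names `classify_curl_exit` and `probe_url` are hypothetical; the exit codes are curl's documented ones (6 resolve failure, 7 connect failure, 28 timeout, 35 TLS connect error).

```shell
#!/bin/sh
# Map a curl exit status to a coarse failure class.
classify_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    6)  echo "dns-failure" ;;
    7)  echo "connect-failed" ;;
    28) echo "timeout" ;;
    35) echo "tls-handshake-failed" ;;
    *)  echo "other($1)" ;;
  esac
}

# Probe one URL with a 10-second cap and print "<url> <class>".
# Hypothetical helper; it needs curl and a live network connection.
probe_url() {
  curl --silent --output /dev/null --max-time 10 "$1"
  rc=$?
  echo "$1 $(classify_curl_exit "$rc")"
}
```

Running `probe_url https://textsecure-service.whispersystems.org/` and `probe_url https://chat.signal.org` once on mobile data and once on WiFi would give four lines that can be compared directly.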

cody-signal commented 2 years ago

Thanks for the extra info. I've passed it onto the team. I'm curious if you can test with an iOS device to see if it's Android specific or something at a higher level.

Edit: Additionally, is there anything unique about the setup of the devices like proxies/vpns/etc?

quaqo commented 2 years ago

Thanks for the extra info. I've passed it onto the team. I'm curious if you can test with an iOS device to see if it's Android specific or something at a higher level.

Edit: Additionally, is there anything unique about the setup of the devices like proxies/vpns/etc?

Thanks. Actually that's what I wanted to try, but we don't have access to iOS apart from iPads (I'm "debugging" on behalf of 9 other people: I'm the only tech savvy one and the one responsible for making the other users switch to Signal... :-)).

I guided them all and the result is the same: even though it resolves to the same IP, https://textsecure-service.whispersystems.org/ is not accessible over mobile data, only over WiFi.

The thing is that we all have different mobile providers and fixed WiFi providers... So it's very very strange that this is constant.

I tried setting up a VPN service between my phone and my home network and it works, so I'm inclined to say it's not an android problem, but I have no way to confirm it.

It could be solved as I saw a commit that switches the domains.

Yet I emphasize how strange it is that "https://textsecure-service.whispersystems.org/" is not accessible across such a wide range of devices and providers. I think I'm missing something... If you have other tests to propose, please do. I'll coordinate with the other "non techie" people that I know are experiencing this issue.

quaqo commented 2 years ago

I tried setting up a VPN service between my phone and my home network and it works, so I'm inclined to say it's not an android problem, but I have no way to confirm it.

Tried the other way around. Linux PC using the phone's connection with Wireshark on it.

TCP connection to:

quaqo commented 2 years ago

As an interesting update: I have ONE of the 9 people do the same test and he can't connect to https://textsecure-service.whispersystems.org/ even on WiFi (nor to https://chat.signal.org), BUT they can send messages (only on WiFi).

So maybe it's not related. Are there other endpoints to test?

UPDATE: I stand corrected, he didn't run the test properly. He can't connect to either on mobile data, but he can connect on WiFi.

quaqo commented 2 years ago

Ok we tried:

On most phones TEXTSECURE-SERVICE is blocked, on others CDN or CDN2, but only on Mobile. They work on WiFi.

quaqo commented 2 years ago

Edit: Additionally, is there anything unique about the setup of the devices like proxies/vpns/etc?

No.

The only commonality is that different mobile data providers are not letting https://textsecure-service.whispersystems.org/ (or, in one case, also https://chat.signal.org) through.

Apart from that I can't see anything else for the time being. I'm open to suggestions.

Btw, I contacted other friends and my "group" just got bigger: we're now 11 people with this problem. These two people just assumed Signal had been down for a few days and resorted to WhatsApp...

quaqo commented 2 years ago

An additional data point. I managed to find 2 iOS users: can't access the URL, but I see that iOS uses chat.signal.org.

cody-signal commented 2 years ago

apart from iPads

ipads would still work as a valid test to see if it's networking level or android level.

It could be solved as I saw a commit that switches the domains.

You can try out the 5.28.x beta via https://community.signalusers.org/t/beta-feedback-for-the-upcoming-android-5-28-release/39659

Tried the other way around. Linux PC using the phone's connection with Wireshark on it.

Are you able to run a DNS lookup on textsecure while tethered to your mobile network and see what comes back? Does the same IP come back that you tried or something else?

I'm going to escalate again.

quaqo commented 2 years ago

apart from iPads

ipads would still work as a valid test to see if it's networking level or android level.

Please see this. It seems to be at the networking level, but very widespread, at least geographically (I can't find any comments on the Internet from people with the same issue!).

It could be solved as I saw a commit that switches the domains.

You can try out the 5.28.x beta via https://community.signalusers.org/t/beta-feedback-for-the-upcoming-android-5-28-release/39659

Ok will do.

Tried the other way around. Linux PC using the phone's connection with Wireshark on it.

Are you able to run a DNS lookup on textsecure while tethered to your mobile network and see what comes back? Does the same IP come back that you tried or something else?

I did. They resolve

On Mobile:

Addresses:  76.223.92.165
          13.248.212.111

On WiFi:

Addresses:  13.248.212.111
          76.223.92.165
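The two lookups above return the same addresses in a different order, which is normal for round-robin DNS. A small sketch of an order-insensitive comparison follows; the helper names `normalize_ips` and `same_ip_set` are hypothetical.

```shell
#!/bin/sh
# Normalize a whitespace-separated list of addresses:
# one per line, sorted, de-duplicated.
normalize_ips() {
  # $1 is intentionally unquoted so the shell splits it on whitespace.
  printf '%s\n' $1 | sort -u
}

# Succeed (exit 0) when both lists contain exactly the same addresses.
same_ip_set() {
  [ "$(normalize_ips "$1")" = "$(normalize_ips "$2")" ]
}

# Example with the addresses reported in the thread:
# same_ip_set "76.223.92.165 13.248.212.111" "13.248.212.111 76.223.92.165" && echo "same set"
```

If the two sets match, DNS resolution on the mobile network is unlikely to be where the two paths diverge.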

I'm going to escalate again.

quaqo commented 2 years ago

It could be solved as I saw a commit that switches the domains.

You can try out the 5.28.x beta via https://community.signalusers.org/t/beta-feedback-for-the-upcoming-android-5-28-release/39659

Ok will do.

Beta 5.28.3 -> It worked a few minutes. Then it stopped again.

So maybe the routing issue is not the only/main culprit.

Attached debug log:

https://debuglogs.org/94c6055ebe1921c97d8d96608c146863b9e937e39691239dd090c70e022bf531

cody-signal commented 2 years ago

Looking into it from our side, we can't find much wrong. Everything appears to be working as expected. ~Is everyone you are testing with on a Xiaomi or Huawei device or is it across vendors as well?~ (You mentioned the issue with connecting on iOS directly so not a vendor thing.) Would you be able to let us know what carriers you are encountering issues with?

quaqo commented 2 years ago

Sure. Vendors are

Carriers are:

I have to add one carrier that WORKS:

This carrier started out as an ISP, and it's the only one where I can see the routing happening on the same network, since I also happen to have a WiFi connection with it.

gram-signal commented 2 years ago

It sounds like you might have access to packet capture to debug this; I'm wondering if we're doing a TCP handshake at all, or if it's failing as part of that. If you're comfortable with it, would you be willing to capture and provide a pcap? Given the name resolution, this BPF filter should narrow things down: "host 76.223.92.165 or host 13.248.212.111". I'm specifically interested in whether a SYN/ACK packet is coming back from the initial SYN you're sending.
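As a sketch of how the capture filter above could be assembled for any number of addresses, assuming tcpdump (or another libpcap tool) is used for the capture: the helper names are hypothetical, while the `host ... or host ...` form and the `tcp[tcpflags]` expression are standard pcap filter syntax.

```shell
#!/bin/sh
# Join host addresses into a pcap filter: "host A or host B ...".
bpf_for_hosts() {
  filter=""
  for h in "$@"; do
    if [ -z "$filter" ]; then
      filter="host $h"
    else
      filter="$filter or host $h"
    fi
  done
  printf '%s\n' "$filter"
}

# Narrow the same filter to segments that have both SYN and ACK set.
bpf_synack_from_hosts() {
  printf '(%s) and tcp[tcpflags] & (tcp-syn|tcp-ack) == (tcp-syn|tcp-ack)\n' \
    "$(bpf_for_hosts "$@")"
}

# Possible usage (assumes Linux, root privileges, and tcpdump installed):
# tcpdump -i any -n -w signal.pcap "$(bpf_for_hosts 76.223.92.165 13.248.212.111)"
```

The second helper is only for a quick on-screen check of whether any SYN/ACK comes back; a capture meant for sharing would use the broader filter so the whole exchange is recorded.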

If the SYN/ACK is coming back (and again, if your test platform has the capability), it would also be very interesting to see the output of this:

echo '' | openssl s_client -connect chat.signal.org:443 -servername chat.signal.org -showcerts

quaqo commented 2 years ago

I'm specifically interested in whether a SYN/ACK packet is coming back from the initial SYN you're sending.

It does, but then the server doesn't ACK the Client Hello packet. This is what happens:

out

If the SYN/ACK is coming back (and again, if your test platform has the capability), it would also be very interesting to see the output of this

$ echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts
CONNECTED(00000004)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 349 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

If I try with chat.signal.org, after the "Client Hello" I get an ACK back and then "Server Hello" and the whole SSL handshake takes place and everybody is happy.
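The output above shows a completed TCP connection (`CONNECTED`), a reset while writing (`errno=104` is ECONNRESET on Linux), and zero bytes read, so no Server Hello ever arrived. A rough sketch of classifying saved `openssl s_client` output along those lines follows; the function name and the labels are hypothetical, and the string matching is a heuristic rather than a parser.

```shell
#!/bin/sh
# Read `openssl s_client` output on stdin and print a coarse verdict.
classify_s_client() {
  s_out=$(cat)
  case "$s_out" in
    *"BEGIN CERTIFICATE"*)
      # The server sent a certificate, so the handshake progressed.
      echo "certificate-received" ;;
    *"no peer certificate available"*)
      case "$s_out" in
        *"CONNECTED("*) echo "tcp-ok-no-server-hello" ;;
        *)              echo "no-tcp-connection" ;;
      esac ;;
    *)
      echo "unknown" ;;
  esac
}

# Possible usage:
# echo '' | openssl s_client -connect chat.signal.org:443 \
#   -servername chat.signal.org -showcerts 2>&1 | classify_s_client
```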

quaqo commented 2 years ago

As an addendum it seems that with the beta I have to rectify this statement:

Beta 5.28.3 -> It worked a few minutes. Then it stopped again.

Messages now work erratically (most of the time they do, so chat.signal.org does make a difference), but received media are not displayed properly: they're shown blurred and are not actually downloaded until there's a WiFi connection ("Download only on WiFi" and similar options are not active, just to be clear).

Sending media doesn't work at all on mobile data.

gram-signal commented 2 years ago

Went back and forth a bit with AWS; we believe there is some issue in the path of its routing of the requests through to our backend servers. We utilize a Global Accelerator to pull cross-region traffic over AWS' backbone to our serving region, then an Elastic Load Balancer to choose which server gets the request, and our current hypothesis is that the issue lies in one of these pieces of infra.

We should be able to diagnose which by bypassing the Global Accelerator for some requests but not others. IIUC, the following test should help us narrow down to a specific subsystem:

# Repeat test from before; by hitting the textsecure-service.whispersystems.org FQDN, we're going through GA
echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts

# Now hit IPs of the ELB without routing them through GA.  If this works, we believe the issue is with GA; if it fails, ELB appears to be the cause
echo '' | openssl s_client -connect Signal-Production-956696213.us-east-1.elb.amazonaws.com:443 -servername textsecure-service.whispersystems.org -showcerts
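The reasoning behind the two commands above can be written down as a small decision table. This is only a sketch of the hypothesis as stated at this point in the thread; the function name and the labels are hypothetical, and each argument is expected to be `ok` or `fail`.

```shell
#!/bin/sh
# $1: result of the handshake via the FQDN (traffic goes through GA)
# $2: result of the handshake sent straight to the ELB hostname
suspect_component() {
  case "$1:$2" in
    ok:ok)     echo "none" ;;
    fail:ok)   echo "global-accelerator" ;;   # bypassing GA fixes it
    fail:fail) echo "load-balancer" ;;        # fails even without GA
    ok:fail)   echo "direct-elb-path" ;;      # only the bypass fails
    *)         echo "invalid-input" ;;
  esac
}

# Example: the FQDN test times out, the direct ELB test succeeds.
# suspect_component fail ok
```

The table holds only under the assumption that the two tests differ in nothing but the GA hop; the second command also connects to a different hostname and address, so a result that points at GA is a lead to follow up rather than proof.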

If possible, could we get an actual PCAP file associated with a failure? AWS support is interested in checking out things in more detail (IP/TCP flags/options/ttl/etc). I will note, however, that this could contain your NIC's IP and MAC address; if you're interested in not making those public, totally understand.

quaqo commented 2 years ago

Went back and forth a bit with AWS; we believe there is some issue in the path of its routing of the requests through to our backend servers. We utilize a Global Accelerator to pull cross-region traffic over AWS' backbone to our serving region, then an Elastic Load Balancer to choose which server gets the request, and our current hypothesis is that the issue lies in one of these pieces of infra.

I see. Thanks for taking me seriously and looking so promptly into this! I already have some of the people I mentioned dismissing the use of Signal because of this, and as a long time user that finally got many people on board this year it kinda puts me down! Also I want to use my messenger of choice!

We should be able to diagnose which by bypassing the Global Accelerator for some requests but not others. IIUC, the following test should help us narrow down to a specific subsystem:

# Repeat test from before; by hitting the textsecure-service.whispersystems.org FQDN, we're going through GA
echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts

This connection times out as before. Nothing back.

# Now hit IPs of the ELB without routing them through GA.  If this works, we believe the issue is with GA; if it fails, ELB appears to be the cause
echo '' | openssl s_client -connect Signal-Production-956696213.us-east-1.elb.amazonaws.com:443 -servername textsecure-service.whispersystems.org -showcerts

This works! I get a good SSL handshake and the correct self-signed certificate is presented back to me.

If possible, could we get an actual PCAP file associated with a failure? AWS support is interested in checking out things in more detail (IP/TCP flags/options/ttl/etc). I will note, however, that this could contain your NIC's IP and MAC address; if you're interested in not making those public, totally understand.

As a Signal user I value my privacy. I understand the need for this, but I'd appreciate it if I could skip it while still being of help.

Thank you.

undrivendev commented 2 years ago

I seem to have the same problem described by @quaqo, with messages working erratically, on the provider 'Ho Mobile'. I didn't go so far as to collect network logs, but I suspect it's the same problem. With another carrier (Fastweb Mobile) everything works fine.

quaqo commented 2 years ago

@gram-signal or @cody-signal, are you able to provide any update? Thank you!

jon-signal commented 2 years ago

Hello! I'm Jon from the server engineering team. @gram-signal is out this week, so I'll be taking over the server side of things from here.

Thank you for the additional reports and detailed debugging. It sounds like we have a reasonably clear picture of what's going on. We'll be taking steps today to resolve the issue, and I'll report back shortly.

quaqo commented 2 years ago

Hello! I'm Jon from the server engineering team. @gram-signal is out this week, so I'll be taking over the server side of things from here.

Hi Jon! Right.

Thank you for the additional reports and detailed debugging. It sounds like we have a reasonably clear picture of what's going on. We'll be taking steps today to resolve the issue, and I'll report back shortly.

Oh, that's great news! Thanks!

Have a nice day!

jon-signal commented 2 years ago

Friends, we're still working to resolve the root cause of the issue, but for now, we've routed things around what we believe to be a problematic piece of infrastructure for Italian users. At your convenience, please let us know if things seem to be working better for you.

Thank you!

quaqo commented 2 years ago

Friends, we're still working to resolve the root cause of the issue, but for now, we've routed things around what we believe to be a problematic piece of infrastructure for Italian users. At your convenience, please let us know if things seem to be working better for you.

Just FYI this:

echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts

still fails.

jon-signal commented 2 years ago

Ah--I misread a critical detail earlier. My apologies. I've redirected chat.signal.org, but not textsecure-service.whispersystems.org. I'll revise what we're doing and check back in shortly.

jon-signal commented 2 years ago

Friends, thanks for your patience. We've now applied the routing change for textsecure-service.whispersystems.org in Italy. If you could give things another try and let us know how it goes, that would be greatly appreciated.

undrivendev commented 2 years ago

Just tried

echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts

And it still fails. chat.signal.org is working fine instead. Provider: 'Ho Mobile'.

Edit: attached command output. textsecure-service.whispersystems.org.log chat.signal.org.log

quaqo commented 2 years ago

Just tried

echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts

And it still fails. chat.signal.org is working fine instead. Provider: 'Ho Mobile'.

I confirm it fails

I can still send messages (the latest Android version uses chat.signal.org as its endpoint), but I can't send or receive media, nor interact with messages containing media (e.g. reply to them).

jon-signal commented 2 years ago

Thank you for following up! I'm sorry to hear that things still aren't working; this is helpful information that helps us close in on the root cause.

My current hypothesis is that some mobile providers are actually blocking these specific domains, and it's not actually a routing issue at this point. To test that hypothesis, I'm afraid I have to ask for some more information and testing.

First, can you please share the output of dig +short textsecure-service.whispersystems.org and dig +short chat.signal.org (via the affected mobile connection, of course)? If those domains are, indeed, pointing to the same IP addresses, that strongly suggests the problem isn't in the route, but has something to do with the domain name itself. If not, there may be something with our "route around the obstacle" strategy.

Next, let's make sure that the DNS entry for our global accelerator instance is working as expected: dig +short ac88393aca5853df7.awsglobalaccelerator.com. I expect you'll find two IP addresses in the response, but let's double-check to make sure.

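To rule out an internet routing problem, let's see if we can connect directly to the GA IPs:

  • netcat -vz 13.248.212.111 443
  • netcat -vz 76.223.92.165 443

Finally, it'd be helpful to understand the route to both GA IPs:

  • traceroute 13.248.212.111
  • traceroute 76.223.92.165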

I recognize that the traceroute output may be more than you're comfortable sharing in public forum; if everything else is working as expected and you'd rather not share, that's completely fine. If you're having trouble reaching the GA IPs, though, it'd be very helpful to understand the routes involved, and we can try to work out a more private channel for sharing that information. Let's cross that bridge if we get to it, though.

Whew.

Again, please accept my thanks for helping us dig into this issue! To recap, the main thrust here is that we're testing a hypothesis that the routes themselves are fine and that the same traffic traveling along the same route with different domain names will wind up with different results. With that information, we can narrow down where, in the vast internet, this problem is really happening.

Cheers!
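One way to test the hypothesis above directly is to hold the destination IP fixed and vary only the server name presented in the TLS handshake. The sketch below assumes openssl and the coreutils `timeout` command are available; all function names are hypothetical, and the probe is passed in by name so the matrix logic can be exercised without a network.

```shell
#!/bin/sh
# Real probe: TLS handshake to a fixed IP ($1) presenting $2 as SNI.
probe_sni() {
  if echo '' | timeout 15 openssl s_client -connect "$1:443" \
       -servername "$2" 2>/dev/null | grep -q 'BEGIN CERTIFICATE'; then
    echo ok
  else
    echo fail
  fi
}

# Run PROBE against one IP for each server name; print "ip name result".
sni_matrix() {
  probe=$1; ip=$2; shift 2
  for name in "$@"; do
    printf '%s %s %s\n' "$ip" "$name" "$("$probe" "$ip" "$name")"
  done
}

# Read matrix lines on stdin; succeed when the same IP produced both
# an "ok" and a "fail", which is what name-based filtering predicts.
name_dependent() {
  results=$(cat)
  printf '%s\n' "$results" | grep -q ' ok$' &&
    printf '%s\n' "$results" | grep -q ' fail$'
}

# Possible usage on the affected mobile connection:
# sni_matrix probe_sni 13.248.212.111 chat.signal.org \
#   textsecure-service.whispersystems.org | name_dependent && echo "depends on name"
```

If the same address answers for one name and not the other, the route is effectively ruled out, since both handshakes travel the same path up to the point where the server name becomes visible.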

quaqo commented 2 years ago

Thank you for following up! I'm sorry to hear that things still aren't working; this is helpful information that helps us close in on the root cause.

No worries! Thanks to you for your help.

My current hypothesis is that some mobile providers are actually blocking these specific domains, and it's not actually a routing issue at this point.

Honestly, I don't think that's the problem. This explanation by @gram-signal sounded more likely:

Went back and forth a bit with AWS; we believe there is some issue in the path of its routing of the requests through to our backend servers. We utilize a Global Accelerator to pull cross-region traffic over AWS' backbone to our serving region, then an Elastic Load Balancer to choose which server gets the request, and our current hypothesis is that the issue lies in one of these pieces of infra.

Anyways, I'm in no position to verify so I'll still help debug this, but it wouldn't make sense to actively block a single domain in a geographic region (Italy) if this is what you're suggesting.

To test that hypothesis, I'm afraid I have to ask for some more information and testing.

First, can you please share the output of dig +short textsecure-service.whispersystems.org and dig +short chat.signal.org (via the affected mobile connection, of course)? If those domains are, indeed, pointing to the same IP addresses, that strongly suggests the problem isn't in the route, but has something to do with the domain name itself. If not, there may be something with our "route around the obstacle" strategy.

I had already explained the situation here. chat.signal.org and textsecure-service.whispersystems.org do point to the same IPs.

What seems to have changed now (I don't know if this is part of what you did to bypass the problem) is that instead of pointing to 13.248.212.111 / 76.223.92.165, textsecure-service.whispersystems.org is a CNAME for chat.signal.org, which in turn resolves to a bunch of AWS nodes.

Here are the outputs:

$ dig +short textsecure-service.whispersystems.org
chat.signal.org.
3.217.34.249
52.202.220.55
54.227.133.78
52.86.184.26
52.206.227.248
34.196.10.25
3.225.46.143
52.54.79.162
$ dig +short chat.signal.org
18.235.44.161
35.171.56.38
52.22.161.181
35.175.8.205
3.221.216.251
54.165.62.59
54.227.133.78
54.147.219.6
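The two answers above share only one address even though the first name is a CNAME for the second. A likely explanation is that the load balancer's DNS name returns a changing subset of its addresses on each query, so two lookups made moments apart need not match. A sketch for comparing saved `dig +short` outputs follows; the helper names are hypothetical.

```shell
#!/bin/sh
# Keep only IPv4 lines from `dig +short` output on stdin; this drops
# CNAME targets such as "chat.signal.org.".
ipv4_only() {
  grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' | sort -u
}

# Print the addresses that appear in both outputs ($1 and $2).
common_ips() {
  b=$(printf '%s\n' "$2" | ipv4_only)
  printf '%s\n' "$1" | ipv4_only | while read -r ip; do
    if printf '%s\n' "$b" | grep -Fqx -- "$ip"; then
      echo "$ip"
    fi
  done
}

# Possible usage:
# a=$(dig +short textsecure-service.whispersystems.org)
# b=$(dig +short chat.signal.org)
# common_ips "$a" "$b"
```

Repeating the lookup a few times and collecting the union of addresses for each name would give a fairer comparison than a single pair of queries.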

Next, let's make sure that the DNS entry for our global accelerator instance is working as expected: dig +short ac88393aca5853df7.awsglobalaccelerator.com. I expect you'll find two IP addresses in the response, but let's double-check to make sure.

This does resolve to the original IPs I mentioned in my first debug round:

dig +short ac88393aca5853df7.awsglobalaccelerator.com
13.248.212.111
76.223.92.165

To rule out an internet routing problem, let's see if we can connect directly to the GA IPs:

  • netcat -vz 13.248.212.111 443
  • netcat -vz 76.223.92.165 443

Positive. As explained here, no problem connecting directly to the IPs OR to chat.signal.org

Finally, it'd be helpful to understand the route to both GA IPs:

  • traceroute 13.248.212.111
  • traceroute 76.223.92.165

I recognize that the traceroute output may be more than you're comfortable sharing in public forum; if everything else is working as expected and you'd rather not share, that's completely fine. If you're having trouble reaching the GA IPs, though, it'd be very helpful to understand the routes involved, and we can try to work out a more private channel for sharing that information. Let's cross that bridge if we get to it, though.

Whew.

Again, please accept my thanks for helping us dig into this issue! To recap, the main thrust here is that we're testing a hypothesis that the routes themselves are fine and that the same traffic traveling along the same route with different domain names will wind up with different results. With that information, we can narrow down where, in the vast internet, this problem is really happening.

Cheers!

As explained here, this happens across different ISPs, most of which have entirely separate network infrastructure (they don't even connect to the same IXPs).

The only common thing is geographical location as far as I can tell.

I also want to stress that this issue started on Dec 9th and had never manifested before. As far as I know, that's the same day AWS had problems.

I didn't report it straight away because at first I thought I'd give it time, and then I didn't know so many people were affected.

When I started to hear from family, friends and colleagues that "the app you made me install, Signal, is unreliable", I started investigating and opened the ticket here.

jon-signal commented 2 years ago

Thanks again for following up! There's some new information here, but you've also expressed some concerns. Let's review the situation as a whole to make sure we're on the same page, then try to move forward with the debugging process.

You wrote:

This explanation by @gram-signal sounded more likely...

...there is some issue in the path of its routing of the requests through to our backend servers.

I agree! That certainly did sound more likely until we gathered more evidence that makes it seem less likely now. I know I made a mistake earlier in updating our "route around GA" rules yesterday, but I promise I have been following and understand the conversation up to this point.

To recap, initial testing showed that you could connect to Signal's infrastructure by using chat.signal.org or by connecting directly to the load balancer, but not by going through textsecure-service.whispersystems.org. In other words:

Endpoint                                                     openssl s_client -connect result
textsecure-service.whispersystems.org:443                    Timeout
Signal-Production-956696213.us-east-1.elb.amazonaws.com:443  Success
chat.signal.org:443                                          Success
76.223.92.165:443                                            Success
13.248.212.111:443                                           Success

We also verified in https://github.com/signalapp/Signal-Android/issues/11839#issuecomment-992595371 that textsecure-service.whispersystems.org resolved to:

  • 76.223.92.165
  • 13.248.212.111

...and those are our Global Accelerator addresses.

At this point, we made a change such that chat.signal.org points directly to our load balancer (Signal-Production-956696213.us-east-1.elb.amazonaws.com) instead of global accelerator (76.223.92.165/13.248.212.111) for users in Italy. We also changed textsecure-service.whispersystems.org to be a CNAME record that points to chat.signal.org. It does appear that those changes are visible to your local nameserver, and that our domains are now pointing at the load balancer instead of the global accelerator for you.

Now, having made those changes, my understanding of the testing matrix is that we're still looking at:

Endpoint                                                     openssl s_client -connect result
textsecure-service.whispersystems.org:443                    Timeout
Signal-Production-956696213.us-east-1.elb.amazonaws.com:443  Success
chat.signal.org:443                                          Success
76.223.92.165:443                                            Success
13.248.212.111:443                                           Success

In other words, we've verified that the domains in question are indeed pointing at new targets, but the results have not changed. Between this and the previous round of testing, we're able to demonstrate that we can connect to Signal's servers in several different ways:

  1. By connecting directly to the load balancer
  2. By connecting through global accelerator
  3. By connecting through a domain that's pointed to the load balancer (chat.signal.org after yesterday's changes)
  4. By connecting through a domain that's pointed to the global accelerator (chat.signal.org before yesterday's changes)

We also believe that several other domains remain similarly unreachable, including:

  • https://cdn.signal.org
  • https://cdn2.signal.org
  • https://storage.signal.org

What's exceptionally strange about that situation is that those domains are not only not behind our load balancer or global accelerator, they're hosted by different providers entirely (i.e. not AWS). That said, you wrote:

As explained here this happens across different ISPs most of which have entirely separate network infrastructure (they don't even connect to the same IXPs).

I promise I understand that, and I'm just as puzzled as you are! Not only do we have multiple providers in Italy having difficulty connecting, they're having difficulty connecting to multiple different services on our end. That's very, very strange.

At this point, I don't want to speculate too much about what's going on (the point is still to form hypotheses and test them with evidence), but I do want to assure you that I am and have been reading everything carefully and I hear you. With that in mind, it may be that these mobile network operators share a common vendor for (again, speculating wildly) spam filtering software, and that third-party vendor has mistakenly added these domains to the list. I don't mean to assert that's what is happening, but I don't think it's impossible.

Again, the thing that would really, really help us understand what's going on here is traceroute output. Would you be willing to send the output of the following to me directly?

If so, please feel free to email me at [my first name] at signal.org!

Once again, thanks for your patience and help in debugging all this. I know it's been a long road, and I understand that Signal is just plain not working for you and your contacts right now. We'll get through this.

quaqo commented 2 years ago

Thanks again for following up! There's some new information here, but you've also expressed some concerns. Let's review the situation as a whole to make sure we're on the same page, then try to move forward with the debugging process.

Thanks again to you, Jon. Before we move on, I feel the need to apologize: I gather from your response that I somehow conveyed the wrong message and implied something negative with my comment about what is more or less likely. That was not my intention at all; English is not my mother tongue, and if I offended you I sincerely apologize. I don't want to cause any misunderstanding about this!

You wrote:

This explanation by @gram-signal sounded more likely...

...there is some issue in the path of its routing of the requests through to our backend servers.

I agree! That certainly did sound more likely until we gathered more evidence that makes it seem less likely now. I know I made a mistake earlier in updating our "route around GA" rules yesterday, but I promise I have been following and understand the conversation up to this point.

To recap, initial testing showed that you could connect to Signal's infrastructure by using chat.signal.org or by connecting directly to the load balancer, but not by going through textsecure-service.whispersystems.org. In other words:

Endpoint                                                     openssl s_client -connect result
textsecure-service.whispersystems.org:443                    Timeout
Signal-Production-956696213.us-east-1.elb.amazonaws.com:443  Success
chat.signal.org:443                                          Success
76.223.92.165:443                                            Success
13.248.212.111:443                                           Success

We also verified in #11839 (comment) that textsecure-service.whispersystems.org resolved to:

  • 76.223.92.165
  • 13.248.212.111

...and those are our Global Accelerator addresses.

At this point, we made a change such that chat.signal.org points directly to our load balancer (Signal-Production-956696213.us-east-1.elb.amazonaws.com) instead of global accelerator (76.223.92.165/13.248.212.111) for users in Italy. We also changed textsecure-service.whispersystems.org to be a CNAME record that points to chat.signal.org. It does appear that those changes are visible to your local nameserver, and that our domains are now pointing at the load balancer instead of the global accelerator for you.

Now, having made those changes, my understanding of the testing matrix is that we're still looking at:

Endpoint                                                     openssl s_client -connect result
textsecure-service.whispersystems.org:443                    Timeout
Signal-Production-956696213.us-east-1.elb.amazonaws.com:443  Success
chat.signal.org:443                                          Success
76.223.92.165:443                                            Success
13.248.212.111:443                                           Success

In other words, we've verified that the domains in question are indeed pointing at new targets, but the results have not changed. Between this and the previous round of testing, we're able to demonstrate that we can connect to Signal's servers in several different ways:

  1. By connecting directly to the load balancer
  2. By connecting through global accelerator
  3. By connecting through a domain that's pointed to the load balancer (chat.signal.org after yesterday's changes)
  4. By connecting through a domain that's pointed to the global accelerator (chat.signal.org before yesterday's changes)

Everything you wrote is exactly right.

We also believe that several other domains remain similarly unreachable, including:

  • https://cdn.signal.org
  • https://cdn2.signal.org
  • https://storage.signal.org

Interestingly, these were not unreachable initially, but they are now! (Just checked)

Here I wrote that some users in my group were experiencing issues with CDN or CDN2, but not consistently.

I have now verified that it is very consistent. They show the same issue as textsecure-service.whispersystems.org.

Since chat.signal.org is the new endpoint in every up-to-date release, as far as I can tell, am I correct in thinking that our inability to process media properly is related to these other addresses?

What's exceptionally strange about that situation is that those domains are not only not behind our load balancer or global accelerator, they're hosted by different providers entirely (i.e. not AWS). That said, you wrote:

As explained here, this happens across different ISPs, most of which have entirely separate network infrastructure (they don't even connect to the same IXPs).

I promise I understand that, and I'm just as puzzled as you are! Not only do we have multiple providers in Italy having difficulty connecting, they're having difficulty connecting to multiple different services on our end. That's very, very strange.

At this point, I don't want to speculate too much about what's going on (the point is still to form hypotheses and test them with evidence), but I do want to assure you that I am and have been reading everything carefully and I hear you. With that in mind, it may be that these mobile network operators share a common vendor for (again, speculating wildly) spam filtering software, and that third-party vendor has mistakenly added these domains to the list. I don't mean to assert that's what is happening, but I don't think it's impossible.

Spam filtering was one of the first hypotheses, except that when I used packet inspection to see what was going on, I immediately noticed that the behaviour was that of a DPI firewall.

Having a friend who is an engineer on one of the affected networks, I asked him to raise this internally, and he informally relayed that, as far as they were aware, those packets were not being filtered. I realize now that this might not be the case: that network is very complex, and those engineers (I know for a fact) can't vouch for the whole network, only for specific subregions.

Also, I can add this: on very specific cells on mobile data with MY provider, textsecure-service.whispersystems.org doesn't fail. I can't run this test on the whole group of people involved, though.

Let me know if you might need any details on these things I wasn't very specific about before. I don't know which one they could be, but if you have questions, please ask.

Again, the thing that would really, really help us understand what's going on here is traceroute output. Would you be willing to send the output of the following to me directly?

  • traceroute textsecure-service.whispersystems.org
  • traceroute Signal-Production-956696213.us-east-1.elb.amazonaws.com
  • traceroute chat.signal.org
  • traceroute ac88393aca5853df7.awsglobalaccelerator.com
  • traceroute 76.223.92.165
  • traceroute 13.248.212.111

If so, please feel free to email me at [my first name] at signal.org!
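For anyone gathering the same data, the six traceroute runs listed above could be collected in one pass with something like the following. This is only a minimal sketch: it assumes a `traceroute` binary is on the PATH, and the names `TARGETS`, `collect_traceroutes`, and `format_report` are illustrative, not part of any Signal tooling.

```python
import shutil
import subprocess

# Targets copied from the list above.
TARGETS = (
    "textsecure-service.whispersystems.org",
    "Signal-Production-956696213.us-east-1.elb.amazonaws.com",
    "chat.signal.org",
    "ac88393aca5853df7.awsglobalaccelerator.com",
    "76.223.92.165",
    "13.248.212.111",
)


def collect_traceroutes(targets=TARGETS, timeout=120):
    """Run traceroute once per target and return {target: raw output}."""
    tool = shutil.which("traceroute")  # assumption: traceroute is installed
    if tool is None:
        raise RuntimeError("traceroute was not found on the PATH")
    results = {}
    for target in targets:
        try:
            done = subprocess.run(
                [tool, target], capture_output=True, text=True, timeout=timeout
            )
            results[target] = done.stdout + done.stderr
        except subprocess.TimeoutExpired:
            results[target] = f"(gave up after {timeout} seconds)"
    return results


def format_report(results):
    """Join the per-target outputs into one plain-text report."""
    sections = []
    for target, output in results.items():
        sections.append(f"== traceroute {target} ==\n{output.rstrip()}\n")
    return "\n".join(sections)
```

Calling `print(format_report(collect_traceroutes()))` while on mobile data, with WiFi and any VPN switched off, would produce a single block of text that can be pasted into an email.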

I'm going to collect the data and email you as soon as possible. It is getting late here in Italy and I'm on family duty, so I'm afraid I'm not sure I'll be able to do it right away!

Once again, thanks for your patience and help in debugging all this. I know it's been a long road, and I understand that Signal is just plain not working for you and your contacts right now. We'll get through this.

Thanks to you!

jon-signal commented 2 years ago

Before we move on, I feel the need to apologize. I gather from your response that I somehow conveyed the wrong message and implied something negative with my comment about what is more likely or not. That was not my intention at all; English is not my mother tongue, and if I offended you, I do apologize.

No worries at all, friend! I took no offense, but given the circumstances (this ticket changing hands from one engineer to another, the earlier mistake, and the re-treading over ground we had covered earlier in the week), I did see how it could seem like I wasn't paying attention or was just skimming the conversation so far. I hope that I haven't offended you, either!

Your English is so good that it never occurred to me that it might not be your first language ;)

I'm going to collect the data and email you as soon as possible. It is getting late here in Italy and I'm on family duty, so I'm afraid I'm not sure I'll be able to do it right away!

Again, no worries at all! I'll look forward to hearing from you when you have the opportunity.

Thank you for the continued research and all of the additional detail!

quaqo commented 2 years ago

Before we move on, I feel the need to apologize. I gather from your response that I somehow conveyed the wrong message and implied something negative with my comment about what is more likely or not. That was not my intention at all; English is not my mother tongue, and if I offended you, I do apologize.

No worries at all, friend! I took no offense, but given the circumstances (this ticket changing hands from one engineer to another, the earlier mistake, and the re-treading over ground we had covered earlier in the week), I did see how it could seem like I wasn't paying attention or was just skimming the conversation so far. I hope that I haven't offended you, either!

No, absolutely, I'm not offended: for what!?

Your English is so good that it never occurred to me that it might not be your first language ;)

Oh thanks.

I'm going to collect the data and email you as soon as possible. It is getting late here in Italy and I'm on family duty, so I'm afraid I'm not sure I'll be able to do it right away!

Again, no worries at all! I'll look forward to hearing from you when you have the opportunity.

Thank you for the continued research and all of the additional detail!

Talk to you soon, then!

quaqo commented 2 years ago

Jon, I apologize for the delay. I sent you an email.

jon-signal commented 2 years ago

Friends, it's been a while since we shared an update. I regret that we don't yet have any concrete answers, but I did want to let you all know that we're still actively working on this. Thank you to everybody who has contributed time, attention, and debug data to this whole effort!

mpix21 commented 2 years ago

@jon-signal I guess somebody else in Italy is having the same issue as there's a detailed write-up here: https://www.reddit.com/r/signal/comments/rzvx9n/signal_app_being_filteredintercepted_by_italian/

jon-signal commented 2 years ago

I guess somebody else in Italy is having the same issue as there's a detailed write-up here: https://www.reddit.com/r/signal/comments/rzvx9n/signal_app_being_filteredintercepted_by_italian/

Thank you!

jon-signal commented 2 years ago

Friends, at this point, we're trying to get in touch with a mobile engineer at one of the affected mobile carriers. If you know somebody who works in the network engineering department at Vodafone Italy (or any of the other affected carriers), please let me know! Again, you can reach me at [my first name] at signal dot org.

orazioedoardo commented 2 years ago

OONI measurements also show several failures for AS30722.

marrco commented 2 years ago

@jon-signal I guess somebody else in Italy is having the same issue as there's a detailed write-up here: https://www.reddit.com/r/signal/comments/rzvx9n/signal_app_being_filteredintercepted_by_italian/

I have the same issue in Italy, using ho-mobile on 2 different Android phones (Honor and Poco). Text messages get through, but photos and audio/video calls do not.

[edit - this was wrong information: A person living near me using a Samsung phone and the same ho-mobile carrier is reporting NOT having that issue, though I could not test that personally.]

Activating a WireGuard tunnel, removing Signal from memory, and restarting it does the trick.

mpix21 commented 2 years ago

@jon-signal I guess somebody else in Italy is having the same issue as there's a detailed write-up here: https://www.reddit.com/r/signal/comments/rzvx9n/signal_app_being_filteredintercepted_by_italian/

I have the same issue in Italy, using ho-mobile on 2 different Android phones (Honor and Poco). Text messages get through, but photos and audio/video calls do not. A person living near me using a Samsung phone and the same ho-mobile carrier is reporting NOT having that issue, though I could not test that personally.

Activating a WireGuard tunnel, removing Signal from memory, and restarting it does the trick.

I'd get my hands on the Samsung phone and re-run the checks if I were you, tbh. I had to argue with quite a few people who said it was working before they actually ran the exact steps to reproduce: they were either looking only at the messages or running on WiFi with a different ISP and/or VPN.

Regarding WireGuard, that's ok - that means you're "bypassing" the Ho-mobile network inspection and they can't touch the TLS handshake at all. All VPNs would fix the issue, as they all encapsulate the data in an encrypted tunnel invisible to the Ho Mobile network inspection tools.

marrco commented 2 years ago

@jon-signal I guess somebody else in Italy is having the same issue as there's a detailed write-up here: https://www.reddit.com/r/signal/comments/rzvx9n/signal_app_being_filteredintercepted_by_italian/

I have the same issue in Italy, using ho-mobile on 2 different Android phones (Honor and Poco). Text messages get through, but photos and audio/video calls do not. A person living near me using a Samsung phone and the same ho-mobile carrier is reporting NOT having that issue, though I could not test that personally. Activating a WireGuard tunnel, removing Signal from memory, and restarting it does the trick.

I'd get my hands on the Samsung phone and re-run the checks, tbh. I had to argue with quite a few people who said it was working before they actually ran the exact steps to reproduce: they were either looking only at the messages or running on WiFi with a different ISP and/or VPN.

Regarding WireGuard, that's ok - that means you're "bypassing" the Ho-mobile network inspection and they can't touch the TLS handshake at all. All VPNs would fix the issue, as they all encapsulate the data in an encrypted tunnel invisible to the Ho Mobile network inspection tools.

You're right, I asked them to re-test with WiFi off. I can confirm that only texts are working and images don't get through. This is with ho.mobile and a Samsung phone, in Italy.

I don't have an exact date, but I'd say the issue appeared around mid December.

paride commented 2 years ago

I can confirm the behavior (messages pass, attachments do not) and the approximate time the outage began. I can reproduce the issue on a Xiaomi phone. I tried configuring the phone with the Xiaomi-specific APN settings [1] but the change made no difference.

[1] https://supporto.ho-mobile.it/t5/Soluzioni-per-configurare/Configuro-internet-sul-mio-smartphone/ta-p/112

lukastribus commented 2 years ago

It's SNI-based blocking: if the SNI in the client_hello contains anything within whispersystems.org, the session is killed at the TCP level.

That's why chat.signal.org works.

This has nothing to do with routing, traceroute or DNS resolution.

Use openssl's servername argument to test different SNI values:

openssl s_client -connect textsecure-service.whispersystems.org:443 -servername google.com

SNI values:

  • google.com --> WORKS
  • chat.signal.org --> WORKS
  • textsecure-service1.whispersystems.org --> FAIL
  • textsecure-service.whispersystems1.org --> WORKS

Without SNI (-noservername) it also WORKS fine.

It appears TTL could also be a factor, further tests are necessary.
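The same kind of SNI test can be approximated without the openssl CLI. The sketch below uses Python's standard `ssl` module and is illustrative only: `probe_sni` and `run_sni_matrix` are made-up names, and the WORKS/FAIL labels simply mirror the notation above. Certificate verification is switched off on purpose, because the SNI is deliberately mismatched and only handshake completion matters.

```python
import socket
import ssl


def probe_sni(host, sni, port=443, timeout=10.0):
    """Return "WORKS" if a TLS handshake to host:port completes with the
    given SNI value, otherwise "FAIL"."""
    context = ssl.create_default_context()
    # The SNI intentionally does not match the server, so skip verification.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # server_hostname becomes the SNI in the ClientHello;
            # None sends no SNI at all (comparable to -noservername).
            with context.wrap_socket(raw, server_hostname=sni):
                return "WORKS"
    except OSError:
        # Covers timeouts, TCP resets, refused connections and ssl.SSLError.
        return "FAIL"


def run_sni_matrix(host, sni_values, port=443, timeout=10.0):
    """Probe one host with several SNI values and return {sni: result}."""
    return {sni: probe_sni(host, sni, port, timeout) for sni in sni_values}
```

For example, `run_sni_matrix("textsecure-service.whispersystems.org", ["google.com", "chat.signal.org", "textsecure-service1.whispersystems.org", None])` would repeat the checks above from an affected connection. Note that a "FAIL" here only says the handshake did not complete; some servers also reject unknown SNI values on their own, so results are best compared against an unaffected network.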

lukastribus commented 2 years ago

Reverse check: let's connect to google.com instead of Signal, but use a whispersystems.org SNI:

Connecting to google.com:443 with SNI www.whispersystems.org fails:

openssl s_client -connect google.com:443 -servername www.whispersystems.org

Connecting to google.com:443 with SNI www.whispersystems1.org works:

openssl s_client -connect google.com:443 -servername www.whispersystems1.org

This should clear any shadow of doubt that this is an SNI-based block of the entire whispersystems.org domain.

mpix21 commented 2 years ago

It's not only the whole *.whispersystems.org domain; also check out:

cdn.signal.org
cdn2.signal.org
storage.signal.org

so it's also some signal.org subdomains FYI

lukastribus commented 2 years ago

Correct, and for the record, any single character after cdn is included, so cdn9.signal.org, cdnA.signal.org, and cdnz.signal.org are all blocked.
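Putting the observations so far together, the apparent matching rule can be written down as a pattern. This is a sketch inferred purely from the test results reported in this thread, not from any known filter configuration, and `looks_blocked` is a hypothetical helper name. Two points are assumptions: that the bare whispersystems.org name is covered as well as its subdomains, and that the "single character" after cdn is a letter or digit.

```python
import re

# Inferred from the results reported in this thread; not an official list.
_BLOCKED_SNI = re.compile(
    r"(?:.+\.)?whispersystems\.org"   # ASSUMPTION: apex plus any subdomain
    r"|cdn[a-z0-9]?\.signal\.org"     # cdn, cdn2, cdnA, ... (one optional char)
    r"|storage\.signal\.org",
    re.IGNORECASE,
)


def looks_blocked(sni):
    """Return True if the SNI value matches the pattern observed so far."""
    return _BLOCKED_SNI.fullmatch(sni.rstrip(".")) is not None
```

With this pattern, `looks_blocked("cdn2.signal.org")` and `looks_blocked("www.whispersystems.org")` are true, while `looks_blocked("chat.signal.org")` and `looks_blocked("textsecure-service.whispersystems1.org")` are false, which matches the results reported above.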

jon-signal commented 2 years ago

It's SNI-based blocking: if the SNI in the client_hello contains anything within whispersystems.org, the session is killed at the TCP level.

Hello, and thanks for this additional debugging! While I realize I should have posted a more complete summary of the situation sooner, this matches our current diagnosis and understanding of the problem.

I think it's fair to say we're firmly out of the "diagnosis" phase of things and squarely into the "get it fixed" phase. I emphasize that our main goal right now is to get in touch with somebody at one of the affected mobile carriers; if you have any points of contact you can share, please email me at [my first name] at signal dot org.