Closed: quaqo closed this issue 2 years ago
Another debug log with more data:
https://debuglogs.org/1621f1fc967624e2d1cc84bbbd346a54bdfc9aa068a9c4f247c4b9684a00ae4b
I verified independently that, on 3 different data providers, connections to:
https://textsecure-service.whispersystems.org/
fail (also outside the Signal app). They time out.
Connections to: https://chat.signal.org are OK.
Thanks for the extra info. I've passed it onto the team. I'm curious if you can test with an iOS device to see if it's Android specific or something at a higher level.
Edit: Additionally, is there anything unique about the setup of the devices like proxies/vpns/etc?
Thanks for the extra info. I've passed it onto the team. I'm curious if you can test with an iOS device to see if it's Android specific or something at a higher level.
Edit: Additionally, is there anything unique about the setup of the devices like proxies/vpns/etc?
Thanks. Actually that's what I wanted to try, but we don't have access to iOS apart from iPads (I'm "debugging" on behalf of 9 other people: I'm the only tech savvy one and the one responsible for making the other users switch to Signal... :-)).
I guided them all and the result is the same: even though it resolves to the same IP, https://textsecure-service.whispersystems.org/ is not accessible on mobile data, only on WiFi.
The thing is that we all have different mobile providers and fixed WiFi providers... So it's very very strange that this is constant.
I tried setting up a VPN service between my phone and my home network and it works, so I'm inclined to say it's not an android problem, but I have no way to confirm it.
It could be solved as I saw a commit that switches the domains.
Yet I emphasize how strange it is that "https://textsecure-service.whispersystems.org/" is not accessible across such a wide range of devices and providers. I think I'm missing something... If you have other tests to propose, please do. I'll coordinate with the other "non-techie" people who I know are experiencing this issue.
I tried setting up a VPN service between my phone and my home network and it works, so I'm inclined to say it's not an android problem, but I have no way to confirm it.
Tried the other way around. Linux PC using the phone's connection with Wireshark on it.
TCP connection to:
As an interesting update: I had ONE of the 9 people do the same test and he can't connect to https://textsecure-service.whispersystems.org/ even on WiFi (nor to https://chat.signal.org), BUT he can send messages (only on WiFi).
So maybe it's not related. Are there other endpoints to test?
UPDATE: I stand corrected, he didn't properly run the test. He can't connect to both on Mobile, but he can connect on WiFi.
Ok we tried:
On most phones TEXTSECURE-SERVICE is blocked, on others CDN or CDN2, but only on Mobile. They work on WiFi.
Edit: Additionally, is there anything unique about the setup of the devices like proxies/vpns/etc?
No.
The only commonality is that different mobile data providers are not letting https://textsecure-service.whispersystems.org/ (or, in 1 case, also https://chat.signal.org) through.
Apart from that I can't see anything else for the time being. I'm open to suggestions.
Btw, I tried to contact other friends and my "group" just got bigger: we're now 11 people who have this problem. These two people just assumed Signal had been down for a few days and resorted to WhatsApp...
An additional data point: I managed to find 2 iOS users. They can't access the URL either, but I see that iOS uses chat.signal.org.
apart from iPads
iPads would still work as a valid test to see whether it's at the networking level or the Android level.
It could be solved as I saw a commit that switches the domains.
You can try out the 5.28.x beta via https://community.signalusers.org/t/beta-feedback-for-the-upcoming-android-5-28-release/39659
Tried the other way around. Linux PC using the phone's connection with Wireshark on it.
Are you able to run a DNS lookup on textsecure while tethered to your mobile network and see what comes back? Does the same IP come back that you tried or something else?
I'm going to escalate again.
apart from iPads
iPads would still work as a valid test to see whether it's at the networking level or the Android level.
Please see this. It seems to be at the networking level, but very widespread, at least geographically (I can't find any comments on the Internet from people with the same issue!).
It could be solved as I saw a commit that switches the domains.
You can try out the 5.28.x beta via https://community.signalusers.org/t/beta-feedback-for-the-upcoming-android-5-28-release/39659
Ok will do.
Tried the other way around. Linux PC using the phone's connection with Wireshark on it.
Are you able to run a DNS lookup on textsecure while tethered to your mobile network and see what comes back? Does the same IP come back that you tried or something else?
I did. They resolve:
On Mobile:
Addresses: 76.223.92.165
13.248.212.111
On WiFi:
Addresses: 13.248.212.111
76.223.92.165
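Incidentally, the order difference between the two answers above is expected: resolvers commonly rotate A records (round-robin), so DNS answers should be compared as unordered sets. A minimal illustration in Python, using the addresses from the lookup above:

```python
# Compare two DNS answer lists as sets: record order varies between
# queries (round-robin rotation) and carries no meaning.
def same_answers(a, b):
    return set(a) == set(b)

mobile = ["76.223.92.165", "13.248.212.111"]   # answer seen on mobile data
wifi   = ["13.248.212.111", "76.223.92.165"]   # answer seen on WiFi
print(same_answers(mobile, wifi))  # → True: both paths resolve identically
```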
I'm going to escalate again.
It could be solved as I saw a commit that switches the domains.
You can try out the 5.28.x beta via https://community.signalusers.org/t/beta-feedback-for-the-upcoming-android-5-28-release/39659
Ok will do.
Beta 5.28.3 -> It worked for a few minutes. Then it stopped again.
So maybe the routing issue is not the only/main culprit.
Attached debug log:
https://debuglogs.org/94c6055ebe1921c97d8d96608c146863b9e937e39691239dd090c70e022bf531
Looking into it from our side, we can't find much wrong. Everything appears to be working as expected. ~~Is everyone you are testing with on a Xiaomi or Huawei device, or is it across vendors as well?~~ (You mentioned the issue connecting on iOS directly, so it's not a vendor thing.) Would you be able to let us know which carriers you are encountering issues with?
Sure. Vendors are
Carriers are:
I have to add one carrier that WORKS:
This carrier started out as an ISP, and it's the only one for which I can see the routing happening on the same network, since I happen to also have a WiFi connection with them.
It sounds like you might have access to packet capture to debug this; I'm wondering if we're doing a TCP handshake at all, or if it's failing as part of that. If you're comfortable with it, would you be willing to capture and provide a pcap? Given the name resolution, this BPF filter should narrow things down: "host 76.223.92.165 or host 13.248.212.111". I'm specifically interested in whether a SYN/ACK packet is coming back from the initial SYN you're sending.
If the SYN/ACK is coming back (and again, if your test platform has the capability), it would also be very interesting to see the output of this:
echo '' | openssl s_client -connect chat.signal.org:443 -servername chat.signal.org -showcerts
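For test platforms without openssl or tcpdump handy, roughly the same distinction (no SYN/ACK at all versus a reset after the ClientHello) can be probed with a short Python sketch. This is an illustrative heuristic, not Signal tooling; the classification helper is kept pure so its logic can be checked without network access:

```python
import socket
import ssl

def classify_failure(exc):
    """Map an exception from a TLS connection attempt to a rough diagnosis."""
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "tcp-timeout"              # likely no SYN/ACK, or a silent drop
    if isinstance(exc, ConnectionResetError):
        return "reset-during-handshake"   # TCP worked; RST arrived mid-handshake
    return "other"

def probe(host, port=443, timeout=10):
    """Attempt a full TLS handshake, as `openssl s_client` would."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return "ok"
    except OSError as exc:
        return classify_failure(exc)

# Example (needs network): probe("chat.signal.org")
```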
I'm specifically interested in whether a SYN/ACK packet is coming back from the initial SYN you're sending.
It does, but then there's no ACK to the Client Hello packet. This is what happens:
If the SYN/ACK is coming back (and again, if your test platform has the capability), it would also be very interesting to see the output of this
$ echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts
CONNECTED(00000004)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 349 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
If I try with chat.signal.org, after the "Client Hello" I get an ACK back and then "Server Hello" and the whole SSL handshake takes place and everybody is happy.
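For readers triaging similar s_client transcripts: `write:errno=104` is ECONNRESET, i.e. the peer (or a middlebox on the path) answered the ClientHello with a TCP RST, and "read 0 bytes" confirms no ServerHello ever arrived. A small, purely illustrative heuristic for telling these cases apart:

```python
def diagnose_s_client(output):
    """Rough triage of `openssl s_client` output (illustrative heuristic,
    not an OpenSSL API). errno 104 is ECONNRESET; 'read 0 bytes' means
    the server never answered the ClientHello."""
    if "CONNECTED" not in output:
        return "tcp-failed"                    # never completed the TCP handshake
    if "errno=104" in output and "read 0 bytes" in output:
        return "reset-after-client-hello"      # the failure transcribed above
    if "Verify return code: 0 (ok)" in output:
        return "handshake-ok"
    return "unclear"

failing = ("CONNECTED(00000004)\n"
           "write:errno=104\n"
           "SSL handshake has read 0 bytes and written 349 bytes\n"
           "Verify return code: 0 (ok)\n")
print(diagnose_s_client(failing))  # → reset-after-client-hello
```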
As an addendum it seems that with the beta I have to rectify this statement:
Beta 5.28.3 -> It worked for a few minutes. Then it stopped again.
Messages are now working erratically (though most of the time they do work, so chat.signal.org does make a difference), but received media are not displayed properly: they're shown blurred and aren't actually downloaded until there's a WiFi connection ("Download only on WiFi" and similar options are not active, just to be clear).
Sending media doesn't work at all on mobile data.
Went back and forth a bit with AWS; we believe there is some issue in the path of its routing of the requests through to our backend servers. We utilize a Global Accelerator to pull cross-region traffic over AWS' backbone to our serving region, then an Elastic Load Balancer to choose which server gets the request, and our current hypothesis is that the issue lies in one of these pieces of infra.
We should be able to diagnose which by bypassing the Global Accelerator for some requests but not others. IIUC, the following test should help us narrow down to a specific subsystem:
# Repeat test from before; by hitting the textsecure-service.whispersystems.org FQDN, we're going through GA
echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts
# Now hit IPs of the ELB without routing them through GA. If this works, we believe the issue is with GA; if it fails, ELB appears to be the cause
echo '' | openssl s_client -connect Signal-Production-956696213.us-east-1.elb.amazonaws.com:443 -servername textsecure-service.whispersystems.org -showcerts
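The decision logic behind this pair of probes can be sketched as follows (illustrative Python, not part of any Signal tooling). Note how the second probe pins the TCP destination to the ELB while still presenting the original hostname via SNI, so the only variable between the two is the path through GA:

```python
import socket
import ssl

def tls_probe(tcp_target, sni_name, port=443, timeout=10):
    """Connect the TCP socket to `tcp_target` but present `sni_name` in the
    TLS SNI extension -- the Python equivalent of
    `openssl s_client -connect <tcp_target>:443 -servername <sni_name>`.
    Certificate checks are disabled because the backend presents a
    self-signed certificate."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((tcp_target, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=sni_name):
                return "ok"
    except OSError:
        return "fail"

def interpret(ga_result, elb_result):
    """Decision rule for the two probes described above."""
    if ga_result == "fail" and elb_result == "ok":
        return "global-accelerator-implicated"
    if ga_result == "fail" and elb_result == "fail":
        return "elb-or-backend-implicated"
    return "path-ok"
```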
If possible, could we get an actual PCAP file associated with a failure? AWS support is interested in checking out things in more detail (IP/TCP flags/options/ttl/etc). I will note, however, that this could contain your NIC's IP and MAC address; if you're interested in not making those public, totally understand.
Went back and forth a bit with AWS; we believe there is some issue in the path of its routing of the requests through to our backend servers. We utilize a Global Accelerator to pull cross-region traffic over AWS' backbone to our serving region, then an Elastic Load Balancer to choose which server gets the request, and our current hypothesis is that the issue lies in one of these pieces of infra.
I see. Thanks for taking me seriously and looking into this so promptly! I already have some of the people I mentioned dismissing Signal because of this, and as a long-time user who finally got many people on board this year, it kinda puts me down! Also, I want to use my messenger of choice!
We should be able to diagnose which by bypassing the Global Accelerator for some requests but not others. IIUC, the following test should help us narrow down to a specific subsystem:
# Repeat test from before; by hitting the textsecure-service.whispersystems.org FQDN, we're going through GA
echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts
This connection times out as before. Nothing back.
# Now hit IPs of the ELB without routing them through GA. If this works, we believe the issue is with GA; if it fails, ELB appears to be the cause
echo '' | openssl s_client -connect Signal-Production-956696213.us-east-1.elb.amazonaws.com:443 -servername textsecure-service.whispersystems.org -showcerts
This works! I get a good SSL handshake and the correct self-signed certificate is presented back to me.
If possible, could we get an actual PCAP file associated with a failure? AWS support is interested in checking out things in more detail (IP/TCP flags/options/ttl/etc). I will note, however, that this could contain your NIC's IP and MAC address; if you're interested in not making those public, totally understand.
As a Signal user I value my privacy. I understand the need of this, but if I could back off from doing this while still being of help I'd appreciate it.
Thank you.
I seem to have the same problems described by @quaqo, with messages working erratically on the provider 'Ho Mobile'. I didn't go so far as to collect network logs, but I suspect it's the same problem. With another carrier (Fastweb Mobile), everything works fine.
@gram-signal or @cody-signal, are you able to provide any update? Thank you!
Hello! I'm Jon from the server engineering team. @gram-signal is out this week, so I'll be taking over the server side of things from here.
Thank you for the additional reports and detailed debugging. It sounds like we have a reasonably clear picture of what's going on. We'll be taking steps today to resolve the issue, and I'll report back shortly.
Hello! I'm Jon from the server engineering team. @gram-signal is out this week, so I'll be taking over the server side of things from here.
Hi Jon! Right.
Thank you for the additional reports and detailed debugging. It sounds like we have a reasonably clear picture of what's going on. We'll be taking steps today to resolve the issue, and I'll report back shortly.
Oh, that's great news! Thanks!
Have a nice day!
Friends, we're still working to resolve the root cause of the issue, but for now, we've routed things around what we believe to be a problematic piece of infrastructure for Italian users. At your convenience, please let us know if things seem to be working better for you.
Thank you!
Friends, we're still working to resolve the root cause of the issue, but for now, we've routed things around what we believe to be a problematic piece of infrastructure for Italian users. At your convenience, please let us know if things seem to be working better for you.
Just FYI this:
echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts
still fails.
Ah--I misread a critical detail earlier. My apologies. I've redirected chat.signal.org, but not textsecure-service.whispersystems.org. I'll revise what we're doing and check back in shortly.
Friends, thanks for your patience. We've now applied the routing change for textsecure-service.whispersystems.org in Italy. If you could give things another try and let us know how it goes, that would be greatly appreciated.
Just tried
echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts
And it still fails. chat.signal.org is working fine instead.
Provider: 'Ho Mobile'.
Edit: attached command output. textsecure-service.whispersystems.org.log chat.signal.org.log
Just tried
echo '' | openssl s_client -connect textsecure-service.whispersystems.org:443 -servername textsecure-service.whispersystems.org -showcerts
And it still fails. chat.signal.org is working fine instead. Provider: 'Ho Mobile'.
I confirm it fails
I can still send messages (the latest Android version uses chat.signal.org as the endpoint), but I can't send or receive media, nor interact with messages that contain media (e.g. reply to them).
Thank you for following up! I'm sorry to hear that things still aren't working, but this is helpful information that helps us close in on the root cause.
My current hypothesis is that some mobile providers are actually blocking these specific domains, and it's not actually a routing issue at this point. To test that hypothesis, I'm afraid I have to ask for some more information and testing.
First, can you please share the output of dig +short textsecure-service.whispersystems.org and dig +short chat.signal.org (via the affected mobile connection, of course)? If those domains are, indeed, pointing to the same IP addresses, that strongly suggests the problem isn't in the route, but has something to do with the domain name itself. If not, there may be something with our "route around the obstacle" strategy.
Next, let's make sure that the DNS entry for our global accelerator instance is working as expected: dig +short ac88393aca5853df7.awsglobalaccelerator.com. I expect you'll find two IP addresses in the response, but let's double-check to make sure.
To rule out an internet routing problem, let's see if we can connect directly to the GA IPs:
netcat -vz 13.248.212.111 443
netcat -vz 76.223.92.165 443
Finally, it'd be helpful to understand the route to both GA IPs:
traceroute 13.248.212.111
traceroute 76.223.92.165
I recognize that the traceroute output may be more than you're comfortable sharing in a public forum; if everything else is working as expected and you'd rather not share, that's completely fine. If you're having trouble reaching the GA IPs, though, it'd be very helpful to understand the routes involved, and we can try to work out a more private channel for sharing that information. Let's cross that bridge if we get to it, though.
Whew.
Again, please accept my thanks for helping us dig into this issue! To recap, the main thrust here is that we're testing a hypothesis that the routes themselves are fine and that the same traffic traveling along the same route with different domain names will wind up with different results. With that information, we can narrow down where, in the vast internet, this problem is really happening.
Cheers!
Thank you for following up! I'm sorry to hear that things still aren't working, but this is helpful information that helps us close in on the root cause.
No worries! Thanks to you for your help.
My current hypothesis is that some mobile providers are actually blocking these specific domains, and it's not actually a routing issue at this point.
Honestly, I don't think that's the problem. This explanation by @gram-signal sounded more likely:
Went back and forth a bit with AWS; we believe there is some issue in the path of its routing of the requests through to our backend servers. We utilize a Global Accelerator to pull cross-region traffic over AWS' backbone to our serving region, then an Elastic Load Balancer to choose which server gets the request, and our current hypothesis is that the issue lies in one of these pieces of infra.
Anyway, I'm in no position to verify, so I'll still help debug this; but it wouldn't make sense to actively block a single domain in a geographic region (Italy), if that's what you're suggesting.
To test that hypothesis, I'm afraid I have to ask for some more information and testing.
First, can you please share the output of dig +short textsecure-service.whispersystems.org and dig +short chat.signal.org (via the affected mobile connection, of course)? If those domains are, indeed, pointing to the same IP addresses, that strongly suggests the problem isn't in the route, but has something to do with the domain name itself. If not, there may be something with our "route around the obstacle" strategy.
I had already explained the situation here. chat.signal.org and textsecure-service.whispersystems.org do point to the same IPs.
What seems to have changed now (I don't know if this is part of what you did to bypass the problem) is that instead of pointing to 13.248.212.111 / 76.223.92.165, textsecure-service.whispersystems.org is now a CNAME for chat.signal.org, which in turn resolves to a bunch of AWS nodes.
Here are the outputs:
$ dig +short textsecure-service.whispersystems.org
chat.signal.org.
3.217.34.249
52.202.220.55
54.227.133.78
52.86.184.26
52.206.227.248
34.196.10.25
3.225.46.143
52.54.79.162
$ dig +short chat.signal.org
18.235.44.161
35.171.56.38
52.22.161.181
35.175.8.205
3.221.216.251
54.165.62.59
54.227.133.78
54.147.219.6
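A side note on the two answer sets above: since textsecure-service.whispersystems.org is now a CNAME for chat.signal.org, both queries presumably draw 8-record slices from the same rotating ELB pool, so the lists need not match exactly; an overlapping record is a hint that they do share a pool. Checking with the addresses as reported above (a sketch, assuming that reading):

```python
# A-records returned for each name in the dig output above.
textsecure = {"3.217.34.249", "52.202.220.55", "54.227.133.78", "52.86.184.26",
              "52.206.227.248", "34.196.10.25", "3.225.46.143", "52.54.79.162"}
chat = {"18.235.44.161", "35.171.56.38", "52.22.161.181", "35.175.8.205",
        "3.221.216.251", "54.165.62.59", "54.227.133.78", "54.147.219.6"}

shared = textsecure & chat
print(shared)  # → {'54.227.133.78'}: the slices overlap, consistent with one pool
```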
Next, let's make sure that the DNS entry for our global accelerator instance is working as expected: dig +short ac88393aca5853df7.awsglobalaccelerator.com. I expect you'll find two IP addresses in the response, but let's double-check to make sure.
This does resolve to the original IPs I mentioned in my first debug round:
dig +short ac88393aca5853df7.awsglobalaccelerator.com
13.248.212.111
76.223.92.165
To rule out an internet routing problem, let's see if we can connect directly to the GA IPs:
netcat -vz 13.248.212.111 443
netcat -vz 76.223.92.165 443
Positive. As explained here, no problem connecting directly to the IPs OR to chat.signal.org
Finally, it'd be helpful to understand the route to both GA IPs:
traceroute 13.248.212.111
traceroute 76.223.92.165
I recognize that the traceroute output may be more than you're comfortable sharing in a public forum; if everything else is working as expected and you'd rather not share, that's completely fine. If you're having trouble reaching the GA IPs, though, it'd be very helpful to understand the routes involved, and we can try to work out a more private channel for sharing that information. Let's cross that bridge if we get to it, though.
Whew.
Again, please accept my thanks for helping us dig into this issue! To recap, the main thrust here is that we're testing a hypothesis that the routes themselves are fine and that the same traffic traveling along the same route with different domain names will wind up with different results. With that information, we can narrow down where, in the vast internet, this problem is really happening.
Cheers!
As explained here, this happens across different ISPs, most of which have entirely separate network infrastructure (they don't even connect to the same IXPs).
The only common thing is geographical location as far as I can tell.
I want to stress also that this issue started on Dec 9th and never manifested before. As far as I know, that's the same day AWS had problems.
I didn't report it straight away because at first I thought to give it time, and then I didn't know so many people were affected.
When I started to receive reports from family, friends and colleagues that "the app you made me install, Signal, is unreliable", I started investigating and opened the ticket here.
Thanks again for following up! There's some new information here, but you've also expressed some concerns. Let's review the situation as a whole to make sure we're on the same page, then try to move forward with the debugging process.
You wrote:
This explanation by @gram-signal sounded more likely...
...there is some issue in the path of its routing of the requests through to our backend servers.
I agree! That certainly did sound more likely until we gathered more evidence that makes it seem less likely now. I know I made a mistake earlier in updating our "route around GA" rules yesterday, but I promise I have been following and understand the conversation up to this point.
To recap, initial testing showed that you could connect to Signal's infrastructure by using chat.signal.org or by connecting directly to the load balancer, but not by going through textsecure-service.whispersystems.org. In other words:
| Endpoint | openssl s_client -connect result |
| --- | --- |
| textsecure-service.whispersystems.org:443 | Timeout |
| Signal-Production-956696213.us-east-1.elb.amazonaws.com:443 | Success |
| chat.signal.org:443 | Success |
| 76.223.92.165:443 | Success |
| 13.248.212.111:443 | Success |
We also verified in https://github.com/signalapp/Signal-Android/issues/11839#issuecomment-992595371 that textsecure-service.whispersystems.org resolved to:

- 76.223.92.165
- 13.248.212.111

...and those are our Global Accelerator addresses.
At this point, we made a change such that chat.signal.org points directly to our load balancer (Signal-Production-956696213.us-east-1.elb.amazonaws.com) instead of the global accelerator (76.223.92.165/13.248.212.111) for users in Italy. We also changed textsecure-service.whispersystems.org to be a CNAME record that points to chat.signal.org. It does appear that those changes are visible to your local nameserver, and that our domains are now pointing at the load balancer instead of the global accelerator for you.
Now, having made those changes, my understanding of the testing matrix is that we're still looking at:
| Endpoint | openssl s_client -connect result |
| --- | --- |
| textsecure-service.whispersystems.org:443 | Timeout |
| Signal-Production-956696213.us-east-1.elb.amazonaws.com:443 | Success |
| chat.signal.org:443 | Success |
| 76.223.92.165:443 | Success |
| 13.248.212.111:443 | Success |
In other words, we've verified that the domains in question are indeed pointing at new targets, but the results have not changed. Between this and the previous round of testing, we're able to demonstrate that we can connect to Signal's servers in several different ways:

- By connecting directly to the load balancer
- By connecting through the global accelerator
- By connecting through a domain that's pointed to the global accelerator (chat.signal.org before yesterday's changes)
- By connecting through a domain that's pointed to the load balancer (chat.signal.org after yesterday's changes)

We also believe that several other domains remain similarly unreachable, including:
https://cdn.signal.org
https://cdn2.signal.org
https://storage.signal.org
What's exceptionally strange about that situation is that those domains are not only not behind our load balancer or global accelerator, they're hosted by different providers entirely (i.e. not AWS). That said, you wrote:
As explained here, this happens across different ISPs, most of which have entirely separate network infrastructure (they don't even connect to the same IXPs).
I promise I understand that, and I'm just as puzzled as you are! Not only do we have multiple providers in Italy having difficulty connecting, they're having difficulty connecting to multiple different services on our end. That's very, very strange.
At this point, I don't want to speculate too much about what's going on (the point is still to form hypotheses and test them with evidence), but I do want to assure you that I am and have been reading everything carefully and I hear you. With that in mind, it may be that these mobile network operators share a common vendor for (again, speculating wildly) spam filtering software, and that third-party vendor has mistakenly added these domains to the list. I don't mean to assert that's what is happening, but I don't think it's impossible.
Again, the thing that would really, really help us understand what's going on here is traceroute output. Would you be willing to send the output of the following to me directly?
traceroute textsecure-service.whispersystems.org
traceroute Signal-Production-956696213.us-east-1.elb.amazonaws.com
traceroute chat.signal.org
traceroute ac88393aca5853df7.awsglobalaccelerator.com
traceroute 76.223.92.165
traceroute 13.248.212.111
If so, please feel free to email me at [my first name] at signal.org!
Once again, thanks for your patience and help in debugging all this. I know it's been a long road, and I understand that Signal is just plain not working for you and your contacts right now. We'll get through this.
Thanks again for following up! There's some new information here, but you've also expressed some concerns. Let's review the situation as a whole to make sure we're on the same page, then try to move forward with the debugging process.
Thanks again to you, Jon. Before we move on, I feel the need to apologize: I get from your response that I somehow conveyed the wrong message and implied something negative with my comment about what is more likely or not. That was not my intention at all; English is not my mother tongue, and if I offended you I do apologize. I don't want to cause any misunderstanding about this!
You wrote:
This explanation by @gram-signal sounded more likely...
...there is some issue in the path of its routing of the requests through to our backend servers.
I agree! That certainly did sound more likely until we gathered more evidence that makes it seem less likely now. I know I made a mistake earlier in updating our "route around GA" rules yesterday, but I promise I have been following and understand the conversation up to this point.
To recap, initial testing showed that you could connect to Signal's infrastructure by using chat.signal.org or by connecting directly to the load balancer, but not by going through textsecure-service.whispersystems.org. In other words:

| Endpoint | openssl s_client -connect result |
| --- | --- |
| textsecure-service.whispersystems.org:443 | Timeout |
| Signal-Production-956696213.us-east-1.elb.amazonaws.com:443 | Success |
| chat.signal.org:443 | Success |
| 76.223.92.165:443 | Success |
| 13.248.212.111:443 | Success |

We also verified in #11839 (comment) that textsecure-service.whispersystems.org resolved to:
- 76.223.92.165
- 13.248.212.111
...and those are our Global Accelerator addresses.
At this point, we made a change such that chat.signal.org points directly to our load balancer (Signal-Production-956696213.us-east-1.elb.amazonaws.com) instead of the global accelerator (76.223.92.165/13.248.212.111) for users in Italy. We also changed textsecure-service.whispersystems.org to be a CNAME record that points to chat.signal.org. It does appear that those changes are visible to your local nameserver, and that our domains are now pointing at the load balancer instead of the global accelerator for you.

Now, having made those changes, my understanding of the testing matrix is that we're still looking at:
| Endpoint | openssl s_client -connect result |
| --- | --- |
| textsecure-service.whispersystems.org:443 | Timeout |
| Signal-Production-956696213.us-east-1.elb.amazonaws.com:443 | Success |
| chat.signal.org:443 | Success |
| 76.223.92.165:443 | Success |
| 13.248.212.111:443 | Success |

In other words, we've verified that the domains in question are indeed pointing at new targets, but the results have not changed. Between this and the previous round of testing, we're able to demonstrate that we can connect to Signal's servers in several different ways:
- By connecting directly to the load balancer
- By connecting through the global accelerator
- By connecting through a domain that's pointed to the global accelerator (chat.signal.org before yesterday's changes)
- By connecting through a domain that's pointed to the load balancer (chat.signal.org after yesterday's changes)
Everything you wrote is exactly right.
We also believe that several other domains remain similarly unreachable, including:
https://cdn.signal.org
https://cdn2.signal.org
https://storage.signal.org
Interestingly, these were not unreachable initially, but they are now! (Just checked.)
Here I wrote that some users in my group were experiencing issues with CDN or CDN2, but not consistently.
I have now verified that it is very consistent: they show the same issue as textsecure-service.whispersystems.org.
Since chat.signal.org is the new endpoint in every up-to-date release, as far as I can tell, am I correct in thinking that our inability to properly process media is related to these other addresses?
What's exceptionally strange about that situation is that those domains are not only not behind our load balancer or global accelerator, they're hosted by different providers entirely (i.e. not AWS). That said, you wrote:
As explained here, this happens across different ISPs, most of which have entirely separate network infrastructure (they don't even connect to the same IXPs).
I promise I understand that, and I'm just as puzzled as you are! Not only do we have multiple providers in Italy having difficulty connecting, they're having difficulty connecting to multiple different services on our end. That's very, very strange.
At this point, I don't want to speculate too much about what's going on (the point is still to form hypotheses and test them with evidence), but I do want to assure you that I am and have been reading everything carefully and I hear you. With that in mind, it may be that these mobile network operators share a common vendor for (again, speculating wildly) spam filtering software, and that third-party vendor has mistakenly added these domains to the list. I don't mean to assert that's what is happening, but I don't think it's impossible.
Spam filtering was one of the first hypothesis, if not that when I used packet inspection to see what was going on I immediatley noticed that the behaviour was the one of a DPI firewall.
Having access to a friend engineer on one of the interested networks I asked him to ask this internally and he informally relayed that those packet were not being filtered as far as they were aware. I realize now that this might not being the case because that network is very complex, those engineers (I know for a fact) can't vouch for the whole network, but just for specific subregions.
Also I can add this: on very specific cells on mobile data with MY provider, text textsecure-service.whispersystems.org doesn't fail. I can't run this test on the whole group of people involved thou.
Let me know if you need any details on the things I wasn't very specific about before. I don't know which ones they could be, but if you have questions, please ask.
Again, the thing that would really, really help us understand what's going on here is traceroute output. Would you be willing to send the output of the following to me directly?
traceroute textsecure-service.whispersystems.org
traceroute Signal-Production-956696213.us-east-1.elb.amazonaws.com
traceroute chat.signal.org
traceroute ac88393aca5853df7.awsglobalaccelerator.com
traceroute 76.223.92.165
traceroute 13.248.212.111
If so, please feel free to email me at [my first name] at signal.org!
I'm going to collect the data and email you as soon as possible. It's getting late here in Italy and I'm on family duty, so I'm afraid I'm not sure I'll be able to do it right away!
Once again, thanks for your patience and help in debugging all this. I know it's been a long road, and I understand that Signal is just plain not working for you and your contacts right now. We'll get through this.
Thanks to you!
Before we move on, I feel the need to apologize; I gather from your response that I somehow conveyed the wrong message and implied something negative with my comment about what is more or less likely. That was not my intention at all. English is not my mother tongue, and if I offended you, I do apologize.
No worries at all, friend! I took no offense, but given the circumstances (this ticket changing hands from one engineer to another, the earlier mistake, and the re-treading over ground we had covered earlier in the week), I did see how it could seem like I wasn't paying attention or was just skimming the conversation so far. I hope that I haven't offended you, either!
Your English is so good that it never occurred to me that it might not be your first language ;)
I'm going to collect the data and email you as soon as possible. It's getting late here in Italy and I'm on family duty, so I'm afraid I'm not sure I'll be able to do it right away!
Again, no worries at all! I'll look forward to hearing from you when you have the opportunity.
Thank you for the continued research and all of the additional detail!
No, absolutely, I'm not offended: for what?!
Oh, thanks.
You'll hear from me soon, then!
Jon, I apologize for the delay. I sent you an email.
Friends, it's been a while since we shared an update. I regret that we don't yet have any concrete answers, but I did want to let you all know that we're still actively working on this. Thank you to everybody who has contributed time, attention, and debug data to this whole effort!
@jon-signal I guess somebody else in Italy is having the same issue as there's a detailed write-up here: https://www.reddit.com/r/signal/comments/rzvx9n/signal_app_being_filteredintercepted_by_italian/
Thank you!
Friends, at this point, we're trying to get in touch with a mobile engineer at one of the affected mobile carriers. If you know somebody who works in the network engineering department at Vodafone Italy (or any of the other affected carriers), please let me know! Again, you can reach me at [my first name] at signal dot org.
OONI measurements also show several failures for AS30722.
I have the same issue, in Italy, using ho-mobile on 2 different Android phones (Honor and Poco). Text messages go through, but photos and audio/video calls do not.
[edit - this was a wrong info: A person living near me using a Samsung phone and the same ho-mobile carrier is reporting NOT having that issue, though I could not test that personally.]
Activating a WireGuard tunnel, removing Signal from memory, and restarting it does the trick.
I'd put my hands on the Samsung phone and re-run the checks if I were you, tbh. I had to argue with quite a few people who said it was working before they actually ran the exact steps to reproduce; they were either looking only at messages or running on WiFi with a different ISP and/or a VPN.
Regarding WireGuard, that makes sense: it means you're "bypassing" the ho-mobile network inspection, so they can't touch the TLS handshake at all. Any VPN would fix the issue, as they all encapsulate the data in an encrypted tunnel that is invisible to ho-mobile's network inspection tools.
You're right. I asked them to re-test with WiFi off, and I can confirm that only texts are working; images don't go through. This is with ho.mobile and a Samsung phone, in Italy.
I don't have an exact date, but I'd say the issue appeared around mid December.
I can confirm the behavior (messages pass, attachments do not) and the approximate time the outage began. I can reproduce the issue on a Xiaomi phone. I tried configuring the phone with the Xiaomi-specific APN settings [1] but the change made no difference.
It's SNI-based blocking: if the SNI in the client_hello contains anything within whispersystems.org, the session is killed at the TCP level.
That's why chat.signal.org works.
This has nothing to do with routing, traceroute, or DNS resolution.
Use openssl's servername argument to test different SNI values:
openssl s_client -connect textsecure-service.whispersystems.org:443 -servername google.com
SNI values:
google.com --> WORKS
chat.signal.org --> WORKS
textsecure-service1.whispersystems.org --> FAIL
textsecure-service.whispersystems1.org --> WORKS
Without SNI (-noservername) it also WORKS fine.
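For convenience, the openssl tests above can also be scripted. The following is a hedged sketch (not an official Signal tool) using Python's ssl module: the connect target stays fixed while only the SNI (`server_hostname`) varies, so on an affected carrier the whispersystems.org SNI values would be expected to fail during the handshake.

```python
# Sketch: reproduce the openssl SNI probe programmatically.
# Only the SNI string varies; the TCP connect target is fixed.
import socket
import ssl

HOST = "textsecure-service.whispersystems.org"

def sni_handshake_ok(sni, host=HOST, port=443, timeout=5):
    """Return True if a TLS handshake completes when presenting `sni`."""
    ctx = ssl.create_default_context()
    # We only care whether the handshake is killed mid-flight,
    # not whether the certificate matches the spoofed SNI.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=sni):
                return True
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    for sni in ("google.com",
                "chat.signal.org",
                "textsecure-service1.whispersystems.org",
                "textsecure-service.whispersystems1.org"):
        print(sni, "WORKS" if sni_handshake_ok(sni) else "FAIL")
```

Running it on an affected mobile connection versus WiFi or a VPN should show the same WORKS/FAIL split reported above.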
It appears TTL could also be a factor; further tests are necessary.
Reverse check: let's connect to google.com instead of Signal, but use the whispersystems.org SNI.
Connecting to google.com:443 with SNI www.whispersystems.org fails:
openssl s_client -connect google.com:443 -servername www.whispersystems.org
Connecting to google.com:443 with SNI www.whispersystems1.org works:
openssl s_client -connect google.com:443 -servername www.whispersystems1.org
This should remove any shadow of a doubt that this is an SNI-based block of the entire whispersystems.org domain.
It's not only the whole *.whispersystems.org domain; check out also:
cdn.signal.org
cdn2.signal.org
storage.signal.org
So some signal.org subdomains are blocked too, FYI.
Correct, and for the record, any single character after cdn is included, so cdn9.signal.org, cdnA.signal.org, and cdnz.signal.org are all blocked.
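The single-character cdn? pattern can be spot-checked along the same lines as the openssl reverse check. This is a hypothetical sketch: it connects to google.com while varying only the SNI, so on an affected carrier the cdn?.signal.org values would be expected to fail.

```python
# Sketch: probe single-character cdn?.signal.org SNIs against a fixed
# host (google.com), mirroring the openssl reverse check above.
import socket
import ssl

def handshake_ok(sni, host="google.com", port=443, timeout=5):
    """Return True if a TLS handshake completes when presenting `sni`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # only the handshake outcome matters,
    ctx.verify_mode = ssl.CERT_NONE  # not certificate validity for the SNI
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=sni):
                return True
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    for c in "29Az":
        sni = "cdn%s.signal.org" % c
        print(sni, "WORKS" if handshake_ok(sni) else "FAIL")
```

Extending the loop over a larger character set would map out exactly which wildcard entries are in the block list.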
Hello, and thanks for this additional debugging! While I realize I should have posted a more complete summary of the situation sooner, this matches our current diagnosis and understanding of the problem.
I think it's fair to say we're firmly out of the "diagnosis" phase of things and squarely into the "get it fixed" phase. I emphasize that our main goal right now is to get in touch with somebody at one of the affected mobile carriers; if you have any points of contact you can share, please email me at [my first name] at signal dot org.
Bug description
For the past few days, Signal has stopped working for me, my wife, and a group of friends and family members on different mobile networks in Italy. As soon as the device is connected to WiFi, messages get routed; otherwise there's no communication in or out.
A friend even tried to uninstall and reinstall Signal, and it didn't let him re-register, saying "no network".
This happens on a range of devices from different vendors (Samsung, Xiaomi, Huawei), different Android versions (8, 9, 10 and 11), and different data and WiFi providers. The only common factors are the location (Italy) and the app version (5.27.13).
Steps to reproduce
Actual result: Spinning icon until connected to WiFi. No messages delivered in or out.
Expected result: Messages should be delivered.
Device info
Device: Samsung, Xiaomi, Huawei (various)
Android version: 8, 9, 10 and 11
Signal version: 5.27.13
Link to debug log
I'm attaching the debug log from one of the devices; the errors are the same across all devices (I think the SSL handshake fails).
https://debuglogs.org/b8d5664acd597b3447f7c36a53fd91161b7420c611a149c8b2a86090220c85fb