ooni / probe

OONI Probe network measurement tool for detecting internet censorship
https://ooni.org/install
BSD 3-Clause "New" or "Revised" License

Web Connectivity LTE: AS30722: web.wechat.com: generic_timeout_error during dnslookup #2292

Closed bassosimone closed 8 months ago

bassosimone commented 2 years ago

TL;DR The Web Connectivity LTE implementation reveals a bug in the implementation of "Vodafone rete sicura", a security service based on DNS interception that runs in home routers inside AS30722 (Vodafone Italia). The bug is that the router does not correctly propagate a DNS response with no error and no answers for the AAAA query for web.wechat.com. As a result, every AAAA query I have run systematically times out. Sometimes this also affects getaddrinfo, which is nonetheless fine in many cases, albeit slow, because it times out the AAAA query and just returns the A addresses.


I am going to document why, when running measurements from my system and AS (AS30722), I often see a generic_timeout_error when resolving web.wechat.com with Web Connectivity LTE. While this problem may just be my problem, it still seems useful to investigate and attempt to explain what happens.

Let us start with a Web Connectivity LTE measurement which times out during getaddrinfo. Here's another measurement where the AAAA query times out. While those two measurements seem unrelated, I have "Vodafone rete sicura" enabled. This is a feature where my router intercepts all DNS queries regardless of the UDP/IP destination endpoint.

My first conclusion is that getaddrinfo, as implemented on my macOS, and OONI's DNS-over-UDP resolver are fundamentally different implementations, yet we end up observing the same issue with both: a timeout. Therefore, this issue does not seem to depend on the client-side implementation.

Now, from the second measurement, we can speculate that resolving AAAA is the main issue. We can gain more confidence on this topic by running host from the command line and capturing packets:

% host web.wechat.com
web.wechat.com is an alias for web1.wechat.com.
web1.wechat.com has address 203.205.251.163
web1.wechat.com has address 203.205.251.169
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached

Here's the corresponding packet capture (output edited for readability):

% tshark -r wechat-dns-apple-vodafone.pcapng

    1   0.000000  192.168.1.4  192.168.1.1   DNS 74 64 \
        Standard query 0x4b93 A web.wechat.com

    2   0.012312  192.168.1.1  192.168.1.4   DNS 135 64 \
        Standard query response 0x4b93 A web.wechat.com \
        CNAME web1.wechat.com A 203.205.251.163 A 203.205.251.169

    3   0.012740  192.168.1.4  192.168.1.1   DNS 75 64 \
        Standard query 0x870e AAAA web1.wechat.com

    4   5.017725  192.168.1.4  192.168.1.1   DNS 75 64 \
        Standard query 0x870e AAAA web1.wechat.com

    5  10.018615  192.168.1.4  192.168.1.1   DNS 75 64 \
         Standard query 0x0490 MX web1.wechat.com

    6  15.023369  192.168.1.4  192.168.1.1   DNS 75 64 \
        Standard query 0x0490 MX web1.wechat.com

    7  20.953752  192.168.1.1  192.168.1.4   DNS 75 64 \
        Standard query response 0x870e Server failure AAAA web1.wechat.com

So, we see that, interestingly, queries are issued serially. The query for A completes. It is also interesting to note that the query for AAAA now asks for web1.wechat.com (the CNAME) as opposed to web.wechat.com. It's also interesting to see a timeout after five seconds, which causes a retransmission. When the AAAA query times out again (after 10 seconds), host moves on to an MX query, which is itself retransmitted (after 15 seconds). The only response arrives after 20 seconds, and it's very interesting to see that it is a SERVFAIL for the AAAA query.
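The timing pattern in the capture can be made explicit with a small sketch. The timestamps and transaction IDs below are copied from the tshark output above; the `summarize` and `describe` helpers are hypothetical names introduced only for this illustration.

```python
# Timestamps (seconds), direction, query type, and transaction ID,
# copied from the tshark output above.
PACKETS = [
    (0.000000, "query", "A", 0x4B93),
    (0.012312, "response", "A", 0x4B93),
    (0.012740, "query", "AAAA", 0x870E),
    (5.017725, "query", "AAAA", 0x870E),
    (10.018615, "query", "MX", 0x0490),
    (15.023369, "query", "MX", 0x0490),
    (20.953752, "response", "AAAA", 0x870E),
]


def summarize(packets):
    """Group packets by DNS transaction ID (hypothetical helper)."""
    summary = {}
    for ts, kind, qtype, txid in packets:
        entry = summary.setdefault(
            txid, {"qtype": qtype, "sent": [], "answered": None}
        )
        if kind == "query":
            entry["sent"].append(ts)
        elif entry["answered"] is None:
            entry["answered"] = ts
    return summary


def describe(summary):
    """Render one line per transaction: attempts and time to first response."""
    lines = []
    for txid, entry in summary.items():
        first = entry["sent"][0]
        if entry["answered"] is None:
            status = "no response"
        else:
            status = f"response after {entry['answered'] - first:.3f} s"
        lines.append(
            f"0x{txid:04x} {entry['qtype']}: "
            f"{len(entry['sent'])} attempt(s), {status}"
        )
    return lines


if __name__ == "__main__":
    for line in describe(summarize(PACKETS)):
        print(line)
```

Grouping by transaction ID shows that the AAAA query is sent twice, about five seconds apart, and that the SERVFAIL comes back roughly 20.9 seconds after the first AAAA attempt, while the MX query never receives a response within the capture.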

What OONI's DNS-over-UDP resolver does is slightly different, because we query in parallel, though we can format the measurement where AAAA failed to look like the above packet capture:

1   0.001701  ????:??  8.8.8.8:53  DNS  0  0 \
    Standard query 0x???? A web.wechat.com

2   0.349347  8.8.8.8:53  ????:??  DNS  0  0 \
    Standard query response 0x???? A web.wechat.com \
    CNAME web1.wechat.com. A 203.205.251.163 A 203.205.251.169

3   0.001748  ????:??  8.8.8.8:53  DNS  0  0 \
    Standard query 0x???? AAAA web.wechat.com

<< query 3 timeout after 5.00277 s >>
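To reproduce this kind of check outside of OONI, one option is to send the A and AAAA queries in parallel over UDP and record the outcome of each. The following is a minimal sketch under stated assumptions, not OONI's resolver: it uses only the Python standard library, encodes queries by hand following RFC 1035, and all function names (`build_query`, `parse_header`, `lookup`, `lookup_parallel`) are hypothetical. The 5-second timeout and the 8.8.8.8:53 endpoint mirror the values seen above.

```python
import concurrent.futures
import secrets
import socket
import struct

QTYPE_A, QTYPE_AAAA = 1, 28


def build_query(name, qtype, txid=None):
    """Encode a minimal DNS query (RFC 1035): header plus one question."""
    if txid is None:
        txid = secrets.randbits(16)
    # Flags 0x0100: standard query with recursion desired; one question.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def parse_header(reply):
    """Return (txid, rcode, answer_count) from a raw DNS reply."""
    txid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return txid, flags & 0x000F, ancount


def lookup(name, qtype, server=("8.8.8.8", 53), timeout=5.0):
    """Send one query over UDP and return a short outcome string."""
    query = build_query(name, qtype)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, server)
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
    txid, rcode, ancount = parse_header(reply)
    if txid != struct.unpack("!H", query[:2])[0]:
        return "mismatched reply"
    if rcode != 0:
        return f"rcode {rcode}"
    return "answers" if ancount else "no answer"


def lookup_parallel(name, **kwargs):
    """Run the A and AAAA lookups concurrently, one thread per query."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            label: pool.submit(lookup, name, qtype, **kwargs)
            for label, qtype in (("A", QTYPE_A), ("AAAA", QTYPE_AAAA))
        }
        return {label: future.result() for label, future in futures.items()}
```

Calling `lookup_parallel("web.wechat.com")` returns a dictionary with one outcome per query type. Because the two queries run concurrently, a timeout on AAAA does not delay the A result, which is the main difference from the serial behavior of host shown earlier.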

I ran a couple of dnsping measurements with these report IDs:

By looking at these measurements, I can confirm the pattern where the AAAA query for web.wechat.com always times out when using the DNS-over-UDP resolver. But let's not be fooled by the fact that we are using a custom resolver. As I mentioned before, because of "Vodafone rete sicura", my router is intercepting all my queries anyway.

In fact, to provide more evidence about "Vodafone rete sicura", let's inspect who is answering our queries:

% ./miniooni -i dnslookup://whoami.v4.powerdns.org urlgetter -O ResolverURL=udp://8.8.8.8:53
[      0.000082] <info> Current time: 2022-09-12 09:06:03 CEST
...
[      0.823970] <info> [1/1] running with input: dnslookup://whoami.v4.powerdns.org
[      0.846707] <info> submitting measurement to OONI collector; please be patient...
[      0.877531] <info> New reportID: 20220912T070604Z_urlgetter_IT_30722_n1_T8cioKFlbWVSTofE
...

This produces a measurement whose queries contain:

[Screenshot 2022-09-12 at 09 08 04]

So, yeah, Vodafone is indeed hijacking queries and answering them.

Now, it would be interesting to disable "Vodafone rete sicura" and see what happens. It is also relevant to ask whether this behavior may be causing false positives for other users. It would also be interesting to see if we can observe the same problem using Web Connectivity v0.4.x.

So, measurements run today between 06:30 AM UTC and 07:21 AM UTC were most likely all run by me while I was authoring this issue. These measurements look like this:

[Screenshot 2022-09-12 at 09 23 31]

There is clear clustering: Web Connectivity LTE returns DNS-based blocking, whereas Web Connectivity v0.4.x does not return any form of blocking. Now, in light of the oddity I documented above, I need to figure out which of them is right about the DNS results being anomalous.

So, the reason why LTE always says there's a DNS anomaly is that AAAA always times out. Because Web Connectivity v0.4.x does not include a UDP resolver, we don't see this anomaly. Generally, getaddrinfo is able to conclude that we're good by timing out the AAAA subquery and returning the A results. Hence v0.4.x is always green.
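The difference between the two verdicts can be illustrated with a deliberately simplified sketch. This is not OONI's analysis code: the function names, the `LOOKUPS` structure, and the two policies are assumptions made for illustration, and the sketch ignores everything else the real analysis considers. Only the failure string generic_timeout_error and the IP addresses come from the measurements discussed above.

```python
# Hypothetical, simplified view of the lookups in one measurement.
# A failure of None means the lookup succeeded.
LOOKUPS = {
    "getaddrinfo": {
        "failure": None,
        "addresses": ["203.205.251.163", "203.205.251.169"],
    },
    "udp A": {
        "failure": None,
        "addresses": ["203.205.251.163", "203.205.251.169"],
    },
    "udp AAAA": {
        "failure": "generic_timeout_error",
        "addresses": [],
    },
}


def getaddrinfo_only_analysis(lookups):
    """Assumed v0.4.x-like policy: only the getaddrinfo outcome matters."""
    return "anomaly" if lookups["getaddrinfo"]["failure"] else "ok"


def any_failure_analysis(lookups):
    """Assumed LTE-like policy: any failed lookup flags a DNS anomaly."""
    failed = [name for name, result in lookups.items() if result["failure"]]
    return "anomaly" if failed else "ok"


if __name__ == "__main__":
    print("getaddrinfo only:", getaddrinfo_only_analysis(LOOKUPS))
    print("any failure:", any_failure_analysis(LOOKUPS))
```

With the same set of lookups, the first policy reports no problem while the second reports an anomaly, which matches the clustering seen in the measurements.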

Let us now disable "Vodafone rete sicura" and run dnsping again.

So, now that I have disabled this COMPLETELY FINE FEATURE 🔥 of my Internet access, I see a different behavior where I receive a dns_no_answer from AAAA queries for web.wechat.com. See for example, this Web Connectivity LTE measurement, this urlgetter measurement showing that now Google answers our queries and this dnsping measurement showing no timeouts.

Now I need to go back to a lake of sadness and re-enable "Vodafone rete sicura". This feature is actually useful to me because it is a nice integration test for DNS middleboxes. 🥶

To conclude this investigation, we need to determine whether flagging these measurements as anomalies is a bug or a feature of Web Connectivity LTE. I will revisit this topic soon and perhaps discuss it with colleagues. For now, I will not change the implementation to special-case this "Vodafone rete sicura" bug.

Update: oh, I realized that probably "rete sicura" is just a rule that DNATs traffic to port 53 of the home router and so most likely the real issue is in the implementation of the home router.

bassosimone commented 8 months ago

I have retested the same website using Web Connectivity v0.5. Here's the measurement. It now says okay because we use a classic linear analysis (i.e., only follow getaddrinfo) to produce results that are backwards compatible. It still says that it's not super convinced by this timeout, and I think there's a lesson here. We should be careful about saying that any anomaly leads to a final anomaly, but we should also acknowledge that these anomalies reveal middleboxes. It's hard to reach a final conclusion about what is 100% right to do here, so I guess I'll just increment the good enough counter and move on. 🦊