ooni / probe

OONI Probe network measurement tool for detecting internet censorship
https://ooni.org/install
BSD 3-Clause "New" or "Revised" License
754 stars 142 forks source link

android: properly handle NXDOMAIN errors #2029

Open hellais opened 2 years ago

hellais commented 2 years ago

We noticed that certain NXDOMAIN errors are not properly handled by the engine.

Here is a sample measurement that results in an unknown_failure: https://explorer.ooni.org/measurement/20220215T025431Z_webconnectivity_HK_9231_n1_oU1CBKrQbHFoou5h?input=http%3A%2F%2Fwww.hongkongwatch.org

bassosimone commented 2 years ago

It seems this problem is interesting enough to deserve some in-depth level of double checking. We need to figure out whether there are some specifics that make it work somewhere or whether it's all broken.

Let's use as test case the https://thefoobarbay.com/ URL and web_connectivity as the experiment. From my network, AS30722, it's pretty clear that the domain does not exist. So, it should be a good proxy of the original issue.

So, the first objective is to figure our in which platforms does it happen.

software platform engine_version result report
ooniprobe-android android 3.14.0-beta unknown failure #
miniooni windows 3.15.0-alpha nxdomain #
miniooni linux 3.15.0-alpha nxdomain #
miniooni darwin 3.15.0-alpha nxdomain #
ooniprobe darwin 3.14.0-beta nxdomain #
ooniprobe linux 3.14.0-beta nxdomain #
ooniprobe windows 3.14.0-beta nxdomain #
ooniprobe-ios ios 3.14.0-beta nxdomain #

So, this seems to be an Android-only problem. This begs two questions: (1) what is the root cause and (2) how to test it.

bassosimone commented 2 years ago

So, debugging this issue is not immediately straightforward. I think it makes sense for me to explain how I did it.

The first thing to do is to start an emulator using Android studio.

Then, you need root access (IIUC this is mainly to be able to write the disk):

adb root

Then, you need to compile miniooni for Android. In my case, I'll compile for Android/i686 because my emulator is using this architecture. We also need to ensure we're linking with libc. There's two reasons why we need to do that. The first reason is that we're trying to debug/test getaddrinfo, so we need to be using it. The second, lesser reason is that miniooni compiled just for linux/386 does not work. This seems to be related with the Go resolver, but I am not sure. (Since this is a sidetrack, I'll try to investigate this specific aspect later, once I've understood the main issue.)

GOOS=android GOARCH=386 CGO_ENABLED=1 CC=$HOME/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android30-clang LD=$HOME/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/ld  go build -v ./internal/cmd/miniooni

Then you need to copy the miniooni binary:

$HOME/sdk/ooni-android/platform-tools/adb push miniooni /cache

Then you need to shell into Android:

~/sdk/ooni-android/platform-tools/adb shell

At this point it's straightforward:

cd /cache
chmod +x miniooni
./miniooni -i https://thefoobarbay.com web_connectivity

And here's the measurement: https://explorer.ooni.org/measurement/20220221T135225Z_webconnectivity_IT_30722_n1_ersvWVUkwEUJf22g?input=https%3A%2F%2Fthefoobarbay.com

So, the nice fact about this is that we can reproduce using miniooni. So the dev cycle here is much faster.

bassosimone commented 2 years ago

To gather more information about what is happening, here's a simple test program:

#include <sys/types.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <netdb.h>

int main() {
    struct addrinfo *res;
    struct addrinfo hints;
    memset(&hints, 0, sizeof (hints));
    hints.ai_flags |= AI_CANONNAME;
    int r = getaddrinfo("thefoobarbay.com", NULL, &hints, &res);
    if (r != 0) {
        fprintf(stderr, "failure: %d %s\n", r, gai_strerror(r));
        exit(1);
    }
    struct addrinfo *aip = res;
    for (; aip != NULL; aip = aip->ai_next) {
        fprintf(stderr, "- %p\n", aip);
    }
    freeaddrinfo(res);
    return 0;
}

We can compile using

~/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android30-clang -Wall -Wextra getaddrinfo.c

After copying and running on the device, as we did above, we obtain:

./a.out
failure: 7 No address associated with hostname

Now, the interesting thing is that, when running the same small program locally on GNU/Linux, I get:

./a.out 
failure: -2 Name or service not known

If we read the resolv.conf used by Android in netdb.h, we see:

/*
 * Error return codes from getaddrinfo()
 */
#define EAI_ADDRFAMILY   1      /* address family for hostname not supported */
#define EAI_AGAIN        2      /* temporary failure in name resolution */
#define EAI_BADFLAGS     3      /* invalid value for ai_flags */
#define EAI_FAIL         4      /* non-recoverable failure in name resolution */
#define EAI_FAMILY       5      /* ai_family not supported */
#define EAI_MEMORY       6      /* memory allocation failure */
#define EAI_NODATA       7      /* no address associated with hostname */
#define EAI_NONAME       8      /* hostname nor servname provided, or not known */
#define EAI_SERVICE      9      /* servname not supported for ai_socktype */
#define EAI_SOCKTYPE    10      /* ai_socktype not supported */
#define EAI_SYSTEM      11      /* system error returned in errno */
#define EAI_BADHINTS    12      /* invalid value for hints */
#define EAI_PROTOCOL    13      /* resolved protocol is unknown */
#define EAI_OVERFLOW    14      /* argument buffer overflow */
#define EAI_MAX         15

whereas, if we read my GNU/Linux system's netdb.h, we see:

/* Error values for `getaddrinfo' function.  */
# define EAI_BADFLAGS     -1    /* Invalid value for `ai_flags' field.  */
# define EAI_NONAME       -2    /* NAME or SERVICE is unknown.  */
# define EAI_AGAIN        -3    /* Temporary failure in name resolution.  */
# define EAI_FAIL         -4    /* Non-recoverable failure in name res.  */
# define EAI_FAMILY       -6    /* `ai_family' not supported.  */
# define EAI_SOCKTYPE     -7    /* `ai_socktype' not supported.  */
# define EAI_SERVICE      -8    /* SERVICE not supported for `ai_socktype'.  */
# define EAI_MEMORY       -10   /* Memory allocation failure.  */
# define EAI_SYSTEM       -11   /* System error returned in `errno'.  */
# define EAI_OVERFLOW     -12   /* Argument buffer overflow.  */
# ifdef __USE_GNU
#  define EAI_NODATA      -5    /* No address associated with NAME.  */
#  define EAI_ADDRFAMILY  -9    /* Address family for NAME not supported.  */
#  define EAI_INPROGRESS  -100  /* Processing request in progress.  */
#  define EAI_CANCELED    -101  /* Request canceled.  */
#  define EAI_NOTCANCELED -102  /* Request not canceled.  */
#  define EAI_ALLDONE     -103  /* All requests done.  */
#  define EAI_INTR        -104  /* Interrupted by a signal.  */
#  define EAI_IDN_ENCODE  -105  /* IDN encoding failed.  */

So, we can conclude that on Android getaddrinfo is returning EAI_NODATA while on GNU/Linux it's EAI_NONAME.

Now, I think this should clarify what is the actual problem. Consider how Go handles this result value:

    if gerrno != 0 {
        isErrorNoSuchHost := false
        isTemporary := false
        switch gerrno {
        case C.EAI_SYSTEM:
            if err == nil {
                // err should not be nil, but sometimes getaddrinfo returns
                // gerrno == C.EAI_SYSTEM with err == nil on Linux.
                // The report claims that it happens when we have too many
                // open files, so use syscall.EMFILE (too many open files in system).
                // Most system calls would return ENFILE (too many open files),
                // so at the least EMFILE should be easy to recognize if this
                // comes up again. golang.org/issue/6232.
                err = syscall.EMFILE
            }
        case C.EAI_NONAME:
            err = errNoSuchHost
            isErrorNoSuchHost = true
        default:
            err = addrinfoErrno(gerrno)
            isTemporary = addrinfoErrno(gerrno).Temporary()
        }

        return nil, "", &DNSError{Err: err.Error(), Name: name, IsNotFound: isErrorNoSuchHost, IsTemporary: isTemporary}
    }

and it should be clear why we don't get the errNoSuchHost error.

bassosimone commented 2 years ago

So, the simplest fix for this issue seems to add a specific filtering rule for Android that treats the given string specifically. A more structurde fix seems that we call getaddrinfo directly. Then, it becomes our problem to correctly map the results of getaddrinfo to errors. And, also, we have an opportunity of recording the real getaddrinfo return code.

bassosimone commented 2 years ago

Commit https://github.com/ooni/probe-cli/commit/2b48dcfb1a820b2ef28a7d478c3c94f0f8ef4e29 contains an hotfix that should be good enough for the release/3.14 branch. It's clear we want a more comprehensive solution for release/3.15+.

bassosimone commented 2 years ago

We're not going to close this issue now, since we're still working out a solution for master that improves our data quality (see https://github.com/ooni/probe-cli/pull/698#issuecomment-1048849964). But we've done what we needed for release/3.14.

bassosimone commented 2 years ago

For release/3.15, I have cherry-picked the fix originally developed for 3.14: https://github.com/ooni/probe-cli/commit/ce35b64953cc4bf276c45e44e4878c91da72caaf

bassosimone commented 2 years ago

Here's another measurement with this issue: https://explorer.ooni.org/measurement/20220121T025535Z_webconnectivity_PK_17557_n1_s8CkxsxBttmZoFZ1?input=http%3A%2F%2Fgovernmentofbalochistan.blogspot.com%2F (cc: @gurshabad)

bassosimone commented 2 years ago

I am about to merge https://github.com/ooni/probe-cli/pull/764, which provides a theory explaining the issue we were witnessing for Android devices. A theory does not mean that it's necessarily reality, but it's convincing.

Here's the theory:

// 3. Android libc: EAI_NODATA is defined in netdb.h and is not
// protected by any feature flag. The getaddrinfo function (as
// of 4ebdeebef74) calls android_getaddrinfofornet, which in turns
// calls android_getaddrinfofornetcontext. This function will
// eventually call android_getaddrinfo_proxy. If this function
// returns any status code different from EAI_SYSTEM, then bionic
// will return its return value. Otherwise, the code ends up
// calling explore_fqdn, which in turn calls nsdispatch, which
// is what NetBSD is still doing today.
//
// So, android_getaddrinfo_proxy was introduced a long time
// ago on October 28, 2010 by this commit:
//
//     https://github.com/aosp-mirror/platform_bionic/commit/a1dbf0b453801620565e5911f354f82706b0200d
//
// Then a subsequent commit changed android_getaddrinfo_proxy
// to basically default to EAI_NODATA on proxy errors:
//
//     https://github.com/aosp-mirror/platform_bionic/commit/c63e59039d28c352e3053bb81319e960c392dbd4
//
// As of today and 4ebdeebef74, android_getaddrinfo_proxy returns
// one of the following possible return codes:
//
// a) 0 on success;
//
// b) EAI_SYSTEM if it cannot speak to the proxy (which causes the code
// to fall through to the original NetBSD implementation);
//
// c) EAI_NODATA in all the other cases.
//
// The above discussion about Android provides us with a theory that explains the
// https://github.com/ooni/probe/issues/2029 issue. That said, we are still missing
// some bits, e.g., why some Android 6 phones did not experience this problem.
//
// We originally proposed to handle the EAI_NODATA error on Android like it was a
// EAI_NONAME error. However, this mapping seems very inaccurate. Any error inside
// the DNS proxy could cause EAI_NODATA (_unless_ we're "lucky" for some reason
// and the original NetBSD code runs). Therefore, the sanest choice is to introduce
// a new OONI error describing this error condition `android_dns_cache_no_data`
// and handle this error as a special case when checking for NXDOMAIN.

See https://github.com/ooni/probe-cli/pull/764/files#diff-3b2397df28603c7fc86efad2e2e2d7a77b2e2806e7cb74b0eaea110aa1238f25R92 for the complete discussion.

bassosimone commented 2 years ago

After https://github.com/ooni/probe-cli/pull/765, there's one (hopefully last) residual issue: if I cross compile, say, miniooni for windows/amd64, my understanding from the codebase is that I should be using WSA's getaddrinfo via the Go standard library rather than netgo. However, the measurement result would say I am using netgo.

bassosimone commented 2 years ago

To better understand my above comment (https://github.com/ooni/probe/issues/2029#issuecomment-1140723271), let us walk through the source code and see what happens in the standard library when we use the DefaultResolver.

Windows

  1. we call *Resolver.LookupHost in src/net/lookup.go, a file that is not protected by any build tags (https://github.com/golang/go/blob/go1.18.2/src/net/lookup.go#L179)

  2. LookupHost calls *Resolver.lookupHost, which for windows is located in src/net/lookup_windows.go, a file that is not protected by any build tags (https://github.com/golang/go/blob/go1.18.2/src/net/lookup_windows.go#L73)

  3. lookupHost calls lookupIP in src/net/lookup_windows.go (https://github.com/golang/go/blob/go1.18.2/src/net/lookup_windows.go#L85)

  4. lookupIP calls syscalls.GetAddrinfoW

Thus, when CGO is disabled under Windows we end up calling GetAddrinfoW anyway. (It's interesting to note that our implementation calls getaddrinfo, and I am not sure about the implication of calling getaddrinfo versus calling GetAddrinfoW. I suspect the main different would be in the handling of non-ASCII domain names, though it is true that we are applying IDNA before calling getaddrinfo, so probably we're fine.)

From the above observation, it stems that it's very incorrect and misleading to call this resolver "netgo".

Unix

  1. we call *Resolver.LookupHost in src/net/lookup.go, a file that is not protected by any build tags (https://github.com/golang/go/blob/go1.18.2/src/net/lookup.go#L179)

  2. LookupHost calls lookup_host in src/net/lookup_unix.go (https://github.com/golang/go/blob/go1.18.2/src/net/lookup_unix.go#L78)

  3. we may end up calling cgoLookupHost, for which we have two implementations: a stub in src/net/cgo_stub.go and the real implementation in src/net/cgo_unix.go.

  4. the src/net/cgo_stub.go file is guarded by the //go:build !cgo || netgo build flags and its cgoLookupHost implementation returns false (https://github.com/golang/go/blob/go1.18.2/src/net/cgo_stub.go#L19)

  5. when cgoLookupHost returns false, we end up calling goLookupHostOrder in src/net/dnsclient_unix.go, which implements the "netgo" resolver (https://github.com/golang/go/blob/go1.18.2/src/net/dnsclient_unix.go#L530)

In conclusion, when we're under Unix it's proper to call "netgo" the resolver we get when CGO is disabled or the user has compiled using "netgo". That is, both "netgo" and disabling "cgo" only have an impact under Unix.

bassosimone commented 2 years ago

I created https://github.com/ooni/probe/issues/2120 to track the issue that we cannot consider the result of getaddrinfo called on Android reliable unless the result is successful.

bassosimone commented 2 years ago

Most work required by this issue has been done. It remains to merge https://github.com/ooni/probe-cli/pull/698.

bassosimone commented 2 years ago

Ah, wait, another piece of work is to implement the Resolver.LookupHostAndCNAME operation. This operation would return at the same time the addresses and the CNAME for the domain. This is the maximum amount of information one could get using the getaddrinfo backend. For the other resolvers, it's gonna be a bit more tricky to easily expose that but I suppose we could just check whether A and AAAA replies returned a CNAME. It MAY also be useful to pause for a second and think about how to improve measurex in light of recent DNS changes, because it MAY be that exposing this operation as a general DNS resolver operation is not actually needed or useful.

bassosimone commented 2 years ago

(This issue looked like medium priority when we triaged it, but it's actually high priority now.)

bassosimone commented 1 year ago

As of 3.17.0-alpha.1, we correctly detect this issue for Android by using the android_dns_cache_no_data failure. However, we still need to correctly handle this case when processing results in webconnectivity LTE. I am going to add a note to the webconnectivity LTE codebase referring back to this issue comment.