Open hellais opened 2 years ago
It seems this problem is interesting enough to deserve some in-depth level of double checking. We need to figure out whether there are some specifics that make it work somewhere or whether it's all broken.
Let's use as test case the https://thefoobarbay.com/
URL and web_connectivity
as the experiment. From my network, AS30722, it's pretty clear that the domain does not exist. So, it should be a good proxy of the original issue.
So, the first objective is to figure our in which platforms does it happen.
software | platform | engine_version | result | report |
---|---|---|---|---|
ooniprobe-android | android | 3.14.0-beta | unknown failure | # |
miniooni | windows | 3.15.0-alpha | nxdomain | # |
miniooni | linux | 3.15.0-alpha | nxdomain | # |
miniooni | darwin | 3.15.0-alpha | nxdomain | # |
ooniprobe | darwin | 3.14.0-beta | nxdomain | # |
ooniprobe | linux | 3.14.0-beta | nxdomain | # |
ooniprobe | windows | 3.14.0-beta | nxdomain | # |
ooniprobe-ios | ios | 3.14.0-beta | nxdomain | # |
So, this seems to be an Android-only problem. This begs two questions: (1) what is the root cause and (2) how to test it.
So, debugging this issue is not immediately straightforward. I think it makes sense for me to explain how I did it.
The first thing to do is to start an emulator using Android studio.
Then, you need root access (IIUC this is mainly to be able to write the disk):
adb root
Then, you need to compile miniooni
for Android. In my case, I'll compile for Android/i686 because my emulator is using this architecture. We also need to ensure we're linking with libc. There's two reasons why we need to do that. The first reason is that we're trying to debug/test getaddrinfo
, so we need to be using it. The second, lesser reason is that miniooni
compiled just for linux/386
does not work. This seems to be related with the Go resolver, but I am not sure. (Since this is a sidetrack, I'll try to investigate this specific aspect later, once I've understood the main issue.)
GOOS=android GOARCH=386 CGO_ENABLED=1 CC=$HOME/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android30-clang LD=$HOME/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/ld go build -v ./internal/cmd/miniooni
Then you need to copy the miniooni
binary:
$HOME/sdk/ooni-android/platform-tools/adb push miniooni /cache
Then you need to shell into Android:
~/sdk/ooni-android/platform-tools/adb shell
At this point it's straightforward:
cd /cache
chmod +x miniooni
./miniooni -i https://thefoobarbay.com web_connectivity
And here's the measurement: https://explorer.ooni.org/measurement/20220221T135225Z_webconnectivity_IT_30722_n1_ersvWVUkwEUJf22g?input=https%3A%2F%2Fthefoobarbay.com
So, the nice fact about this is that we can reproduce using miniooni
. So the dev cycle here is much faster.
To gather more information about what is happening, here's a simple test program:
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
int main() {
struct addrinfo *res;
struct addrinfo hints;
memset(&hints, 0, sizeof (hints));
hints.ai_flags |= AI_CANONNAME;
int r = getaddrinfo("thefoobarbay.com", NULL, &hints, &res);
if (r != 0) {
fprintf(stderr, "failure: %d %s\n", r, gai_strerror(r));
exit(1);
}
struct addrinfo *aip = res;
for (; aip != NULL; aip = aip->ai_next) {
fprintf(stderr, "- %p\n", aip);
}
freeaddrinfo(res);
return 0;
}
We can compile using
~/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android30-clang -Wall -Wextra getaddrinfo.c
After copying and running on the device, as we did above, we obtain:
./a.out
failure: 7 No address associated with hostname
Now, the interesting thing is that, when running the same small program locally on GNU/Linux, I get:
./a.out
failure: -2 Name or service not known
If we read the resolv.conf
used by Android in netdb.h
, we see:
/*
* Error return codes from getaddrinfo()
*/
#define EAI_ADDRFAMILY 1 /* address family for hostname not supported */
#define EAI_AGAIN 2 /* temporary failure in name resolution */
#define EAI_BADFLAGS 3 /* invalid value for ai_flags */
#define EAI_FAIL 4 /* non-recoverable failure in name resolution */
#define EAI_FAMILY 5 /* ai_family not supported */
#define EAI_MEMORY 6 /* memory allocation failure */
#define EAI_NODATA 7 /* no address associated with hostname */
#define EAI_NONAME 8 /* hostname nor servname provided, or not known */
#define EAI_SERVICE 9 /* servname not supported for ai_socktype */
#define EAI_SOCKTYPE 10 /* ai_socktype not supported */
#define EAI_SYSTEM 11 /* system error returned in errno */
#define EAI_BADHINTS 12 /* invalid value for hints */
#define EAI_PROTOCOL 13 /* resolved protocol is unknown */
#define EAI_OVERFLOW 14 /* argument buffer overflow */
#define EAI_MAX 15
whereas, if we read my GNU/Linux system's netdb.h
, we see:
/* Error values for `getaddrinfo' function. */
# define EAI_BADFLAGS -1 /* Invalid value for `ai_flags' field. */
# define EAI_NONAME -2 /* NAME or SERVICE is unknown. */
# define EAI_AGAIN -3 /* Temporary failure in name resolution. */
# define EAI_FAIL -4 /* Non-recoverable failure in name res. */
# define EAI_FAMILY -6 /* `ai_family' not supported. */
# define EAI_SOCKTYPE -7 /* `ai_socktype' not supported. */
# define EAI_SERVICE -8 /* SERVICE not supported for `ai_socktype'. */
# define EAI_MEMORY -10 /* Memory allocation failure. */
# define EAI_SYSTEM -11 /* System error returned in `errno'. */
# define EAI_OVERFLOW -12 /* Argument buffer overflow. */
# ifdef __USE_GNU
# define EAI_NODATA -5 /* No address associated with NAME. */
# define EAI_ADDRFAMILY -9 /* Address family for NAME not supported. */
# define EAI_INPROGRESS -100 /* Processing request in progress. */
# define EAI_CANCELED -101 /* Request canceled. */
# define EAI_NOTCANCELED -102 /* Request not canceled. */
# define EAI_ALLDONE -103 /* All requests done. */
# define EAI_INTR -104 /* Interrupted by a signal. */
# define EAI_IDN_ENCODE -105 /* IDN encoding failed. */
So, we can conclude that on Android getaddrinfo
is returning EAI_NODATA
while on GNU/Linux it's EAI_NONAME
.
Now, I think this should clarify what is the actual problem. Consider how Go handles this result value:
if gerrno != 0 {
isErrorNoSuchHost := false
isTemporary := false
switch gerrno {
case C.EAI_SYSTEM:
if err == nil {
// err should not be nil, but sometimes getaddrinfo returns
// gerrno == C.EAI_SYSTEM with err == nil on Linux.
// The report claims that it happens when we have too many
// open files, so use syscall.EMFILE (too many open files in system).
// Most system calls would return ENFILE (too many open files),
// so at the least EMFILE should be easy to recognize if this
// comes up again. golang.org/issue/6232.
err = syscall.EMFILE
}
case C.EAI_NONAME:
err = errNoSuchHost
isErrorNoSuchHost = true
default:
err = addrinfoErrno(gerrno)
isTemporary = addrinfoErrno(gerrno).Temporary()
}
return nil, "", &DNSError{Err: err.Error(), Name: name, IsNotFound: isErrorNoSuchHost, IsTemporary: isTemporary}
}
and it should be clear why we don't get the errNoSuchHost
error.
So, the simplest fix for this issue seems to add a specific filtering rule for Android that treats the given string specifically. A more structurde fix seems that we call getaddrinfo
directly. Then, it becomes our problem to correctly map the results of getaddrinfo
to errors. And, also, we have an opportunity of recording the real getaddrinfo
return code.
Commit https://github.com/ooni/probe-cli/commit/2b48dcfb1a820b2ef28a7d478c3c94f0f8ef4e29 contains an hotfix that should be good enough for the release/3.14
branch. It's clear we want a more comprehensive solution for release/3.15+
.
We're not going to close this issue now, since we're still working out a solution for master
that improves our data quality (see https://github.com/ooni/probe-cli/pull/698#issuecomment-1048849964). But we've done what we needed for release/3.14
.
For release/3.15
, I have cherry-picked the fix originally developed for 3.14
: https://github.com/ooni/probe-cli/commit/ce35b64953cc4bf276c45e44e4878c91da72caaf
Here's another measurement with this issue: https://explorer.ooni.org/measurement/20220121T025535Z_webconnectivity_PK_17557_n1_s8CkxsxBttmZoFZ1?input=http%3A%2F%2Fgovernmentofbalochistan.blogspot.com%2F (cc: @gurshabad)
I am about to merge https://github.com/ooni/probe-cli/pull/764, which provides a theory explaining the issue we were witnessing for Android devices. A theory does not mean that it's necessarily reality, but it's convincing.
Here's the theory:
// 3. Android libc: EAI_NODATA is defined in netdb.h and is not // protected by any feature flag. The getaddrinfo function (as // of 4ebdeebef74) calls android_getaddrinfofornet, which in turns // calls android_getaddrinfofornetcontext. This function will // eventually call android_getaddrinfo_proxy. If this function // returns any status code different from EAI_SYSTEM, then bionic // will return its return value. Otherwise, the code ends up // calling explore_fqdn, which in turn calls nsdispatch, which // is what NetBSD is still doing today. // // So, android_getaddrinfo_proxy was introduced a long time // ago on October 28, 2010 by this commit: // // https://github.com/aosp-mirror/platform_bionic/commit/a1dbf0b453801620565e5911f354f82706b0200d // // Then a subsequent commit changed android_getaddrinfo_proxy // to basically default to EAI_NODATA on proxy errors: // // https://github.com/aosp-mirror/platform_bionic/commit/c63e59039d28c352e3053bb81319e960c392dbd4 // // As of today and 4ebdeebef74, android_getaddrinfo_proxy returns // one of the following possible return codes: // // a) 0 on success; // // b) EAI_SYSTEM if it cannot speak to the proxy (which causes the code // to fall through to the original NetBSD implementation); // // c) EAI_NODATA in all the other cases. // // The above discussion about Android provides us with a theory that explains the // https://github.com/ooni/probe/issues/2029 issue. That said, we are still missing // some bits, e.g., why some Android 6 phones did not experience this problem. // // We originally proposed to handle the EAI_NODATA error on Android like it was a // EAI_NONAME error. However, this mapping seems very inaccurate. Any error inside // the DNS proxy could cause EAI_NODATA (_unless_ we're "lucky" for some reason // and the original NetBSD code runs). Therefore, the sanest choice is to introduce // a new OONI error describing this error condition `android_dns_cache_no_data` // and handle this error as a special case when checking for NXDOMAIN.
See https://github.com/ooni/probe-cli/pull/764/files#diff-3b2397df28603c7fc86efad2e2e2d7a77b2e2806e7cb74b0eaea110aa1238f25R92 for the complete discussion.
After https://github.com/ooni/probe-cli/pull/765, there's one (hopefully last) residual issue: if I cross compile, say, miniooni
for windows/amd64
, my understanding from the codebase is that I should be using WSA's getaddrinfo
via the Go standard library rather than netgo
. However, the measurement result would say I am using netgo
.
To better understand my above comment (https://github.com/ooni/probe/issues/2029#issuecomment-1140723271), let us walk through the source code and see what happens in the standard library when we use the DefaultResolver
.
we call *Resolver.LookupHost
in src/net/lookup.go
, a file that is not protected by any build tags (https://github.com/golang/go/blob/go1.18.2/src/net/lookup.go#L179)
LookupHost
calls *Resolver.lookupHost
, which for windows is located in src/net/lookup_windows.go
, a file that is not protected by any build tags (https://github.com/golang/go/blob/go1.18.2/src/net/lookup_windows.go#L73)
lookupHost
calls lookupIP in src/net/lookup_windows.go
(https://github.com/golang/go/blob/go1.18.2/src/net/lookup_windows.go#L85)
lookupIP
calls syscalls.GetAddrinfoW
Thus, when CGO
is disabled under Windows we end up calling GetAddrinfoW
anyway. (It's interesting to note that our implementation calls getaddrinfo
, and I am not sure about the implication of calling getaddrinfo
versus calling GetAddrinfoW
. I suspect the main different would be in the handling of non-ASCII domain names, though it is true that we are applying IDNA before calling getaddrinfo
, so probably we're fine.)
From the above observation, it stems that it's very incorrect and misleading to call this resolver "netgo".
we call *Resolver.LookupHost
in src/net/lookup.go
, a file that is not protected by any build tags (https://github.com/golang/go/blob/go1.18.2/src/net/lookup.go#L179)
LookupHost
calls lookup_host
in src/net/lookup_unix.go
(https://github.com/golang/go/blob/go1.18.2/src/net/lookup_unix.go#L78)
we may end up calling cgoLookupHost
, for which we have two implementations: a stub in src/net/cgo_stub.go
and the real implementation in src/net/cgo_unix.go
.
the src/net/cgo_stub.go
file is guarded by the //go:build !cgo || netgo
build flags and its cgoLookupHost
implementation returns false
(https://github.com/golang/go/blob/go1.18.2/src/net/cgo_stub.go#L19)
when cgoLookupHost
returns false, we end up calling goLookupHostOrder
in src/net/dnsclient_unix.go
, which implements the "netgo" resolver (https://github.com/golang/go/blob/go1.18.2/src/net/dnsclient_unix.go#L530)
In conclusion, when we're under Unix it's proper to call "netgo" the resolver we get when CGO is disabled or the user has compiled using "netgo". That is, both "netgo" and disabling "cgo" only have an impact under Unix.
I created https://github.com/ooni/probe/issues/2120 to track the issue that we cannot consider the result of getaddrinfo
called on Android reliable unless the result is successful.
Most work required by this issue has been done. It remains to merge https://github.com/ooni/probe-cli/pull/698.
Ah, wait, another piece of work is to implement the Resolver.LookupHostAndCNAME
operation. This operation would return at the same time the addresses and the CNAME for the domain. This is the maximum amount of information one could get using the getaddrinfo
backend. For the other resolvers, it's gonna be a bit more tricky to easily expose that but I suppose we could just check whether A and AAAA replies returned a CNAME. It MAY also be useful to pause for a second and think about how to improve measurex
in light of recent DNS changes, because it MAY be that exposing this operation as a general DNS resolver operation is not actually needed or useful.
(This issue looked like medium priority when we triaged it, but it's actually high priority now.)
As of 3.17.0-alpha.1, we correctly detect this issue for Android by using the android_dns_cache_no_data
failure. However, we still need to correctly handle this case when processing results in webconnectivity LTE. I am going to add a note to the webconnectivity LTE codebase referring back to this issue comment.
We noticed that certain NXDOMAIN errors are not properly handled by the engine.
Here is a sample measurement that results in an unknown_failure: https://explorer.ooni.org/measurement/20220215T025431Z_webconnectivity_HK_9231_n1_oU1CBKrQbHFoou5h?input=http%3A%2F%2Fwww.hongkongwatch.org