Open reubenmiller opened 9 months ago
After doing some further debugging I found that an influencing factor is an erroneous entry in /etc/resolv.conf. For example, for some reason the router that I am using automatically adds a non-existent nameserver to /etc/resolv.conf. Two of the three nameservers are still valid, however failed DNS lookups do not seem to be retried.
Here was my /etc/resolv.conf configuration, with the non-existent nameserver marked with <=
# Generated by resolvconf
domain fritz.box
options timeout:10
nameserver 192.168.178.1 <= This IP address is not reachable!
nameserver fd00::3e37:12ff:fe83:2a0d
nameserver 2001:44b8:2167:a100:3e37:12ff:fe83:2a0d
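To check which nameservers and timeout a resolver would pick up from a file like the one above, here is a rough sketch of a parser for just the `nameserver` and `options timeout:` entries. The `ResolvConf` struct and `parse_resolv_conf` function are hypothetical names made up for illustration; they are not part of thin-edge.io or any resolver library, and the real resolv.conf format has more directives than this handles.

```rust
/// The subset of /etc/resolv.conf that matters for this issue (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct ResolvConf {
    /// Nameserver addresses in the order they appear in the file.
    nameservers: Vec<String>,
    /// Value of `options timeout:N`, if present.
    timeout_secs: Option<u64>,
}

/// Parse the `nameserver` and `options timeout:N` lines; everything else is ignored.
fn parse_resolv_conf(text: &str) -> ResolvConf {
    let mut conf = ResolvConf::default();
    for line in text.lines() {
        let mut words = line.split_whitespace();
        match words.next() {
            Some("nameserver") => {
                // Only the first token after the keyword is the address,
                // so trailing annotations on the line are dropped.
                if let Some(addr) = words.next() {
                    conf.nameservers.push(addr.to_string());
                }
            }
            Some("options") => {
                for opt in words {
                    if let Some(value) = opt.strip_prefix("timeout:") {
                        conf.timeout_secs = value.parse().ok();
                    }
                }
            }
            // Comments, `domain`, `search` and unknown directives are skipped.
            _ => {}
        }
    }
    conf
}
```

The input could come from `std::fs::read_to_string("/etc/resolv.conf")`. Each returned nameserver can then be probed individually (for example with ping or dig) to find the unreachable one.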
The 5 second delay comes from the default setting in resolv.h. It can be changed in /etc/resolv.conf by adding options timeout:10, however this still does not help because it seems that DNS resolution errors are not retried.
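To make concrete what "retrying" a failed lookup would mean, here is a minimal sketch using only the Rust standard library. It is not how thin-edge.io, reqwest or trust-dns handle resolution; `resolve_with_retry` and `resolve_host` are hypothetical helpers, and the attempt count and delay are arbitrary assumptions.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};
use std::thread::sleep;
use std::time::Duration;

/// Call `resolve` up to `attempts` times, sleeping `delay` between failed tries.
/// Returns the first non-empty result, or the last error seen.
fn resolve_with_retry<F>(
    mut resolve: F,
    attempts: u32,
    delay: Duration,
) -> io::Result<Vec<SocketAddr>>
where
    F: FnMut() -> io::Result<Vec<SocketAddr>>,
{
    let mut last_err = io::Error::new(io::ErrorKind::Other, "no attempts made");
    for attempt in 1..=attempts {
        match resolve() {
            Ok(addrs) if !addrs.is_empty() => return Ok(addrs),
            Ok(_) => {
                last_err = io::Error::new(io::ErrorKind::NotFound, "no addresses returned");
            }
            Err(err) => last_err = err,
        }
        // No need to wait after the final attempt.
        if attempt < attempts {
            sleep(delay);
        }
    }
    Err(last_err)
}

/// Resolve `host:port` through the system resolver (getaddrinfo on Linux).
fn resolve_host(host: &str, port: u16) -> io::Result<Vec<SocketAddr>> {
    (host, port).to_socket_addrs().map(|addrs| addrs.collect())
}
```

A caller would wrap the lookup as `resolve_with_retry(|| resolve_host(host, 443), 3, Duration::from_millis(200))`. Taking the lookup as a closure keeps the retry logic testable without network access.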
Since DNS resolution errors are difficult to detect, and other tooling like curl and dig silently retry the DNS resolution when it fails, I would suggest that we enable the trust-dns feature offered by the reqwest crate, which supports some more advanced DNS resolution options (including caching).
file: Cargo.toml
reqwest = { version = "0.11", default-features = false, features = ["trust-dns"] }
After rebuilding thin-edge.io, and using the same erroneous resolv.conf file, the check_proxy.sh script yielded significantly more reliable results.
$ ./check_proxy.sh
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
------- Attempt 1 (14:48:47) -------
proxy PASSED 3s
curl PASSED 1s
------- Attempt 2 (14:48:56) -------
proxy PASSED 2s
curl PASSED 1s
------- Attempt 3 (14:49:04) -------
proxy PASSED 2s
curl PASSED 1s
------- Attempt 4 (14:49:12) -------
proxy PASSED 2s
curl PASSED 1s
------- Attempt 5 (14:49:20) -------
proxy PASSED 1s
curl PASSED 2s
------- Attempt 6 (14:49:28) -------
proxy PASSED 7s
curl PASSED 6s
------- Attempt 7 (14:49:46) -------
proxy PASSED 1s
curl PASSED 3s
------- Attempt 8 (14:49:55) -------
proxy PASSED 1s
curl PASSED 1s
------- Attempt 9 (14:50:03) -------
proxy PASSED 1s
curl PASSED 1s
------- Attempt 10 (14:50:10) -------
proxy PASSED 2s
curl PASSED 1s
-------------------------
------- Summary ---------
Attempts: 10
proxy failures: 0
curl failures: 0
As mentioned in this comment, if we were to use systemd-resolved, which overwrites /etc/resolv.conf to contain only a single stub resolver, this issue could perhaps be solved as well.
Describe the bug
Requests to the thin-edge.io Cumulocity IoT proxy (localhost:8001) sporadically return a 502 Bad Gateway status code.
The tedge-mapper-c8y, which provides the local Cumulocity IoT proxy, logs the following error on a failed proxy attempt:
However, using curl to communicate directly with the Cumulocity IoT tenant succeeds. A script was written to compare requests against the local proxy with direct communication with the Cumulocity IoT tenant:
Combining the duration of the failed requests with the error message printed in the log, there is a strong correlation between the duration of the request and the failure rate. It seems that if the DNS resolver takes longer than 5 seconds, then the proxied request fails.
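One way to check this correlation outside the proxy is to time the lookup on its own and compare it against the 5 second mark. The sketch below is only an illustration under that assumption; `timed`, `is_slow` and `SLOW_LOOKUP` are hypothetical names, and the threshold is taken from the observation above rather than from the thin-edge.io source.

```rust
use std::time::{Duration, Instant};

/// Assumed threshold: lookups slower than this appeared to coincide with 502 responses.
const SLOW_LOOKUP: Duration = Duration::from_secs(5);

/// Run `op` once and return its result together with the elapsed wall-clock time.
fn timed<T>(op: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = op();
    (out, start.elapsed())
}

/// True when a measured lookup exceeded the assumed threshold.
fn is_slow(elapsed: Duration) -> bool {
    elapsed > SLOW_LOOKUP
}
```

Wrapping a standard library lookup of the tenant host name, such as a `to_socket_addrs()` call, in `timed` and logging `is_slow(elapsed)` for each attempt would show whether slow resolutions line up with the failed proxy requests.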
To Reproduce
Install thin-edge.io and connect it to a Cumulocity IoT tenant
Add a non-existent IP address as a nameserver to the /etc/resolv.conf file. There must still be some valid nameservers in the list, and the non-existent IP address should still be a valid IP address. For example, 192.168.213.213 is not reachable (tested via ping 192.168.213.213):
Copy the "check_proxy.sh" script (from the "Additional Context" section)
Execute the check_proxy.sh script
It is expected that the proxy failures should be zero (assuming there are no server communication issues)
Expected behavior
The proxy should be able to handle DNS resolution failures due to misconfiguration of the nameservers (e.g. inside /etc/resolv.conf). As long as there are some valid DNS nameservers configured, the requests should not fail.
Screenshots
Environment (please complete the following information):
Debian GNU/Linux 11 (bullseye)
Raspberry Pi Zero 2 W Rev 1.0
Linux pippin 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
tedge 0.12.1~365+g5a52630
Additional context
This section contains the log output of both the test
check_proxy.sh output
Note, the check_proxy.sh script uses two slightly different URLs in the comparison between the proxy and the direct HTTP call. The proxy uses /tenant/currentTenant whilst the direct HTTP call uses /tenant/loginOptions. This is because the latter does not require an authorization header, reducing the need to fetch a token first. Unfortunately /tenant/loginOptions cannot be used in the proxy case as this endpoint fails when the authorization header is provided (which the proxy adds by default). However the two URLs are still deemed to be similar enough, and both require the same host name to be resolved (since there is a high likelihood that this is a DNS issue and not an authorization issue).
tedge-mapper-c8y logs
file: check_proxy.sh