thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0
211 stars 55 forks source link

c8y proxy fails with sporadic dns lookup address errors #2321

Open reubenmiller opened 9 months ago

reubenmiller commented 9 months ago

Describe the bug

Requests to thin-edge.io Cumulocity IoT proxy (localhost:8001) returns sporadically with a 502 Bad Gateway status code.

The tedge-mapper-c8y which is providing the local Cumulocity IoT proxy logs the following error on a failed proxy attempt:

Oct 06 11:32:26 pippin tedge-mapper[459]: 2023-10-06T01:32:26.117459217Z ERROR c8y_auth_proxy::server:   × making proxied request to https://test-ci-runner01.latest.stage.c8y.io/
Oct 06 11:32:26 pippin tedge-mapper[459]:   │ event/events/3767200/binaries
Oct 06 11:32:26 pippin tedge-mapper[459]:   ├─▶ error sending request for url (https://test-ci-
Oct 06 11:32:26 pippin tedge-mapper[459]:   │   runner01.latest.stage.c8y.io/event/events/3767200/binaries): error
Oct 06 11:32:26 pippin tedge-mapper[459]:   │   trying to connect: dns error: failed to lookup address information:
Oct 06 11:32:26 pippin tedge-mapper[459]:   │   Try again
Oct 06 11:32:26 pippin tedge-mapper[459]:   ├─▶ error trying to connect: dns error: failed to lookup address
Oct 06 11:32:26 pippin tedge-mapper[459]:   │   information: Try again
Oct 06 11:32:26 pippin tedge-mapper[459]:   ├─▶ dns error: failed to lookup address information: Try again
Oct 06 11:32:26 pippin tedge-mapper[459]:   ╰─▶ failed to lookup address information: Try again

However using curl to communicate directly with the Cumulocity IoT tenant is able to successfully communicate with the tenant. A script was written to compare requests against the local proxy and direct communication with the Cumulocity IoT Tenant:

In combination with the duration of the failed requests and the error message printed in the log, there is a strong correlation with the duration of the request and the failure rate. It seems that if the DNS resolver takes longer than 5 seconds, then proxied request fails.

$ ./check_proxy.sh
------- Attempt 1 (12:09:47) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 2 (12:09:55) -------
proxy   PASSED  2s
curl    PASSED  6s
------- Attempt 3 (12:10:08) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 4 (12:10:16) -------
proxy   PASSED  2s
curl    PASSED  6s
------- Attempt 5 (12:10:29) -------
proxy   FAILED  5s
curl    PASSED  1s
------- Attempt 6 (12:10:40) -------
proxy   FAILED  5s
curl    PASSED  1s
------- Attempt 7 (12:10:51) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 8 (12:10:59) -------
proxy   FAILED  5s
curl    PASSED  1s
------- Attempt 9 (12:11:10) -------
proxy   FAILED  5s
curl    PASSED  7s
------- Attempt 10 (12:11:27) -------
proxy   PASSED  1s
curl    PASSED  1s
-------------------------
------- Summary ---------
Attempts: 10
proxy failures: 4
curl failures: 0

To Reproduce

  1. Install thin-edge.io and connect it to a Cumulocity IoT tenant

  2. Add a non-existent IP address as a nameserver to the /etc/resolv.conf file. There must still be some valid nameservers in the list, and the non-existent IP address should still be a valid ip address.

    For example, 192.168.213.213 is not reachable (tested via ping 192.168.213.213):

    nameserver 192.168.213.213
    nameserver fd00::3e37:12ff:fe83:2a0d
    nameserver 2001:44b8:2167:a100:3e37:12ff:fe83:2a0d
  3. Copy the "check_proxy.sh" script (from the "Additional Context" section)

  4. Execute the check_proxy.sh script

    chmod +x check_proxy.sh
    ./check_proxy.sh

    It is expected that the proxy failures should be zero (assuming there is no server communication issues)

Expected behavior

The proxy should be able to handle DNS resolution failures due to misconfiguration of the nameserver (e.g. inside the /etc/resolv.conf. As long as there are some valid DNS nameservers configured, then the requests should not fail.

Screenshots

Environment (please complete the following information):

Property Value
OS [incl. version] Debian GNU/Linux 11 (bullseye)
Hardware [incl. revision] Raspberry Pi Zero 2 W Rev 1.0
System-Architecture Linux pippin 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
thin-edge.io version tedge 0.12.1~365+g5a52630

Additional context

This section contains the log output of both the test

check_proxy.sh output

$ ./check_proxy.sh
------- Attempt 1 (12:09:47) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 2 (12:09:55) -------
proxy   PASSED  2s
curl    PASSED  6s
------- Attempt 3 (12:10:08) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 4 (12:10:16) -------
proxy   PASSED  2s
curl    PASSED  6s
------- Attempt 5 (12:10:29) -------
proxy   FAILED  5s
curl    PASSED  1s
------- Attempt 6 (12:10:40) -------
proxy   FAILED  5s
curl    PASSED  1s
------- Attempt 7 (12:10:51) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 8 (12:10:59) -------
proxy   FAILED  5s
curl    PASSED  1s
------- Attempt 9 (12:11:10) -------
proxy   FAILED  5s
curl    PASSED  7s
------- Attempt 10 (12:11:27) -------
proxy   PASSED  1s
curl    PASSED  1s
-------------------------
------- Summary ---------
Attempts: 10
proxy failures: 4
curl failures: 0

Note, the check_proxy.sh script uses two slightly different urls in the comparison between the proxy and direction HTTP call. The proxy uses the /tenant/currentTenant whilst the direct HTTP call uses /tenant/loginOptions. This is because the later does not require an authorization header reducing the need to fetch a token first. Unfortunately the /tenant/loginOptions cannot be used in the proxy case as this endpoint fails when the authorization header is provided (which the proxy adds by default). However the two URLs are still deemed to be similar enough, and both require the same host name to be resolved (since there is a high likelihood that this is a DNS issue and not an authorization issue).

tedge-mapper-c8y logs

$ journalctl -fu tedge-mapper-c8y -n 0
-- Journal begins at Wed 2023-05-03 13:11:24 AEST. --
Oct 06 12:10:34 pippin tedge-mapper[459]: 2023-10-06T02:10:34.388008161Z ERROR c8y_auth_proxy::server:   × making proxied request to https://test-ci-runner01.latest.stage.c8y.io/
Oct 06 12:10:34 pippin tedge-mapper[459]:   │ tenant/currentTenant
Oct 06 12:10:34 pippin tedge-mapper[459]:   ├─▶ error sending request for url (https://test-ci-
Oct 06 12:10:34 pippin tedge-mapper[459]:   │   runner01.latest.stage.c8y.io/tenant/currentTenant): error trying to
Oct 06 12:10:34 pippin tedge-mapper[459]:   │   connect: dns error: failed to lookup address information: Try again
Oct 06 12:10:34 pippin tedge-mapper[459]:   ├─▶ error trying to connect: dns error: failed to lookup address
Oct 06 12:10:34 pippin tedge-mapper[459]:   │   information: Try again
Oct 06 12:10:34 pippin tedge-mapper[459]:   ├─▶ dns error: failed to lookup address information: Try again
Oct 06 12:10:34 pippin tedge-mapper[459]:   ╰─▶ failed to lookup address information: Try again
Oct 06 12:10:45 pippin tedge-mapper[459]: 2023-10-06T02:10:45.632226176Z ERROR c8y_auth_proxy::server:   × making proxied request to https://test-ci-runner01.latest.stage.c8y.io/
Oct 06 12:10:45 pippin tedge-mapper[459]:   │ tenant/currentTenant
Oct 06 12:10:45 pippin tedge-mapper[459]:   ├─▶ error sending request for url (https://test-ci-
Oct 06 12:10:45 pippin tedge-mapper[459]:   │   runner01.latest.stage.c8y.io/tenant/currentTenant): error trying to
Oct 06 12:10:45 pippin tedge-mapper[459]:   │   connect: dns error: failed to lookup address information: Try again
Oct 06 12:10:45 pippin tedge-mapper[459]:   ├─▶ error trying to connect: dns error: failed to lookup address
Oct 06 12:10:45 pippin tedge-mapper[459]:   │   information: Try again
Oct 06 12:10:45 pippin tedge-mapper[459]:   ├─▶ dns error: failed to lookup address information: Try again
Oct 06 12:10:45 pippin tedge-mapper[459]:   ╰─▶ failed to lookup address information: Try again
Oct 06 12:11:04 pippin tedge-mapper[459]: 2023-10-06T02:11:04.6176006Z ERROR c8y_auth_proxy::server:   × making proxied request to https://test-ci-runner01.latest.stage.c8y.io/
Oct 06 12:11:04 pippin tedge-mapper[459]:   │ tenant/currentTenant
Oct 06 12:11:04 pippin tedge-mapper[459]:   ├─▶ error sending request for url (https://test-ci-
Oct 06 12:11:04 pippin tedge-mapper[459]:   │   runner01.latest.stage.c8y.io/tenant/currentTenant): error trying to
Oct 06 12:11:04 pippin tedge-mapper[459]:   │   connect: dns error: failed to lookup address information: Try again
Oct 06 12:11:04 pippin tedge-mapper[459]:   ├─▶ error trying to connect: dns error: failed to lookup address
Oct 06 12:11:04 pippin tedge-mapper[459]:   │   information: Try again
Oct 06 12:11:04 pippin tedge-mapper[459]:   ├─▶ dns error: failed to lookup address information: Try again
Oct 06 12:11:04 pippin tedge-mapper[459]:   ╰─▶ failed to lookup address information: Try again
Oct 06 12:11:15 pippin tedge-mapper[459]: 2023-10-06T02:11:15.844906908Z ERROR c8y_auth_proxy::server:   × making proxied request to https://test-ci-runner01.latest.stage.c8y.io/
Oct 06 12:11:15 pippin tedge-mapper[459]:   │ tenant/currentTenant
Oct 06 12:11:15 pippin tedge-mapper[459]:   ├─▶ error sending request for url (https://test-ci-
Oct 06 12:11:15 pippin tedge-mapper[459]:   │   runner01.latest.stage.c8y.io/tenant/currentTenant): error trying to
Oct 06 12:11:15 pippin tedge-mapper[459]:   │   connect: dns error: failed to lookup address information: Try again
Oct 06 12:11:15 pippin tedge-mapper[459]:   ├─▶ error trying to connect: dns error: failed to lookup address
Oct 06 12:11:15 pippin tedge-mapper[459]:   │   information: Try again
Oct 06 12:11:15 pippin tedge-mapper[459]:   ├─▶ dns error: failed to lookup address information: Try again
Oct 06 12:11:15 pippin tedge-mapper[459]:   ╰─▶ failed to lookup address information: Try again

file: check_proxy.sh

#!/bin/bash

set -e

PROXY_FAILURES=0
CURL_FAILURES=0

C8Y_URL=$(tedge config get c8y.url)

check_url() {
    attempt=$1
    printf -- '------- Attempt %s (%s) -------\n' "$attempt" "$(date +%H:%M:%S)"
    start=$(date +%s)
    RESULT="PASSED"
    # if ! curl -sf http://localhost:8001/c8y/event/events/3767200/binaries >/dev/null; then
    if ! curl -sf http://localhost:8001/c8y/tenant/currentTenant >/dev/null; then
        PROXY_FAILURES=$((PROXY_FAILURES + 1))
        RESULT="FAILED"
    fi
    end=$(date +%s)
    duration=$((end - start))
    printf 'proxy\t%s\t%ds\n' "$RESULT" "$duration"

    # using curl
    start=$(date +%s)
    RESULT="PASSED"

    if ! curl -sf "https://$C8Y_URL/tenant/loginOptions" >/dev/null; then
        CURL_FAILURES=$((CURL_FAILURES + 1))
        RESULT="FAILED"
    fi
    end=$(date +%s)
    duration=$((end - start))
    printf 'curl\t%s\t%ds\n' "$RESULT" "$duration"
}

COUNTER=1
TOTAL_TRIES=10
echo
while [ "$COUNTER" -le "$TOTAL_TRIES" ]; do
    check_url "$COUNTER"
    COUNTER=$((COUNTER + 1))
    sleep 5
done

printf -- '-------------------------\n'
printf -- '------- Summary ---------\n'
printf 'Attempts: %d\n' "$TOTAL_TRIES"
printf 'proxy failures: %d\n' "$PROXY_FAILURES"
printf 'curl failures: %d\n' "$CURL_FAILURES"
reubenmiller commented 9 months ago

After doing some further debugging I found that an influencing factor is when there is an erroneous entry in the /etc/resolv.conf. For example for some reason the router that I am using is automatically adding a non-existent nameserver to the /etc/resolv.conf. 2 of the 3 nameservers are still valid however there does not seem to be any retrying of failed DNS errors.

Here was my /etc/resolv.conf configuration and the non-existent name server is marked with <=

# Generated by resolvconf
domain fritz.box
options timeout:10
nameserver 192.168.178.1        <= This IP address is not reachable!
nameserver fd00::3e37:12ff:fe83:2a0d
nameserver 2001:44b8:2167:a100:3e37:12ff:fe83:2a0d

The 5 second delay comes from the default setting in the resolv.h. But it can be changed in the /etc/resolv.conf by adding options timeout:10, however this still does not help because it seems that DNS resolution errors are not retried.

Since DNS resolution errors are difficult to detect, and other tooling like curl and dig silently retry the DNS resolution when it fails, I would suggest that we enable the trust-dns feature offered by the reqwest crate which supports some more advanced DNS resolution options (including caching).

file: Cargo.toml

reqwest = { version = "0.11", default-features = false, features = ["trust-dns"] }

After rebuilding thin-edge.io, and using the same erroneous resolv.conf file, the check_proxy.sh yielded significantly more reliable results.

$ ./check_proxy.sh
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

------- Attempt 1 (14:48:47) -------
proxy   PASSED  3s
curl    PASSED  1s
------- Attempt 2 (14:48:56) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 3 (14:49:04) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 4 (14:49:12) -------
proxy   PASSED  2s
curl    PASSED  1s
------- Attempt 5 (14:49:20) -------
proxy   PASSED  1s
curl    PASSED  2s
------- Attempt 6 (14:49:28) -------
proxy   PASSED  7s
curl    PASSED  6s
------- Attempt 7 (14:49:46) -------
proxy   PASSED  1s
curl    PASSED  3s
------- Attempt 8 (14:49:55) -------
proxy   PASSED  1s
curl    PASSED  1s
------- Attempt 9 (14:50:03) -------
proxy   PASSED  1s
curl    PASSED  1s
------- Attempt 10 (14:50:10) -------
proxy   PASSED  2s
curl    PASSED  1s
-------------------------
------- Summary ---------
Attempts: 10
proxy failures: 0
curl failures: 0
Bravo555 commented 2 months ago

As mentioned in this comment, if we were to use systemd-resolved, which overwrites /etc/resolv.conf to contain only a single, stub resolver, this issue could perhaps be solved as well.