mender-client connection issue

mendersoftware / mender-server

mender-client connection issue #112

Closed NTohan closed 1 month ago

NTohan commented 1 month ago

With main branch on hash 7dfc4d8501f58142cb0d45595ea3d0163908efa6 mender-client is not able to connect to the mender-server deployed in the production mode.

mender-client connection is tested on the target device using a local IP with mender-server running on the same network

$ wget -O- https://get.mender.io/ | sudo bash -s -- --demo --force-mender-client4 -- --quiet --device-type "genericx86-64" --demo --server-url https://192.168.x.x/ --server-cert=""

The command above results in no connection, and no new device shows up as pending approval on the mender-server running in the same network.

Workaround: with the Mender server behind a Cloudflare reverse proxy, mender-client is able to connect to mender-server once the https redirect scheme and TLS on websecure are disabled.

$ git diff
diff --git a/docker-compose.yml b/docker-compose.yml
index b363769..e597262 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -330,13 +330,13 @@ services:
       - --api.insecure=true
       - --accesslog=true
       - --entrypoints.web.address=:80
-      - --entrypoints.web.http.redirections.entryPoint.scheme=https
+      - --entrypoints.web.http.redirections.entryPoint.scheme=http
       - --entrypoints.web.http.redirections.entryPoint.to=websecure
       - --entrypoints.websecure.address=:443
       - --entrypoints.websecure.transport.respondingTimeouts.idleTimeout=7200
       - --entrypoints.websecure.transport.respondingTimeouts.readTimeout=7200
       - --entrypoints.websecure.transport.respondingTimeouts.writeTimeout=7200
-      - --entrypoints.websecure.http.tls=true
+      - --entrypoints.websecure.http.tls=false
       - --providers.file.directory=/etc/traefik/config

With the https redirect scheme and websecure.http.tls disabled in the mender-server config, we rely on the security layer of a CDN service (Cloudflare) through a secure Cloudflare tunnel (Cloudflare reverse proxy). This setup lets us access mender-server at https://devices.domain.xxx, which points to http://192.168.x.x:**443**, and install mender-client on the target device using:

$ wget -O- https://get.mender.io/ | sudo bash -s -- --demo --force-mender-client4 -- --quiet --device-type "genericx86-64" --demo --server-url https://devices.domain.xxx/ --server-cert=""

Is this a recommended workflow for using mender-server, or would you recommend relying on mender-server's https and websecure due to security concerns?

Thank you in advance for any inputs.

Cheers, N.T.

alfrunes commented 1 month ago

Hello, thanks again for the feedback. The Docker Compose setup is meant for demo / evaluation purposes only and should not be used for production environments. When going to production, I highly suggest using Kubernetes with our helm chart https://github.com/mendersoftware/mender-helm.

If you still want to use the docker compose setup, https://github.com/mendersoftware/mender-server/pull/110 introduces a (self-signed) demo certificate. You can use that as a basis and issue your own public certificate and mount it (for example by replacing the compose/certs/mender.crt and compose/certs/mender.key with your certificate and key respectively).

alfrunes commented 1 month ago

Another option, if you want to use compose behind a reverse proxy that handles TLS termination: you can simply disable the websecure entrypoint and the redirection as follows:

diff --git a/docker-compose.yml b/docker-compose.yml
index b363769..e597262 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -330,13 +330,13 @@ services:
       - --api.insecure=true
       - --accesslog=true
       - --entrypoints.web.address=:80
-      - --entrypoints.web.http.redirections.entryPoint.scheme=https
-      - --entrypoints.web.http.redirections.entryPoint.to=websecure
-      - --entrypoints.websecure.address=:443
-      - --entrypoints.websecure.transport.respondingTimeouts.idleTimeout=7200
-      - --entrypoints.websecure.transport.respondingTimeouts.readTimeout=7200
-      - --entrypoints.websecure.transport.respondingTimeouts.writeTimeout=7200
-      - --entrypoints.websecure.http.tls=true

This way, the mender-server is exposed in plain HTTP (no TLS) on port 80.

NTohan commented 1 month ago

Thank you for your quick response and suggestions. I came across the helm suggestion earlier too, but simply due to limited resources on my server, I went with docker compose. Nevertheless, if you are suggesting that docker compose is/will not be maintained actively like helm, I would seriously consider upgrading my server.

Unfortunately, your suggestion of exposing plain HTTP did not work for me: I keep getting 404 page not found at http://192.168.x.x:80

diff --git a/docker-compose.yml b/docker-compose.yml
index b363769..c9fdd05 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -330,13 +330,13 @@ services:
       - --api.insecure=true
       - --accesslog=true
       - --entrypoints.web.address=:80
-      - --entrypoints.web.http.redirections.entryPoint.scheme=https
-      - --entrypoints.web.http.redirections.entryPoint.to=websecure
-      - --entrypoints.websecure.address=:443
-      - --entrypoints.websecure.transport.respondingTimeouts.idleTimeout=7200
-      - --entrypoints.websecure.transport.respondingTimeouts.readTimeout=7200
-      - --entrypoints.websecure.transport.respondingTimeouts.writeTimeout=7200
-      - --entrypoints.websecure.http.tls=true
+        #- --entrypoints.web.http.redirections.entryPoint.scheme=https
+        #- --entrypoints.web.http.redirections.entryPoint.to=websecure
+        #- --entrypoints.websecure.address=:443
+        #- --entrypoints.websecure.transport.respondingTimeouts.idleTimeout=7200
+        #- --entrypoints.websecure.transport.respondingTimeouts.readTimeout=7200
+        #- --entrypoints.websecure.transport.respondingTimeouts.writeTimeout=7200
+        #- --entrypoints.websecure.http.tls=true
       - --providers.file.directory=/etc/traefik/config
       - --providers.docker=true
       - --providers.docker.exposedByDefault=false

Also, I will try generating certificates for extra layer of security but behind reverse proxy it is more or less redundant.

alfrunes commented 1 month ago

Sorry, I was too quick to reply and did not actually test my suggestion. I identified the problem and addressed it in the PR (https://github.com/mendersoftware/mender-server/pull/110/commits/cf0b9b2175e9cf99c320e623b897f0c6a57137ca). If you now try to remove the websecure entrypoint from Traefik (as I described in my previous comment) it will serve plain HTTP on port 80.

> I came across the helm suggestion earlier too, but simply due to limited resources on my server, I went with docker compose. Nevertheless, if you are suggesting that docker compose is/will not be maintained actively like helm, I would seriously consider upgrading my server.

That's fine as long as it's not for a production environment. For small deployments you should be fine using this docker compose, but you might run into trouble if you need to scale the number of replicas, or load balance the application across machines. We will continue to use and improve the docker compose environment as we are also using it internally when doing system integration testing.

> Also, I will try generating certificates for extra layer of security but behind reverse proxy it is more or less redundant.

As long as your ingress (proxy) is using TLS and the docker composition is not exposed outside your secure LAN you should be ok, but be aware that port 80 will be exposed to your local network.

NTohan commented 1 month ago

Thank you for your fix. I am happy to confirm that with your changes and the websecure entrypoint removed, I am able to connect on port 80. Is it possible to configure the default port from 80 to something else?

The only changes I have noticed are that the inventory information under a registered device is now empty and that the column configuration options under table configuration are much more limited. Could this be related to the recent changes? It is not a blocker for me, though.

Yes, you are right about the scaling and load balancing features K8s has to offer. I will definitely consider it for scaling.

Regarding TLS, it is enabled by default on the reverse proxy in use.

NTohan commented 1 month ago

Thank you for your fix. I am happy to confirm that with your changes and the websecure entrypoint removed, I am able to connect on port 80. Is it possible to configure the default port from 80 to something else?

Unfortunately, deployments are constantly failing with docker compose at https://github.com/mendersoftware/mender-server/commit/cf0b9b2175e9cf99c320e623b897f0c6a57137ca and the websecure entrypoint removed:

2024-10-19 20:58:30.714 +0000 UTC warning: Host not found (non-authoritative), try again later: GET https://s3.mender.local/mender/1392ab9b-95f7-43f4-b3f6-a12ed1f94ad2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=mender%2F20241019%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241019T205829Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3D%22hello-world-container-update.mender%22&response-content-type=application%2Fvnd.mender-artifact&x-id=GetObject&X-Amz-Signature=9d1aa76146d3063870369e64ba1c99580d292a8a5c2c5684ce0d34512a992655:

Please find the complete logs in the file: deployment-log-80fcbbce-5e16-4c3d-a77b-558031d77b3e-hello-world-container-update-2024-10-19T21_08_32.096Z.log

Additional information: mender-client is installed on the target device using:

wget -O- https://get.mender.io | sudo bash -s -- --demo --force-mender-client4 -- --quiet --device-type "genericx86-64" --demo --server-url https://devices.domain.xxx --server-cert=""

mender-client is configured with the following parameters:

$ cat /etc/mender/mender.conf
{
    "HttpsClient": {},
    "Security": {},
    "Connectivity": {},
    "DeviceTypeFile": "/var/lib/mender/device_type",
    "UpdateControlMapExpirationTimeSeconds": 90,
    "UpdateControlMapBootExpirationTimeSeconds": 45,
    "UpdatePollIntervalSeconds": 5,
    "InventoryPollIntervalSeconds": 5,
    "RetryPollIntervalSeconds": 30,
    "Servers": [
        {
            "ServerURL": "https://devices.domain.xxx"
        }
    ]
}

$ cat /var/lib/mender/device_type
device_type=genericx86-64
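
For comparison, here is a minimal sketch (not part of Mender) that pulls the ServerURL out of mender.conf and the hostname out of the failing artifact URL, so the mismatch between devices.domain.xxx and s3.mender.local becomes visible on the device. The helper names server_urls and url_host are hypothetical, and the sed expression assumes each ServerURL sits on its own line as in the file above; jq would be more robust where it is available.

```shell
# Hypothetical helpers for illustration only.

# Print every "ServerURL" value found in a mender.conf-style JSON file.
# Assumes one ServerURL per line, as in the dump above.
server_urls() {
  sed -n 's/.*"ServerURL"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    "${1:-/etc/mender/mender.conf}"
}

# Print only the hostname of a URL (scheme, port, path and query removed).
url_host() {
  rest=${1#*://}      # drop the scheme
  rest=${rest%%/*}    # drop the path and query
  printf '%s\n' "${rest%%:*}"   # drop an optional port
}

# Example (on the device):
#   server_urls                     # the server the client talks to
#   url_host "<artifact URL from the deployment log>"
```

If the two hostnames differ, the device also has to be able to resolve and reach the second one, which is what the "Host not found" warning in the log above is complaining about.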

Also, I have tried re-installing the mender-client on the target device and re-created the deployment but unfortunately the deployments are still failing.

Can you please check and confirm whether this is an issue in the under-development docker-compose setup, or whether I have misconfigured the mender-server on my side?

Thank you for your efforts.

alfrunes commented 1 month ago

Sorry, I forgot about the s3 bucket configuration. At this point, I think these customizations deserve a docker compose override file to make them easier to explain. I can see two options for making this work.

1. Use an s3 bucket and configure the server to use this bucket.

# docker-compose.http.yml
# docker compose -f docker-compose.yml -f docker-compose.http.yml up -d
services:
  traefik:
    command:
      - --api=true
      - --api.insecure=true
      - --accesslog=true
      - --entrypoints.web.address=:80
      - --providers.file.directory=/etc/traefik/config
      - --providers.docker=true
      - --providers.docker.exposedByDefault=false
  deployments:
    environment:
      DEPLOYMENTS_PRESIGN_URL_HOSTNAME: "<your gateway domain name>"
      DEPLOYMENTS_PRESIGN_SECRET: "<Generate a random base64 secret, for example: head -c 16 /dev/urandom | base64 -w 0 >"
      DEPLOYMENTS_STORAGE_BUCKET: "<BUCKET_NAME>"
      DEPLOYMENTS_AWS_URI: "<https://BUCKET_NAME.AWS_REGION.amazonaws.com>"
      DEPLOYMENTS_AWS_EXTERNAL_URI: "<https://BUCKET_NAME.AWS_REGION.amazonaws.com>"
      DEPLOYMENTS_AWS_AUTH_KEY: "${AWS_ACCESS_KEY_ID}"
      DEPLOYMENTS_AWS_AUTH_SECRET: "${AWS_SECRET_ACCESS_KEY}"
  s3fs:
    scale: 0

Where DEPLOYMENTS_AWS_AUTH_KEY and DEPLOYMENTS_AWS_AUTH_SECRET are set to the access key ID and secret access key for the s3 bucket.

2. Create a rule in the reverse proxy that forwards to s3.mender.local (with hostname rewrite).

# docker compose -f docker-compose.yml -f docker-compose.http.yml up -d
services:
  traefik:
    command:
      - --api=true
      - --api.insecure=true
      - --accesslog=true
      - --entrypoints.web.address=:80
      - --providers.file.directory=/etc/traefik/config
      - --providers.docker=true
      - --providers.docker.exposedByDefault=false
  deployments:
    environment:
      DEPLOYMENTS_PRESIGN_URL_HOSTNAME: "<your gateway domain name>"
      DEPLOYMENTS_PRESIGN_SECRET: "<Generate a random base64 secret, for example: head -c 16 /dev/urandom | base64 -w 0 >"
      DEPLOYMENTS_AWS_PROXY_URI: "https://<your domain>/mender" # Setup a rule that maps `/mender` to `s3.mender.local` with host header rewrite.

[!IMPORTANT] If you go with option 2, make sure you are using my latest version of the PR (https://github.com/mendersoftware/mender-server/pull/110/commits/bde5c31a7ca8e153683a207b599f6ddbb5cefd5f) and set the MENDER_SECRET_ACCESS_KEY environment variable to a secret value, as the s3 storage would otherwise be exposed with an insecure access key.
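
A minimal sketch of generating the secrets mentioned above, based on the head/base64 hint in the override file. The function name gen_secret is hypothetical, and using the same generator for MENDER_SECRET_ACCESS_KEY is an assumption; the thread only says that it must be a secret value.

```shell
# Hypothetical helper: print N random bytes (default 16) as one base64 line.
# tr strips any line wrapping, so this does not rely on GNU base64's -w flag.
gen_secret() {
  nbytes=${1:-16}
  head -c "$nbytes" /dev/urandom | base64 | tr -d '\n'
}

# Example usage before bringing the composition up:
#   export MENDER_SECRET_ACCESS_KEY="$(gen_secret 32)"
#   gen_secret 16    # paste the output into DEPLOYMENTS_PRESIGN_SECRET
```

Generate each value once and keep it stable across restarts; regenerating the presign secret would presumably invalidate download URLs that were already handed out.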

> Is it possible to configure the default port from 80 to something else?

Yes, simply change the port number for the web service in the traefik service, for example:

diff --git a/docker-compose.yml b/docker-compose.yml
index b363769..c9fdd05 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -330,13 +330,13 @@ services:
       - --api.insecure=true
       - --accesslog=true
-      - --entrypoints.web.address=:80
+      - --entrypoints.web.address=:8080

NTohan commented 1 month ago

Thank you for your patch and summary on possible options. I like your approach with docker compose override file.

I have tried option 2 and am sorry to report that deployments are not working for me.

Here are the steps:

Step 1: Use your latest changes from https://github.com/mendersoftware/mender-server/commit/bde5c31a7ca8e153683a207b599f6ddbb5cefd5f

$ git log
commit cf0b9b2175e9cf99c320e623b897f0c6a57137ca (HEAD)
Author: Alf-Rune Siqveland <alf.rune@northern.tech>
Date:   Fri Oct 18 15:07:22 2024 +0200

    chore(docker): Remove hard-coded entrypoint in routes and define default

    Signed-off-by: Alf-Rune Siqveland <alf.rune@northern.tech>

Step 2: Setup a rule that maps /mender to s3.mender.local

Step 3: Add a new docker compose override file docker-compose.http.yml

$ git diff
$ export MENDER_SECRET_ACCESS_KEY=generated_random_key
$ cat docker-compose.http.yml
services:
  traefik:
    command:
      - --api=true
      - --api.insecure=true
      - --accesslog=true
      - --entrypoints.web.address=:80
      - --providers.file.directory=/etc/traefik/config
      - --providers.docker=true
      - --providers.docker.exposedByDefault=false
  deployments:
    environment:
      DEPLOYMENTS_PRESIGN_URL_HOSTNAME: "devices.domain.xxx"
      DEPLOYMENTS_PRESIGN_SECRET: "generated_random_key" #"<Generate a random base64 secret, for example: head -c 16 /dev/urandom | base64 -w 0 >"
      DEPLOYMENTS_AWS_PROXY_URI: "https://domain.xxx/mender" # Setup a rule that maps `/mender` to `s3.mender.local` with host header rewrite.

$  docker compose -f docker-compose.yml -f docker-compose.http.yml up --build

Step 4: Try a demo deployment

Unfortunately, I keep getting this error on mender-server

Couldn't load deployments. Cannot read properties of undefined (reading 'length') Retrying in 9 seconds...

Please find the logs from _mender-deployments-1_logs (1).txt

Also, how to make sure that s3.mender.local is setup properly on my server? Is it possible to bypass the local domain s3.mender.local and replace it with something like http://<local_ip>:<port> within docker compose? This might be helpful to test if reverse-proxy is able to point to http://<local_ip>:<port> and has issue resolving to the suggested s3.mender.local.

> Yes, simply change the port number for the web service in the traefik service, for example:

Thank you for the suggestion. I will try adapting the exposure port after deployments are functional.

alfrunes commented 1 month ago

I merged the PR to main with one notable change: the domain name changed from mender.local to docker.mender.io, this is to avoid potential conflicts with mDNS top-level domain (.local).

> Please find the logs from _mender-deployments-1_logs (1).txt

The only thing that sticks out from the logs here is that it seems like you're trying to recreate a deployment that is already in progress. The error observed seems to be coming from an unexpected response that is not handled in the frontend. I'm not sure exactly what's causing this, but I will look into it.

To test if your setup works, you could try uploading an artifact and then download it again.

> Also, how to make sure that s3.mender.local is setup properly on my server?

It seems like the problem is that s3.mender.local (now docker.mender.io on main) should map to localhost, so you should add a routing entry mapping this hostname back to your localhost. That is, on Linux you need to edit /etc/hosts: echo "127.0.0.1 s3.mender.local" | sudo tee -a /etc/hosts. On Windows I believe you need to append the entry to C:\Windows\System32\drivers\etc\hosts. Alternatively, you could set up a local DNS on your LAN, adding an A record mapping the domain back to your private IP.
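
Because tee -a appends a new line on every run, an idempotent variant may be handy when the step is scripted. The sketch below uses hypothetical helper names (has_host, ensure_hosts_entry) and takes the hosts file as an argument so it can be tried on a copy first.

```shell
# Hypothetical helpers for illustration only.

# Succeed if HOST appears as a name on a non-comment line of FILE.
has_host() {
  awk -v h="$2" '
    /^[[:space:]]*#/ { next }                 # skip comment lines
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break                  # stop at a trailing comment
        if ($i == h) found = 1
      }
    }
    END { exit !found }
  ' "$1"
}

# Append "IP HOST" to FILE for every HOST that is not mapped yet.
ensure_hosts_entry() {
  file=$1
  ip=$2
  shift 2
  for host in "$@"; do
    if ! has_host "$file" "$host"; then
      printf '%s %s\n' "$ip" "$host" >> "$file"
    fi
  done
}

# Example (needs root for the real file):
#   ensure_hosts_entry /etc/hosts 127.0.0.1 s3.mender.local
```

Running it twice leaves a single entry per hostname, unlike repeated echo ... | tee -a calls.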

> Is it possible to bypass the local domain s3.mender.local and replace it with something like http://<local_ip>:<port> within docker compose? This might be helpful to test if reverse-proxy is able to point to http://<local_ip>:<port> and has issue resolving to the suggested s3.mender.local.

This is possible, but I would not recommend it as containers get their IPs reassigned every time you restart the docker compose environment. But for the sake of completeness, you can forward the requests to the IP of the s3fs container which you can get by running:

docker inspect mender-s3fs-1 --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'

The S3 API is exposed on port 8333 on this IP address.

And one more thing, just to double check. I hope you've replaced the value of the MENDER_SECRET_ACCESS_KEY before running docker compose up -d. It is important that this is secret as it would otherwise give public access to the blob storage.

NTohan commented 1 month ago

> It seems like the problem is that s3.mender.local (now docker.mender.io on main) should map to localhost, so you should add a routing entry mapping this hostname back to your localhost.

Okay, thank you for pointing out the changes merged to main in the meanwhile. I am now pointing to cd5f6108bb51345d12e67816f13ee1b4507c986c and mapped docker.mender.io and s3.docker.mender.io to 127.0.0.1 as mentioned in README.md and adapted my proxy rule https://domain.xxx/mender to http://docker.mender.io .

 $ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 <hostname>

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
127.0.0.1   docker.mender.io s3.docker.mender.io

[!IMPORTANT] The reverse proxy rule that maps mender-server's default port 80 is set up with a subdomain, devices.domain.xxx; therefore DEPLOYMENTS_PRESIGN_URL_HOSTNAME is also set to devices.domain.xxx. The rule for DEPLOYMENTS_AWS_PROXY_URI, on the other hand, is set up with a path, domain.xxx/mender.
  DEPLOYMENTS_PRESIGN_URL_HOSTNAME: "devices.domain.xxx"
  DEPLOYMENTS_PRESIGN_SECRET: "generated_random_key" #"<Generate a random base64 secret, for example: head -c 16 /dev/urandom | base64 -w 0 >"
  DEPLOYMENTS_AWS_PROXY_URI: "https://domain.xxx/mender" # Setup a rule that maps `/mender` to `s3.mender.local` with host header rewrite.

The same important note applies to my last comment. I hope that is not the issue, because I am not able to upload and download artifacts as you suggested for testing.

I have also observed that mender-deployments-1 keeps exiting, so I had to start the container manually. I have therefore changed the restart policy to unless-stopped. Nevertheless, there seem to be permission issues: main: failed to setup storage client: s3: failed to check bucket preconditions: s3: insufficient permissions for accessing bucket 'mender'

time="2024-10-22T09:56:01Z" level=info msg="Deployments Service starting up" caller="main.cmdServer@main.go:159"
time="2024-10-22T09:56:01Z" level=info msg="automigrate is ON, will apply migrations" caller="mongo.Migrate@migrations.go:50"
time="2024-10-22T09:56:01Z" level=info msg="migrating deployment_service" caller="mongo.MigrateSingle@migrations.go:72"
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.1 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.2 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.3 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.4 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.5 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.6 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.7 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.9 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.10 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.11 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.13 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.14 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.15 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.16 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="migration to version 1.2.17 skipped" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:128" db=deployment_service
time="2024-10-22T09:56:01Z" level=info msg="DB migrated to version 1.2.17" caller="migrate.(*SimpleMigrator).Apply@migrator_simple.go:143" db=deployment_service
main: failed to setup storage client: s3: failed to check bucket preconditions: s3: insufficient permissions for accessing bucket 'mender'

Any ideas, what could be the reason for insufficient permissions?

Also, I have observed errors like:

traefik-1                 | 2024-10-22T09:56:00Z ERR error="service \"deployments\" error: unable to find the IP address for the container \"/mender-deployments-1\": the server is ignored" container=deployments-mender-b362e1357ad64f71ae79a8430316c357d0ef352eace0d8b75f5ecd221e0b8020 providerName=docker
traefik-1                 | 2024-10-22T09:56:00Z ERR error="service \"deployments\" error: unable to find the IP address for the container \"/mender-deployments-1\": the server is ignored" container=deployments-mender-b362e1357ad64f71ae79a8430316c357d0ef352eace0d8b75f5ecd221e0b8020 providerName=docker

> And one more thing, just to double check. I hope you've replaced the value of the MENDER_SECRET_ACCESS_KEY before running docker compose up -d. It is important that this is secret as it would otherwise give public access to the blob storage.

Yes, I have created a new key for my setup but still thank you for mentioning it as I also see it worth mentioning to avoid public access to artifacts. 👍

NTohan commented 1 month ago

Okay, the issue with permission seems to be a race condition and first removing then relaunching all containers seems to fix it.
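
If this really is a startup race, one hedged alternative to removing and relaunching everything is to wait until the storage container reports healthy and only then restart the deployments service. The sketch below is a generic retry helper; wait_for and s3fs_healthy are hypothetical names, and the container name mender-s3fs-1 is taken from the docker inspect command earlier in this thread.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds after each failure. Returns 0 as soon as the command succeeds,
# 1 if it never does.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example usage (assumes the s3fs container defines a healthcheck):
#   s3fs_healthy() {
#     [ "$(docker inspect --format '{{.State.Health.Status}}' mender-s3fs-1)" = healthy ]
#   }
#   wait_for 30 2 s3fs_healthy && docker compose restart deployments
```

Whether this avoids the "insufficient permissions" error depends on the actual cause, which has not been confirmed here.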

_mender-s3fs-1_logs.txt _mender-deployments-1_logs (5).txt

However, regarding your suggested test, I am not able to download artifacts from mender-server either. When I click on DOWNLOAD ARTIFACT, I am redirected to https://s3.docker.mender.io/mender/d1bae418-85ae-41bf-b7fd-a27bf866d43e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=mender%2F20241022%2Fus-east-<...> whose origin should have been replaced, according to my proxy rule, with https://devices.domain.xxx/mender .

Also, manually replacing https://s3.docker.mender.io/mender/... with https://devices.domain.xxx/mender/... leads to {"error": {"status_code": 404,"message": "Not Found"}}
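
The manual substitution can be sketched as a small helper for repeatable testing. rewrite_origin is a hypothetical name; it swaps only the scheme and host and keeps the path and query string untouched. Since the presigned URL lists host in X-Amz-SignedHeaders, the signature covers the Host header, so the rewritten URL can only validate if the proxy forwards the request with the Host value the URL was signed for. That is general SigV4 behavior and has not been verified against this setup.

```shell
# Hypothetical helper: replace scheme://host[:port] of URL with NEW_ORIGIN,
# keeping the path and query string exactly as they are.
rewrite_origin() {
  url=$1
  origin=${2%/}            # tolerate a trailing slash on the new origin
  rest=${url#*://}         # drop the scheme
  case "$rest" in
    */*) path=/${rest#*/} ;;   # everything after the first slash
    *)   path=/ ;;             # URL had no path at all
  esac
  printf '%s%s\n' "$origin" "$path"
}

# Example:
#   curl -I "$(rewrite_origin "$presigned_url" https://devices.domain.xxx)"
```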

@alfrunes Just to be sure, can you please double check if DEPLOYMENTS_AWS_PROXY_URI is handled properly internally? Thank you very much in advance.

alfrunes commented 1 month ago

I made a typo with one of the environment variables in the override. It should be DEPLOYMENTS_STORAGE_PROXY_URI and not DEPLOYMENTS_AWS_PROXY_URI. Sorry about the inconvenience.

NTohan commented 1 month ago

Thank you for the correct environment variable name. With DEPLOYMENTS_STORAGE_PROXY_URI, the re-routing to my proxy rule is fixed:

https://devices.domain.xxx/mender/b0a73540-e58a-4d28-87b2-823ee810f7f1?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=mender%2F20241022%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241022T205643Z&X-Amz-Expires=900&X-Amz-Signature=c41c5dc48c65625b527a8bf7ef7608148995393a82125306a4274b78b4c8e73f&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B+filename%3D%22hello-world-container-update.mender%22&response-content-type=application%2Fvnd.mender-artifact&x-id=GetObject

Nevertheless, I strongly believe that there is still something wrong with the handling of the proxy URI within the services. At least, that is what I observe in the mender-gui-1 logs. Below are two attempts to download two different artifacts manually; the host name is set correctly (host: "devices.domain.xxx").

2024/10/22 20:46:59 [error] 8#8: *5 open() "/var/www/mender-gui/dist/mender/a119f7e4-f665-49d1-96f2-92c6038e7521" failed (2: No such file or directory), client: 172.18.0.2, server: , request: "GET /mender/a119f7e4-f665-49d1-96f2-92c6038e7521?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=mender%2F20241022%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241022T204620Z&X-Amz-Expires=900&X-Amz-Signature=d14cdfa1bc00665f9e372b28044b96f827eff18359b2f96da29da1ce97fd97c2&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B+filename%3D%22hello-world-container-update.mender%22&response-content-type=application%2Fvnd.mender-artifact&x-id=GetObject HTTP/1.1", host: "devices.domain.xxx"

172.18.0.2 - - [22/Oct/2024:20:46:59 +0000] "GET /mender/a119f7e4-f665-49d1-96f2-92c6038e7521?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=mender%2F20241022%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241022T204620Z&X-Amz-Expires=900&X-Amz-Signature=d14cdfa1bc00665f9e372b28044b96f827eff18359b2f96da29da1ce97fd97c2&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B+filename%3D%22hello-world-container-update.mender%22&response-content-type=application%2Fvnd.mender-artifact&x-id=GetObject HTTP/1.1" 404 178 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"

172.18.0.2 - - [22/Oct/2024:20:46:59 +0000] "GET /404.json HTTP/1.1" 404 54 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"

.....

127.0.0.1 - - [22/Oct/2024:20:49:42 +0000] "GET /ui/ HTTP/1.1" 200 869 "-" "Wget"

2024/10/22 20:49:48 [error] 8#8: *39 open() "/var/www/mender-gui/dist/mender/10b11048-72c6-42ac-ad08-90e896ec8638" failed (2: No such file or directory), client: 172.18.0.2, server: , request: "GET /mender/10b11048-72c6-42ac-ad08-90e896ec8638?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=mender%2F20241022%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241022T204933Z&X-Amz-Expires=900&X-Amz-Signature=c5ab07bd960987e3b077b316dbcbc4a4de633c9d4673228f68e1b66e2947a006&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B+filename%3D%22hello-world-container-update.mender%22&response-content-type=application%2Fvnd.mender-artifact&x-id=GetObject HTTP/1.1", host: "devices.domain.xxx"

172.18.0.2 - - [22/Oct/2024:20:49:48 +0000] "GET /mender/10b11048-72c6-42ac-ad08-90e896ec8638?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=mender%2F20241022%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241022T204933Z&X-Amz-Expires=900&X-Amz-Signature=c5ab07bd960987e3b077b316dbcbc4a4de633c9d4673228f68e1b66e2947a006&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B+filename%3D%22hello-world-container-update.mender%22&response-content-type=application%2Fvnd.mender-artifact&x-id=GetObject HTTP/1.1" 404 178 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"

172.18.0.2 - - [22/Oct/2024:20:49:48 +0000] "GET /404.json HTTP/1.1" 404 54 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"

Here is the sample artifact under test : hello-world-container-update.mender.zip.

@alfrunes Can you please check with provided artifact if uploading and downloading behind external proxy like my setup is handled properly? Thank you very much in advance. I am happy to provide you with more logs if needed, please let me know.

I really appreciate your support with all the topics. 👍

alfrunes commented 1 month ago

It seems like the Traefik routing rule to SeaweedFS (s3fs service) does not fit your reverse proxy setup. The reason why you see these requests ending up in the gui service is because that is the fallback rule (if no other routes apply). It turns out that the routing to the s3 backend is done using the hostname of requests (all requests to a s3.* subdomain). I created a PR updating this rule to use path prefix instead that you could try: #124.
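
To illustrate the difference, here is a deliberately simplified model of the routing decision. It is not Traefik's matching logic: route_for is a hypothetical function that only distinguishes the s3fs rule from the gui fallback and ignores every other router in the composition. The "host" mode approximates the existing rule (requests to an s3.* hostname) and the "path" mode approximates a /mender path prefix.

```shell
# Hypothetical, simplified model of which service receives a request.
#   route_for host <hostname> <path>   -> s3fs only for s3.* hostnames
#   route_for path <hostname> <path>   -> s3fs only for paths under /mender
# Anything else falls through to the gui service.
route_for() {
  rule=$1
  host=$2
  path=$3
  case "$rule" in
    host)
      case "$host" in
        s3.*) echo s3fs; return 0 ;;
      esac
      ;;
    path)
      case "$path" in
        /mender*) echo s3fs; return 0 ;;
      esac
      ;;
  esac
  echo gui
}

# Behind the external proxy the hostname is devices.domain.xxx, so:
#   route_for host devices.domain.xxx /mender/abc   # gui (falls through)
#   route_for path devices.domain.xxx /mender/abc   # s3fs
```

This matches the 404s in the mender-gui-1 logs above: with host-based matching, a request for /mender/... on devices.domain.xxx never reaches the storage backend.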

NTohan commented 1 month ago

Thank you for your fix. I ran some quick tests with the latest changes from https://github.com/mendersoftware/mender-server/pull/124 but am unfortunately running into the same issue. Please find the logs in _mender-gui-1_logs (2).txt

$ git diff
diff --git a/docker-compose.yml b/docker-compose.yml
index 16ae390..2eb6242 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -365,7 +365,7 @@ services:
     labels:
       traefik.enable: "true"
       traefik.http.routers.s3fs.priority: "99999"
-      traefik.http.routers.s3fs.rule: HostRegexp(`s3\..*`)
+      traefik.http.routers.s3fs.rule: PathPrefix(`/mender`)
       traefik.http.services.s3fs.loadBalancer.server.port: "8333"
     command: [server -s3 -s3.config /etc/seaweedfs/s3.conf]
     healthcheck:

alfrunes commented 1 month ago

Ok, this time I think it's a different issue. Your requests are routed correctly (no s3 requests falling back to the gui service). However, I discovered a different issue with the way SeaweedFS is set up in the docker compose setup. In short, the artifact data is not persisted across container recreations, only the file index. This is why you could see the artifacts in the UI, but trying to download them would time out after retrying a couple of times. I extended the PR to address this issue as well. If you want to try it, you have to resolve the data inconsistency. The easiest way to do this is to destroy the docker composition and bring everything up again (from https://github.com/mendersoftware/mender-server/pull/124/commits/2b2383fcfc5e9228ed3c7d16bcd3338b173959ee):

[!WARNING] This will destroy all the data for your running instance. If you need an alternative, see below.
docker compose down -v --remove-orphans

Alternatively, you can destroy only the artifacts storage (which is the only corrupt part at this point):

# Destroy SeaweedFS with corrupt volume
docker compose down -v s3fs
# Remove releases/artifacts from the database
docker compose exec mongo mongosh --eval 'deployments = db.getSiblingDB("deployment_service"); deployments.images.deleteMany({}); deployments.releases.deleteMany({})'
# Bring up the composition again
docker compose up -d

NTohan commented 1 month ago

Thank you for the new patch. Unfortunately, some functionality is still not working.

This will destroy all the data for your running instance. If you need an alternative, see below.

I even tried your changes after destroying all the data and re-creating the users on a test setup.

Please find the logs attached (_mender-gui-1_logs (4).txt), taken with your changes applied on top of the main branch at commit cd5f6108bb51345d12e67816f13ee1b4507c986c.

$ git diff
diff --git a/docker-compose.yml b/docker-compose.yml
index 16ae390..d6e6791 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -365,9 +365,16 @@ services:
     labels:
       traefik.enable: "true"
       traefik.http.routers.s3fs.priority: "99999"
-      traefik.http.routers.s3fs.rule: HostRegexp(`s3\..*`)
+      traefik.http.routers.s3fs.rule: PathPrefix(`/mender`)
       traefik.http.services.s3fs.loadBalancer.server.port: "8333"
-    command: [server -s3 -s3.config /etc/seaweedfs/s3.conf]
+    command:
+      - server
+      - -dir=/data
+      - -master.electionTimeout=1s
+      - -master.heartbeatInterval=250ms
+      - -master.raftHashicorp=true
+      - -s3
+      - -s3.config=/etc/seaweedfs/s3.conf
     healthcheck:
       test:
         - CMD
@@ -375,6 +382,8 @@ services:
         - "-z"
         - "127.0.0.1"
         - "8333"
+      start_period: 1m
+      start_interval: 1s
       retries: 10

   client:

alfrunes commented 1 month ago

Sorry about the mess. I thought I had fixed it (again), but upon further investigation I found that the SeaweedFS deployment sometimes would not start (it appeared to be stuck in a deadlock). I did a more extensive refactoring of the SeaweedFS (S3) deployment, which seems to make it a lot more reliable. Could you try my new PR #136? :crossed_fingers:

I cannot see anything suspicious in the gui service logs. However, I think the issues you're experiencing are not tied to this service but rather to one of the backend services. If my next PR doesn't fix it, it would be helpful to see the full logs (docker compose logs) including all services. Just make sure you redact any sensitive information (like your email, IP addresses, etc.) before you upload.
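
As a rough aid for the redaction step, here is a minimal Python sketch that masks email addresses and IPv4 addresses in log text read from standard input. The patterns are deliberately simple assumptions and will not catch every format (IPv6 addresses, tokens, and hostnames are not handled), so the output should still be reviewed by hand. The script name used below is hypothetical.

```python
import re
import sys

# Deliberately simple patterns (assumptions for illustration only).
# IPV4 also matches any dotted group of four numbers, e.g. some version strings.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(line: str) -> str:
    """Replace email addresses and IPv4 addresses in a log line with placeholders."""
    line = EMAIL.sub("<email>", line)
    return IPV4.sub("<ip>", line)


if __name__ == "__main__":
    # Stream stdin to stdout so large log dumps are not held in memory.
    for log_line in sys.stdin:
        sys.stdout.write(redact(log_line))
```

Saved as, say, `redact_logs.py`, it could be used as `docker compose logs --no-color | python redact_logs.py > logs-redacted.txt`.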

NTohan commented 1 month ago

Negative, your new container for SeaweedFS seems to hang.

Please find the logs attached: refac-seaweedfs.log

I am pointing to these changes:

$ git log
commit 5c078153058706360766b4e98256957cc019a43a (HEAD -> refac-seaweedfs, origin/refac-seaweedfs)

Cheers 👍

alfrunes commented 1 month ago

Hmm.. This time I'm not able to reproduce the issue, but the logdump makes it clear that this is an issue related to SeaweedFS. Did you try deleting all volumes and starting fresh?

I changed the raft implementation to lower the startup time (and it generally seems more mature); that could, however, mess up the old raft algorithm's state. If the problem persists, could you increase the log verbosity for the s3-master service:


diff --git a/compose/docker-compose.seaweedfs.yml b/compose/docker-compose.seaweedfs.yml
index 28ded674..1a119134 100644
--- a/compose/docker-compose.seaweedfs.yml
+++ b/compose/docker-compose.seaweedfs.yml
@@ -26,6 +26,7 @@ services:
   s3-master:
     image: chrislusf/seaweedfs
     command:
+      - -v=5
       - master
       - -mdir=/data
       - -ip=s3-master

NTohan commented 1 month ago

I’m pleased to report that the initial results look promising after tearing down the Docker composition and bringing everything back up. I’m now able to download artifacts manually, and the deployment process is working smoothly!

I’ll run a few more tests later, write up a summary, and then close the ticket.

@alfrunes Thank you so much for your dedication and excellent work. 🥇

alfrunes commented 1 month ago

Excellent! I'm glad to hear things are looking promising. Let me know how it goes.

Cheers!

NTohan commented 1 month ago

As I mentioned in my previous comment, the mender-server is successfully running behind a reverse proxy, and the artifacts deployment with my setup is functioning smoothly. Thank you once again for your prompt support in resolving the issue. 👍 🥇