testcontainers / moby-ryuk

Schedule Moby/Docker containers cleanup after specific delay.
https://www.testcontainers.com
MIT License
182 stars 33 forks source link

Add HEALTHCHECK to Linux Docker image. #128

Closed emetsger closed 3 months ago

emetsger commented 5 months ago

Summary

I ran across an issue using testcontainers in golang, specifically when testcontainers are used by multiple test packages in a single project. Go builds a test binary for each package and runs them simultaneously.

Example project tree, where test binaries for packages a, b, c, and d will run simultaneously, attempting to use a testcontainer:

.
├── a
│   └── a_test.go
├── b
│   └── b_test.go
├── c
│   └── c_test.go
├── d
│   └── d_test.go
├── go.mod
└── go.sum

One of the test binaries will win the race to create the ryuk container, and it becomes present in the list of docker containers returned by the Docker API.

The problem is that it the ryuk container can be exposed by the Docker API before the container is ready to accept connections. A testcontainer client can attempt to connect to the container before it is accepting connections.

This results in error messages tests like:

2024/04/21 22:25:10 🔥 Reaper obtained from Docker for this test session 99998b629eea57023f49aefc9eda5406c84ce4c91d0c6005c1b8c00dcbbce0d7
2024/04/21 22:25:10 failed to start container: dial tcp 127.0.0.1:32917: connect: connection refused: Connecting to Ryuk on localhost:32917 failed: connecting to reaper failed: failed to create container

Clone this test case repo and run go test -count=1 -v ./... and observe the errors.

Run go test -count=1 -p 1 -v ./... and tests should pass.

Prior issues

Race conditions (and the connection refused error message) have been the topic of previous issues, c.f.

But this PR addresses the situation described here by adding a simple healthcheck using netstat (already present in the alpine image).

Future Work

Once healthcheck information is exposed by the ryuk container, the lookupReaperContainer func can retry until the status is healthy.

I'm not on a Windows platform, and I'm not familiar with nanoserver and whether it comes with Powershell, and whether a command like netstat -an | Select-String 8080 will work. This PR only adds a HEALTHCHECK for the linux platform.

Unintended Consequences

Adding a healthcheck to ryuk may increase its perceived startup time.

Example Healthcheck Output

Before healthcheck (current state, i.e. main):

        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 25660,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2024-04-22T01:27:19.276032615Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },

After healthcheck (this PR):

        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 1703,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2024-04-22T01:46:01.038259025Z",
            "FinishedAt": "0001-01-01T00:00:00Z",
            "Health": {
                "Status": "healthy",
                "FailingStreak": 0,
                "Log": [
                    {
                        "Start": "2024-04-22T01:46:03.039126401Z",
                        "End": "2024-04-22T01:46:03.102456526Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2024-04-22T01:46:05.107285902Z",
                        "End": "2024-04-22T01:46:05.151179319Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2024-04-22T01:46:07.154951153Z",
                        "End": "2024-04-22T01:46:07.243256403Z",
                        "ExitCode": 0,
                        "Output": ""
                    },
                    {
                        "Start": "2024-04-22T01:46:09.247204821Z",
                        "End": "2024-04-22T01:46:09.319083404Z",
                        "ExitCode": 0,
                        "Output": ""
                    }
                ]
            }
        },
emetsger commented 4 months ago

Does Go create a single Ryuk instance for all packages?

Yes, exactly.

Does one package connect successfully and the others fail? Does it retry to connect if it fails?

One package typically connects successfully. With the other packages, it's a race: if they get the container before it's ready to accept connections, they fail.

In my testing, all remaining packages fail. But I can imagine in a larger codebase, some packages would succeed.

How does .NET work: is it a single ryuk instance? For example, can you launch two tests in parallel and re-create the behavior (issue)?

emetsger commented 4 months ago

With respect to the languages/platforms supported by testcontainers:

If other platforms are affected by this same issue, then they'll need a patch similar to https://github.com/testcontainers/testcontainers-go/pull/2508, which actually uses the results of the healthcheck before returning a (healthy) ryuk container.

HofmeisterAn commented 4 months ago

How does .NET work: is it a single ryuk instance? For example, can you launch two tests in parallel and re-create the behavior (issue)?

Thanks for the quick explanation. .NET behaves different here. It creates a Ryuk instance for each test assembly (test project). Test assemblies are kind of isolated from each other. We do not need to deal with synchronization here.

emetsger commented 4 months ago

I did encounter some things that I can't explain during my investigation, so I wanted to add them here for full transparency.

TL;DR: I failed to identify which component was denying the connection.


I have an overly-simple mental model of how Docker API+Docker Engine work together with their containers, so these issues could just be my ignorance.

The error message indicates the connection was denied, not a timeout.

So who denied the connection? The Docker engine? It wasn't the ryuk container (see below).

I did not observe the denied TCP connection

When I was troubleshooting, I used tcpdump - on the docker vm itself - to obtain traffic destined for ryuk. I could see successful connections/traffic, but I never saw a three-way handshake attempt that was denied.

mdelapenya commented 4 months ago

Out of curiosity and to understand and compare the behavior (issue) with the .NET implementation: Does Go create a single Ryuk instance for all packages? Does one package connect successfully and the others fail? Does it retry to connect if it fails?

hey @HofmeisterAn, yes, Go creates one Ryuk container per test session, which implies some synchronisation logic to check if the ryuk container is in the running state before creating a new one.

@emetsger for the repro repository, I cloned it and it works for me with both values of -p. I tried running it with Docker Desktop and Testcontainers Cloud, with the same result:

Using all processors

=== RUN   TestA
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
  API Version: 1.43
  Operating System: Ubuntu 22.04.4 LTS
  Total Memory: 15779 MB
  Resolved Docker Host: tcp://127.0.0.1:62301
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
  Test ProcessID: f234fbe5-4793-4cf1-923b-46edffa8668b
2024/05/09 07:32:17 🐳 Creating container for image testcontainers/ryuk:0.7.0
2024/05/09 07:32:17 ✅ Container created: 783b9d4e2406
2024/05/09 07:32:17 🐳 Starting container: 783b9d4e2406
2024/05/09 07:32:18 ✅ Container started: 783b9d4e2406
2024/05/09 07:32:18 🚧 Waiting for container id 783b9d4e2406 image: testcontainers/ryuk:0.7.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/05/09 07:32:18 🔔 Container is ready: 783b9d4e2406
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: e0cb0246b173
2024/05/09 07:32:18 🐳 Starting container: e0cb0246b173
2024/05/09 07:32:19 ✅ Container started: e0cb0246b173
2024/05/09 07:32:19 🚧 Waiting for container id e0cb0246b173 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003615b8 Strategies:[0x1400037d9b0]}
2024/05/09 07:32:21 🔔 Container is ready: e0cb0246b173
2024/05/09 07:32:21 🐳 Terminating container: e0cb0246b173
2024/05/09 07:32:21 🚫 Container terminated: e0cb0246b173
--- PASS: TestA (4.10s)
PASS
ok      tc-debug/a  4.281s
=== RUN   TestB
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
  API Version: 1.43
  Operating System: Ubuntu 22.04.4 LTS
  Total Memory: 15779 MB
  Resolved Docker Host: tcp://127.0.0.1:62301
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
  Test ProcessID: 947a4325-eb63-4695-840c-87c38885c7ba
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: 8dba32be8360
2024/05/09 07:32:18 🐳 Starting container: 8dba32be8360
2024/05/09 07:32:18 ✅ Container started: 8dba32be8360
2024/05/09 07:32:18 🚧 Waiting for container id 8dba32be8360 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003d15d0 Strategies:[0x140003ed9e0]}
2024/05/09 07:32:21 🔔 Container is ready: 8dba32be8360
2024/05/09 07:32:21 🐳 Terminating container: 8dba32be8360
2024/05/09 07:32:21 🚫 Container terminated: 8dba32be8360
--- PASS: TestB (3.79s)
PASS
ok      tc-debug/b  4.237s
=== RUN   TestC
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
  API Version: 1.43
  Operating System: Ubuntu 22.04.4 LTS
  Total Memory: 15779 MB
  Resolved Docker Host: tcp://127.0.0.1:62301
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
  Test ProcessID: f38b47e6-f595-404a-8cb0-2534fce5916a
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: 614b2727db5c
2024/05/09 07:32:18 🐳 Starting container: 614b2727db5c
2024/05/09 07:32:18 ✅ Container started: 614b2727db5c
2024/05/09 07:32:18 🚧 Waiting for container id 614b2727db5c image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x14000393698 Strategies:[0x140003b39b0]}
2024/05/09 07:32:21 🔔 Container is ready: 614b2727db5c
2024/05/09 07:32:21 🐳 Terminating container: 614b2727db5c
2024/05/09 07:32:21 🚫 Container terminated: 614b2727db5c
--- PASS: TestC (3.88s)
PASS
ok      tc-debug/c  4.201s
=== RUN   TestD
2024/05/09 07:32:18 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
  API Version: 1.43
  Operating System: Ubuntu 22.04.4 LTS
  Total Memory: 15779 MB
  Resolved Docker Host: tcp://127.0.0.1:62301
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
  Test ProcessID: 5d324800-b68a-4687-90d1-b1216078c8f9
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: a533d8ad9249
2024/05/09 07:32:18 🐳 Starting container: a533d8ad9249
2024/05/09 07:32:18 ✅ Container started: a533d8ad9249
2024/05/09 07:32:18 🚧 Waiting for container id a533d8ad9249 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003f95d0 Strategies:[0x140004659e0]}
2024/05/09 07:32:21 🔔 Container is ready: a533d8ad9249
2024/05/09 07:32:21 🐳 Terminating container: a533d8ad9249
2024/05/09 07:32:21 🚫 Container terminated: a533d8ad9249
--- PASS: TestD (3.83s)
PASS
ok      tc-debug/d  4.425s
lachauj commented 4 months ago

for the repro repository, I cloned it and it works for me with both values of -p. I tried running it with Docker Desktop and Testcontainers Cloud, with the same result:

@mdelapenya I have the same result with testcontainers-go and Docker Desktop, but when I run Docker with Colima I randomly have the issue when TestContainers wants to connect to Ryuk (95% of the time it triggers the issue)

mdelapenya commented 4 months ago

Pinging @eddumelendez and @kiview as this is a cross-lang feature.

kiview commented 4 months ago

Can we decouple the switch to using a HEALTHCHECK from fixing the race condition? While I am not against using a HEALTHCHECK here per-se, I would prefer to first fix the race condition (this should be possible with the current waiting logic, right?). Also, I am fairly certain, that HEALTHCHECK is not supported by all potential container runtimes (e.g. Podman). While we generally aim for first-class support of Docker, adding a substantial regression for alternative container runtimes might send a bad signal to the community, especially if it can be avoided. And last, we should benchmark the performance impact of switching to HEALTCHECK as well (if we decide to use it).

@HofmeisterAn I guess we didn't fully sync up on the current approach to test session semantics (and technical measures to achieve them) across languages, back when we introduced this in Go (and AFAIK in Node as well). From a technical PoV, Go and .NET should be in a similar situation.

mdelapenya commented 3 months ago

Closing as per https://github.com/testcontainers/testcontainers-go/pull/2508#issuecomment-2168475784

Thanks!