testcontainers / testcontainers-dotnet

A library to support tests with throwaway instances of Docker containers for all compatible .NET Standard versions.
https://dotnet.testcontainers.org
MIT License

[Enhancement]: Workaround for moby bug where random ports in a high range cause intermittent race conditions #1085

Closed: martimors closed this issue 6 months ago

martimors commented 6 months ago

Problem

It has previously been brought up multiple times across the Testcontainers SDKs (e.g. https://github.com/testcontainers/testcontainers-dotnet/issues/386) that the host port for a container can suddenly change right after the container has started. This causes tests to sometimes pass and sometimes fail, depending on which happens first: the port change or the connection to the container from the host.

The underlying issue seems to be https://github.com/moby/moby/issues/42442, which has been open since 2021 without being fixed. It only happens on some systems, not all.
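
For illustration, the pattern where the race typically bites looks roughly like this (a minimal sketch, not taken from any particular project; it assumes the PostgreSQL module that is also used in the workaround further down):

    // Minimal sketch of the failing pattern; requires the Testcontainers.PostgreSql package.
    using System.Net.Sockets;
    using Testcontainers.PostgreSql;

    var container = new PostgreSqlBuilder().Build();
    await container.StartAsync();

    // On affected hosts the port resolved here can already be stale, because the
    // daemon re-publishes the container on a different host port shortly after
    // start (moby/moby#42442).
    var host = container.Hostname;
    var port = container.GetMappedPublicPort(5432);

    using var client = new TcpClient();
    await client.ConnectAsync(host, port); // intermittently fails with "connection refused"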

Solution

Since many projects have already decided to work around this issue by using Testcontainers in a way that specifically selects an open host port instead of relying on an auto-assigned high random port (for example https://github.com/Kong/kubernetes-ingress-controller/pull/5005), I suggest adding a permanent workaround as a feature in the .NET SDK. It could be automatic behind the scenes, or opt-in.
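
A hedged sketch of what "specifically selecting an open port" could look like with the current builder API (the free-port helper is illustrative and not an existing SDK feature; there is also a small window in which another process could grab the port before the container binds it):

    // Sketch only: pre-select a free host port and bind it explicitly instead of
    // letting the daemon assign a random high port.
    using System.Net;
    using System.Net.Sockets;
    using Testcontainers.PostgreSql;

    // Ask the OS for a currently free TCP port, then release it again.
    static int GetFreeTcpPort()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        var port = ((IPEndPoint)listener.LocalEndpoint).Port;
        listener.Stop();
        return port;
    }

    var hostPort = GetFreeTcpPort();

    var container = new PostgreSqlBuilder()
        .WithPortBinding(hostPort, 5432) // fixed host port -> container port 5432
        .Build();

    await container.StartAsync();
    // The mapped port now equals hostPort and cannot silently change afterwards.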

Benefit

The project won't randomly fail tests on IPv6-enabled clients.

Alternatives

Probably, but I'm not too familiar with this project yet.

Would you like to help contribute this enhancement?

No

HofmeisterAn commented 6 months ago

Yes, this is a known issue. I'm curious where and how you encountered it. This problem should already be "resolved", considering this comment: https://github.com/docker/for-mac/issues/5588#issuecomment-934600089, including binding ports solely to the IPv4 host IP (respectively, resolving the right port for the right IPv4/IPv6 address). Since .NET considers the empty-string approach mentioned in the comment, port binding works reliably (AFAIK).

martimors commented 6 months ago

We are only seeing this on macOS clients. Here is the setup:

Client:
 Version:           24.0.7
 API version:       1.43
 Go version:        go1.21.3
 Git commit:        afdd53b4e3
 Built:             Thu Oct 26 07:06:42 2023
 OS/Arch:           darwin/arm64
 Context:           default

Server:
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff
  Built:            Thu Oct 26 09:08:29 2023
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.25
  GitCommit:        d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
 runc:
  Version:          1.1.10
  GitCommit:        v1.1.10-0-g18a0cb0
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

The differences between the macOS clients and the Windows + WSL clients that don't have any issues are:

Our GitHub Actions runners also don't have any issues; they run on the linux/amd64 architecture and Ubuntu.

Here is the current horrendous workaround:

        var _postgresSqlContainer = new PostgreSqlBuilder().Build();
        await _postgresSqlContainer.StartAsync();
        Thread.Sleep(1000); // Adding this line fixes the issue, because we then get the correct port for the container when connecting

HofmeisterAn commented 6 months ago

I believe this issue is related to Colima and does not occur with Docker (Desktop) + Testcontainers for .NET anymore. Testcontainers for .NET only binds and resolves the IPv4 address; it will not bind both IPv4 and IPv6, hence we never resolve a wrong port (at least I haven't seen an error since we considered the comment mentioned above, although the fix might clash with other container runtimes). The Node.js implementation takes an additional step by resolving the right port to the right IP endpoint (https://github.com/testcontainers/testcontainers-dotnet/issues/825).

Testcontainers is Docker API compatible. There might be behaviors in other runtimes that we do not support yet. There is an issue to extend container runtime support, specifically for Podman. I haven't tested Colima in a while. IIRC, I had it running successfully, but I don't remember encountering similar issues. I also remember that @cristianrgreco was working on Colima support (https://github.com/testcontainers/testcontainers-node/pull/531), and he encountered several issues.

  1. Colima port forwarding delays: Because of the way Colima works, a container port can be bound, but inaccessible for seconds until it is port forwarded. We therefore run into the strange case where a container can bind a port, write a log message to say that it's ready, and yet when we try to connect we get connection refused. This PR attempted to work around it by changing all wait strategies to be a composite of its existing wait strategy + port wait strategy. This has polluted the codebase with a very specific Colima issue, to the extent where tests for other wait strategies have to include the port wait strategy: https://github.com/abiosoft/colima/issues/71.
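
A rough .NET counterpart of that "existing wait strategy + port wait strategy" idea could be a custom wait strategy that probes the mapped port from the host. This is only a sketch under the assumption that the IWaitUntil/AddCustomWaitStrategy hooks behave as documented, and note that WithWaitStrategy replaces the PostgreSQL module's default readiness check rather than extending it:

    // Sketch: host-side port probe as a custom wait strategy (assumes the documented
    // IWaitUntil / AddCustomWaitStrategy API; not taken from this issue).
    using System.Net.Sockets;
    using System.Threading.Tasks;
    using DotNet.Testcontainers.Builders;
    using DotNet.Testcontainers.Configurations;
    using DotNet.Testcontainers.Containers;
    using Testcontainers.PostgreSql;

    var container = new PostgreSqlBuilder()
        .WithWaitStrategy(Wait.ForUnixContainer().AddCustomWaitStrategy(new HostPortIsReachable(5432)))
        .Build();

    await container.StartAsync();

    // Probes the host-mapped port and re-resolves it on every attempt, so a late
    // port change or a delayed port forward is eventually picked up.
    public sealed class HostPortIsReachable : IWaitUntil
    {
        private readonly int _containerPort;

        public HostPortIsReachable(int containerPort) => _containerPort = containerPort;

        public async Task<bool> UntilAsync(IContainer container)
        {
            try
            {
                using var client = new TcpClient();
                await client.ConnectAsync(container.Hostname, container.GetMappedPublicPort(_containerPort));
                return true;
            }
            catch (SocketException)
            {
                return false;
            }
        }
    }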

Instead of blocking the entire thread, you can probably utilize the startup callback API:

WithStartupCallback((_, _) => Task.Delay(TimeSpan.FromSeconds(3)))
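
For context, a minimal sketch of how that callback might be wired into the builder from the earlier snippet (the three-second delay is as arbitrary as the original Thread.Sleep; a real readiness probe is the more robust option):

    // Sketch: the suggested startup callback attached to the PostgreSQL builder.
    using System;
    using System.Threading.Tasks;
    using Testcontainers.PostgreSql;

    var container = new PostgreSqlBuilder()
        .WithStartupCallback((_, ct) => Task.Delay(TimeSpan.FromSeconds(3), ct))
        .Build();

    await container.StartAsync();

    // Resolve the connection details only after the callback has completed.
    var connectionString = container.GetConnectionString();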

Unfortunately, I don't have access to a Mac anymore. TBH, I'm not sure how to proceed from here. Increasing support for other container runtimes is welcome, but I don't think I can offer much help here.

martimors commented 6 months ago

That's unfortunate, but it makes a lot of sense to me. Thank you for the thorough and well-researched reply; I think it will be useful for many people bewildered by this issue. I might just try out Rancher Desktop for macOS to see if it behaves a bit nicer.