[Bug]: The PostgreSqlContainer wait strategy is not reliable (non default configuration)

0xced commented 6 months ago

Testcontainers version

3.7.0

Using the latest Testcontainers version?

Yes

Host OS

Any

Host arch

Any

.NET version

8.0.100

Docker version

Docker version 20.10.21, build baeda1f

Docker info

Client:
 Context:    desktop-linux
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.9.1)
  compose: Docker Compose (Docker Inc., v2.13.0)
  dev: Docker Dev Environments (Docker Inc., v0.0.5)
  extension: Manages Docker extensions (Docker Inc., v0.2.16)
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
  scan: Docker Scan (Docker Inc., v0.22.0)

Server:
 Containers: 5
  Running: 0
  Paused: 0
  Stopped: 5
 Images: 78
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.49-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.772GiB
 Name: docker-desktop
 ID: NURJ:VHOB:QPS5:BPZK:FS36:CRYL:2G6T:42JR:44K4:6OVH:PXOO:WPRL
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 70
  Goroutines: 96
  System Time: 2024-01-22T20:42:41.925461134Z
  EventsListeners: 27
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false

What happened?

As already briefly mentioned in https://github.com/testcontainers/testcontainers-dotnet/pull/920#issuecomment-1666129246, the default wait strategy for PostgreSqlContainer might hang forever.

There are several ways to trigger this faulty behaviour.

Reusing a volume with WithVolumeMount("Testcontainers.PostgreSql.Data", "/var/lib/postgresql/data")
Reusing the container with WithReuse(true)

Starting / stopping the container multiple times in a row:

var postgreSqlContainer = new PostgreSqlBuilder().Build();
for (var i = 0; i < 3; i++)
{
await postgreSqlContainer.StartAsync();
await postgreSqlContainer.StopAsync();
}

The logs produced when reusing an existing volume looks like this:

PostgreSQL Database directory appears to contain a database; Skipping initialization
2024-01-22 22:15:39.834 UTC [1] LOG:  starting PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2024-01-22 22:15:39.834 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-01-22 22:15:39.834 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-01-22 22:15:39.834 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-01-22 22:15:39.837 UTC [29] LOG:  database system was shut down at 2024-01-22 22:15:26 UTC
2024-01-22 22:15:39.845 UTC [1] LOG:  database system is ready to accept connections

We can see that database system is ready to accept connections appears only once instead of twice which is expected by the wait strategy.

When reusing the container or starting/stopping it many times in a row, the logs will grow and database system is ready to accept connections will appear more than twice.

In both cases the UntilAsync method never returns true and starting the container hangs.

Relevant log output

No response

Additional information

I have already prepared a fix for this issue that I will submit shortly.

eddumelendez commented 6 months ago

Hi, I don't think the pg_isready command is reliable enough because as you just described the postgres service inside the container logs the message twice when there is no data volume and just once when there is a data volume.

Empty postgres container lifecycle:

start container
start postgres
stop postgres
start postgres

pg_isready will succeed after step 2 and the PostgreSQLContainer will mark the container as a ready to use, however, postgres will stop again and the upcoming connection can fail, producing flaky tests.

There could be another reasons such as lack of resources.

0xced commented 6 months ago

pg_isready will succeed after step 2

Actually it won't succeed because after step 2 postgres is only listening on a socket (not on TCP/IP) for the initialisation scripts to run. In my proposed fix (#1093) I explicitly pass --host localhost (https://github.com/0xced/testcontainers-dotnet/commit/3ec3b049fb59a13d7a9e45472514a43f48069f78#diff-41b47a2befec6ab5a260ecd159f1bee1c148a94e7b0a0bd91ecf27ffdaa9cc1bR60) so that pg_isready returns a non-zero exit code while postgres is only listening on a socket. It will return success only when postgres is listening on TCP/IP, i.e. is reachable from outside the container.

HofmeisterAn commented 6 months ago

so that pg_isready returns a non-zero exit code while postgres is only listening on a socket. It will return success only when postgres is listening on TCP/IP, i.e. is reachable from outside the container.

Interesting, this is probably what I did wrong in the past. I remember that I used pg_ready too, but removed it since it was not reliable. If this is true, using the binary is indeed a better approach. Do you have any further information here? I had a quick look into the manual, but this did not help.

When reusing the container or starting/stopping it many times in a row, the logs will grow and database system is ready to accept connections will appear more than twice.

This is also a good discovery, thanks. We need to keep this in mind and may need to update other modules (wait strategies) like MongoDB too.

0xced commented 6 months ago

Do you have any further information here?

I read the discussion on https://github.com/docker-library/postgres/issues/146 then looked at the container logs and figured out that passing --host localhost could probably be the answer. During my first attempt (without --host localhost), I was getting sporadic errors as pg_isready returned sometimes success too early. I figured out that the pg_isready command automatically tries to connect through the socket, unless a host is explicitly specified.

To go from sporadic errors to reliable errors, the container in PostgreSqlContainerTest can be constructed like this:

private readonly PostgreSqlContainer _postgreSqlContainer = new PostgreSqlBuilder()
    .WithResourceMapping("sleep 10"u8.ToArray(), "/docker-entrypoint-initdb.d/init.sh")
    .Build();

This gives pg_isready plenty of time (10 seconds). Without --host localhost the StopAndStartMultipleTimes (https://github.com/testcontainers/testcontainers-dotnet/pull/1093/files#diff-85f4c7a07d7dfdf360189467e1c1b1224f976a69abeb70d545301b73f1aadea6R48) test would always fail and with --host localhost this same test would always succeed.

This is also a good discovery, thanks. We need to keep this in mind and may need to update other modules (wait strategies) like MongoDB too.

Absolutely! The new test added in #1093 (StopAndStartMultipleTimes) can be easily adapted for any container since it merely starts and stops the container multiple times in a row, making the logs grow. But I wanted to keep this pull request focused on PostgreSQL.

0xced commented 5 months ago

The new test added in https://github.com/testcontainers/testcontainers-dotnet/pull/1093 (StopAndStartMultipleTimes) can be easily adapted for any container

I just did it and found 4 containers failing the start/stop test 3 times in a row.

CosmosDb
Couchbase
MongoDb
PostgreSql

I also found another possible area if improvement: seeing exactly which tests fail and their errors at a glance in GitHub with the Test Reporter action: https://github.com/0xced/testcontainers-dotnet/actions/runs/7637708663/job/20808723402

HofmeisterAn commented 5 months ago

I just did it and found 4 containers failing the start/stop test 3 times in a row.

Unfortunately, this probably does not cover everything (we need to be able to connect too). If we only use a log message (part of it), the wait strategy may indicate readiness too early. We need to pass the start time (since) to the method that gathers the container log messages. The Test Reporter is a good idea. I will need some time to look at everything; there are too many things I need to take care of besides OSS. Thanks a lot for the contribution and the efforts you put in 🙏. Much appreciated and superfast, as always 🏎️.

HofmeisterAn commented 5 months ago

The wait strategy was adjusted in #1111.

testcontainers / testcontainers-dotnet