Closed: 0xced closed this issue 5 months ago.
Hi, I don't think the pg_isready command is reliable enough because, as you just described, the postgres service inside the container logs the message twice when there is no data volume and just once when there is a data volume.
Empty postgres container lifecycle:
pg_isready will succeed after step 2 and the PostgreSQLContainer will mark the container as ready to use. However, postgres will stop again and the upcoming connection can fail, producing flaky tests.
There could be other reasons too, such as a lack of resources.
> pg_isready will succeed after step 2
Actually it won't succeed because after step 2 postgres is only listening on a socket (not on TCP/IP) for the initialisation scripts to run. In my proposed fix (#1093) I explicitly pass --host localhost (https://github.com/0xced/testcontainers-dotnet/commit/3ec3b049fb59a13d7a9e45472514a43f48069f78#diff-41b47a2befec6ab5a260ecd159f1bee1c148a94e7b0a0bd91ecf27ffdaa9cc1bR60) so that pg_isready returns a non-zero exit code while postgres is only listening on a socket. It will return success only when postgres is listening on TCP/IP, i.e. is reachable from outside the container.
> so that pg_isready returns a non-zero exit code while postgres is only listening on a socket. It will return success only when postgres is listening on TCP/IP, i.e. is reachable from outside the container.
Interesting, this is probably what I did wrong in the past. I remember that I used pg_isready too, but removed it since it was not reliable. If this is true, using the binary is indeed a better approach. Do you have any further information here? I had a quick look into the manual, but this did not help.
> When reusing the container or starting/stopping it many times in a row, the logs will grow and "database system is ready to accept connections" will appear more than twice.
This is also a good discovery, thanks. We need to keep this in mind and may need to update other modules (wait strategies) like MongoDB too.
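To make the quoted problem concrete, here is a minimal sketch of a check that expects the ready message exactly twice. The class and method names are hypothetical and the counting logic is an assumption about how such a check might look, not the library's actual code. It shows that the check only passes for the fresh-container case and fails both with an existing data volume (one occurrence) and after restarts (more than two).

```csharp
using System;
using System.Linq;

public static class ReadyMessageCheck
{
    private const string ReadyMessage = "database system is ready to accept connections";

    // Hypothetical helper: true only when the message appears exactly `expected` times.
    public static bool HasExactCount(string logs, int expected = 2)
    {
        var count = logs
            .Split('\n')
            .Count(line => line.Contains(ReadyMessage));
        return count == expected;
    }

    public static void Main()
    {
        var line = "LOG:  " + ReadyMessage;

        var fresh = string.Join("\n", line, line);            // no data volume: logged twice
        var existingVolume = line;                            // existing data volume: logged once
        var restarted = string.Join("\n", line, line, line);  // restarts: the logs keep growing

        Console.WriteLine(HasExactCount(fresh));          // True
        Console.WriteLine(HasExactCount(existingVolume)); // False
        Console.WriteLine(HasExactCount(restarted));      // False
    }
}
```

Any wait strategy built on an exact occurrence count would never report readiness in the last two cases, which matches the hang described in the issue.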
> Do you have any further information here?
I read the discussion on https://github.com/docker-library/postgres/issues/146, then looked at the container logs and figured out that passing --host localhost could probably be the answer. During my first attempt (without --host localhost), I was getting sporadic errors as pg_isready sometimes returned success too early. I figured out that the pg_isready command automatically tries to connect through the socket, unless a host is explicitly specified.
To go from sporadic errors to reliable errors, the container in PostgreSqlContainerTest can be constructed like this:
private readonly PostgreSqlContainer _postgreSqlContainer = new PostgreSqlBuilder()
    // Init script that sleeps for 10 seconds before initialisation can complete.
    .WithResourceMapping("sleep 10"u8.ToArray(), "/docker-entrypoint-initdb.d/init.sh")
    .Build();
This gives pg_isready plenty of time (10 seconds) to report success too early, while the init script is still running.
Without --host localhost the StopAndStartMultipleTimes (https://github.com/testcontainers/testcontainers-dotnet/pull/1093/files#diff-85f4c7a07d7dfdf360189467e1c1b1224f976a69abeb70d545301b73f1aadea6R48) test would always fail and with --host localhost this same test would always succeed.
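For readers who want to try the same idea without the change from #1093, a wait strategy along these lines can be configured on the builder. This is a sketch only: it assumes the general-purpose UntilCommandIsCompleted wait strategy from Testcontainers for .NET, the factory class name is hypothetical, and it is not the code from the pull request.

```csharp
using DotNet.Testcontainers.Builders;
using Testcontainers.PostgreSql;

// Hypothetical helper class, used only to keep the sketch self-contained.
public static class PostgreSqlContainerFactory
{
    public static PostgreSqlContainer Create()
    {
        return new PostgreSqlBuilder()
            // --host localhost makes pg_isready connect over TCP/IP instead of the
            // socket, so the command keeps failing while the init scripts are running.
            .WithWaitStrategy(Wait.ForUnixContainer()
                .UntilCommandIsCompleted("pg_isready", "--host", "localhost"))
            .Build();
    }
}
```

The command is executed inside the container and retried until it exits with code 0, so readiness no longer depends on how many times a message appears in the logs.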
> This is also a good discovery, thanks. We need to keep this in mind and may need to update other modules (wait strategies) like MongoDB too.
Absolutely! The new test added in #1093 (StopAndStartMultipleTimes) can be easily adapted for any container since it merely starts and stops the container multiple times in a row, making the logs grow. But I wanted to keep this pull request focused on PostgreSQL.
> The new test added in https://github.com/testcontainers/testcontainers-dotnet/pull/1093 (StopAndStartMultipleTimes) can be easily adapted for any container
I just did it and found 4 containers failing the start/stop test 3 times in a row.
I also found another possible area of improvement: seeing exactly which tests fail and their errors at a glance in GitHub with the Test Reporter action: https://github.com/0xced/testcontainers-dotnet/actions/runs/7637708663/job/20808723402
> I just did it and found 4 containers failing the start/stop test 3 times in a row.
Unfortunately, this probably does not cover everything (we need to be able to connect too). If we only use a log message (part of it), the wait strategy may indicate readiness too early. We need to pass the start time (since) to the method that gathers the container log messages. The Test Reporter is a good idea. I will need some time to look at everything; there are too many things I need to take care of besides OSS. Thanks a lot for the contribution and the efforts you put in 🙏. Much appreciated and superfast, as always 🏎️.
The wait strategy was adjusted in #1111.
Testcontainers version
3.7.0
Using the latest Testcontainers version?
Yes
Host OS
Any
Host arch
Any
.NET version
8.0.100
Docker version
Docker info
What happened?
As already briefly mentioned in https://github.com/testcontainers/testcontainers-dotnet/pull/920#issuecomment-1666129246, the default wait strategy for PostgreSqlContainer might hang forever. There are several ways to trigger this faulty behaviour.
WithVolumeMount("Testcontainers.PostgreSql.Data", "/var/lib/postgresql/data")
WithReuse(true)
The logs produced when reusing an existing volume look like this:
We can see that "database system is ready to accept connections" appears only once instead of twice, which is what the wait strategy expects. When reusing the container or starting/stopping it many times in a row, the logs will grow and "database system is ready to accept connections" will appear more than twice. In both cases the UntilAsync method never returns true and starting the container hangs.
Relevant log output
No response
Additional information
I have already prepared a fix for this issue that I will submit shortly.