Closed cvetomir-todorov closed 4 months ago
Thank you for sharing the issue. Could you please debug into the wait strategy and check the response information (stdout, stderr) in the ExecResult response? This should give us a better understanding of why it fails or does not succeed in the first couple of tries. Please be aware that, in general, it is not a good idea to rely on the port. Usually, ports are available before the actual service is running, which results in flakiness.
@HofmeisterAn thanks for your quick reply. I got the following output repeatedly. The file with the output is 738 lines long, so the message repeats roughly 367 times. I added line breaks here and there so it can be read without horizontal scrollbars, but bear in mind this was a single line.
OCI runtime exec failed:
exec failed: unable to start container process:
exec: "true && (grep -i ':0*2352' /proc/net/tcp* || nc -vz -w 1 localhost 9042 || /bin/bash -c '</dev/tcp/localhost/9042')":
stat true && (grep -i ':0*2352' /proc/net/tcp* || nc -vz -w 1 localhost 9042 || /bin/bash -c '</dev/tcp/localhost/9042'):
no such file or directory: unknown
EDIT:
When I run docker exec -it <container-name> bash and execute
true && (grep -i ':0*2352' /proc/net/tcp* || nc -vz -w 1 localhost 9042 || /bin/bash -c '</dev/tcp/localhost/9042')
the result is
/proc/net/tcp: 1: 00000000:2352 00000000:0000 0A 00000000:00000000 00:00000000 00000000 999 0 7267910 1 0000000000000000 100 0 0 10 0
So I am not sure at the moment where this error is coming from.
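For readers wondering how the grep pattern relates to port 9042: /proc/net/tcp lists ports in hexadecimal, and the fourth column is the socket state, where 0A means LISTEN. The following is a minimal sketch of that decoding. The variable names are illustrative only, and the sample row is shortened from the output quoted above.

```shell
# Convert the decimal CQL port to the hex form used in /proc/net/tcp.
port=9042
hex_port=$(printf '%04X' "$port")
echo "$hex_port"   # prints 2352

# Shortened sample row taken from the output quoted above.
sample='1: 00000000:2352 00000000:0000 0A'

# Fields: slot, local address:port, remote address:port, state.
local_addr=$(echo "$sample" | awk '{print $2}')
state=$(echo "$sample" | awk '{print $4}')

# 0A is the kernel's code for a listening TCP socket.
if [ "${local_addr##*:}" = "$hex_port" ] && [ "$state" = "0A" ]; then
  echo "port $port is in LISTEN state"
fi
```

So the row above says that something inside the container is listening on 0.0.0.0:9042 at the moment the command was run by hand.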
Given the initial text of the error (exec failed: unable to start container process:), there is probably something incompatible with the way commands are being executed in the container. Am I on the right path? If so, how could I fix that?
Or is it related to "no such file or directory: unknown"? I assume the name should be different though...
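One way to read the error text: the quoted string after exec: is the whole command line, which suggests it was treated as a single executable name rather than being parsed by a shell. This is an interpretation of the message, not a confirmed diagnosis. The sketch below reproduces the same effect locally with a harmless stand-in command; the variable names are illustrative.

```shell
# A command line that relies on shell syntax (&&), like the one in the error.
cmd='true && echo reachable'

# Run as ONE word: the shell looks for a program literally named
# "true && echo reachable", which does not exist, so this fails.
if "$cmd" 2>/dev/null; then
  direct_status=0
else
  direct_status=$?
fi

# Handed to a shell via -c: the string is parsed, so && works as intended.
shell_output=$(sh -c "$cmd")
shell_status=$?

echo "direct exit status: $direct_status"
echo "sh -c exit status: $shell_status, output: $shell_output"
```

The same distinction applies to docker exec: the interactive bash session parses && and || itself, whereas a command string passed as a single argument would need to be wrapped, for example as sh -c '<command>'.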
I immediately test it using
nc -vz -w 1 localhost 9042
which is part of this particular wait strategy implementation.
To make sure I have not misunderstood anything, could you clarify where you ran nc? Was it inside the container or from your host? I briefly checked, and it seems nc is not available in the image (container):
/bin/sh: 1: nc: not found
If you ran it from your host (which I assume you did), then there is no issue. The host port that forwards the connection to the container is likely available much earlier than the port inside the container.
It takes approximately a minute until I see the following log message from the container:
Starting listening for CQL clients on /0.0.0.0:9042
This aligns with what you are observing. I would recommend using the log message wait strategy here or reviewing Java's implementation and aligning with their wait strategy.
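To illustrate the idea behind waiting for a log message instead of a port, here is a minimal shell sketch. It is not the Testcontainers API: wait_for_log is a hypothetical helper that re-runs a log-producing command once per second until a marker string appears or a timeout expires.

```shell
# Hypothetical helper, not part of Testcontainers or Docker.
# Usage: wait_for_log <timeout-seconds> <marker> <command...>
wait_for_log() {
  timeout_s=$1
  marker=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    # Run the log command and search stdout and stderr for the fixed string.
    if "$@" 2>&1 | grep -qF -- "$marker"; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  # The marker never appeared within the timeout.
  return 1
}
```

A possible invocation would be wait_for_log 120 'Starting listening for CQL clients' docker logs <container-name>, with a timeout generous enough to cover the roughly one-minute startup mentioned above.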
@HofmeisterAn yep, you are correct about how I was running nc, since I also saw that it is not present in the image. I thought that if there is a port binding, then checking for the port would be delegated as well, but it seems that is not how it works.
I naively expected some timeout to play a role, since things started working after 1 min. The idea of syncing my expectations with the Cassandra logs didn't occur to me.
I didn't know there is a Java implementation, so I am going to search for it and check it out to see if something can be borrowed. Thanks for your invaluable input and know-how about making stuff work with Testcontainers. For my part, I think the issue should be closed for now.
@HofmeisterAn I investigated the Java implementation which executes a command against Cassandra. Then I applied a log message wait strategy, as advised, and a new strategy which executes a command against the database (simply checks for the existence of a Cassandra keyspace). Two strategies in succession.
This works fine locally, but when I run the code in GitHub Actions I get the Cassandra-specific NoHostAvailableException, which is self-explanatory. I am using container.Hostname from within the wait strategy, which, based on the logs, resolves to 127.0.0.1. Having read the documentation I thought it is not advisable to use such values. Are there any GitHub Actions-specific issues related to running containers? Is there a way to troubleshoot this? I couldn't find anything specific in the existing issues in this repo, but if I have missed something, could you let me know?
The code is in this file here: https://github.com/cvetomir-todorov/CecoChat/blob/test-chats-service/source/CecoChat.Chats.Testing/TestContainers.cs
The Github Actions workflow is here: https://github.com/cvetomir-todorov/CecoChat/actions/runs/9269353841 (the error logs could be more easily accessible by viewing the failing step from the bottom)
Having read the documentation I thought it is not advisable to use such values.
It is not recommended to use a constant or a fixed value like 127.0.0.1. Depending on the container runtime and configuration, the host may differ. Testcontainers takes care of this by resolving the correct host. For GitHub, 127.0.0.1 is correct.
Are there any GitHub Actions-specific issues related to running containers?
No. All our tests run on GitHub. Many of my projects run on Azure DevOps, which basically uses the same agents.
Is there a way to troubleshoot this? I couldn't find anything specific in the existing issues in this repo, but if I have missed something, could you let me know?
Your wait strategy configuration looks incorrect. You are overriding the first one with the second. It should be the following configuration instead (chained):
.WithWaitStrategy(Wait.ForUnixContainer()
.UntilMessageIsLogged("Starting listening for CQL clients on /0.0.0.0:9042")
.UntilCassandraQueryExecuted(port, localDc))
Please consider a longer timeout as well, to ensure it is not just the slow agent. Starting the container on my beefy machine already takes a minute. The pipeline needs to pull the image too.
If it still fails after these adjustments, I would suggest adding a stopwatch to measure how long it takes and exporting the container logs before disposing the container (to ensure that Cassandra (the service) is really running):
var (stdout, stderr) = await _cassandra.GetLogsAsync();
Furthermore, consider using random host ports to avoid port clashes. You never know which services are running on the build agent (or other machines) and occupying ports in your range.
@HofmeisterAn thanks for sharing the advice:
tried [::1]:46021: SocketException 'Connection refused'
as if no one is listening
WithPortBinding(_cassandraHostPort, cassandraContainerPort)
stdout spits out Starting listening for CQL clients on /0.0.0.0:9042
Docker container <id> ready
Any way to check if the container is actually running instead of just crashed/stopped? When I run the code on my machine I do not get Connection refused but rather Transport endpoint is not connected before it eventually succeeds. I'd suppose stdout/stderr would contain info on why the container would crash/stop, but it's identical to when I run the code locally...
- but still the error that I get consistently is
tried [::1]:46021: SocketException 'Connection refused'
Connect via IPv4.
@HofmeisterAn not only did that solve the issue, but it also helped me see flawed logic in the Cassandra client wrapper I had written. Now the test is finally green. Thanks!
As for the TC API allowing overwriting the wait strategy - isn't a defensive approach better? I mean telling the client code that overwriting it is not OK by throwing an exception for example?
I mean telling the client code that overwriting it is not OK by throwing an exception for example?
Do you mean throwing an exception when a wait strategy is already configured? TBH, I never thought about that, but I think it wouldn't work very well with our module approach. Testcontainers' modules are pre-configured following best practice configurations, which includes the wait strategy. There might be cases where developers want to override the default configuration for whatever reason. Maybe because they use a custom image that requires it - I am not sure.
Testcontainers version
3.8.0
Using the latest Testcontainers version?
Yes
Host OS
Ubuntu 22.04
Host arch
x64
.NET version
8.0.300
Docker version
Docker info
What happened?
I am starting Cassandra NoSQL database using the following code in my tests:
It always takes around 58 seconds for StartAsync to complete. In the meantime the container has been started and the port has been open almost immediately after the call. After I execute the above code I immediately test it using nc -vz -w 1 localhost 9042, which is part of this particular wait strategy implementation. The response is: Connection to localhost (127.0.0.1) 9042 port [tcp/*] succeeded! In the log output below you can see the log from Testcontainers. It stubbornly repeats the commands again and again without detecting that the nc command should actually succeed. I tried running my tests without the wait strategy, but the first test fails since the port isn't really open yet.
Relevant log output
Additional information
No response