[Bug]: Intermittant "The response ended prematurely" with Couchbase

mgroves commented 7 months ago

Testcontainers version

3.6.0

Using the latest Testcontainers version?

Yes

Host OS

Windows

Host arch

x64

.NET version

8.0

Docker version

Client:
 Cloud integration: v1.0.35+desktop.5
 Version:           24.0.6
 API version:       1.43
 Go version:        go1.20.7
 Git commit:        ed223bc
 Built:             Mon Sep  4 12:32:48 2023
 OS/Arch:           windows/amd64
 Context:           default

Server: Docker Desktop 4.25.2 (129061)
 Engine:
  Version:          24.0.6
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.7
  Git commit:       1a79695
  Built:            Mon Sep  4 12:32:16 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.22
  GitCommit:        8165feabfdfe38c65b599c4993d227328c231fca
 runc:
  Version:          1.1.8
  GitCommit:        v1.1.8-0-g82f18fe
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Docker info

Client:
 Version:    24.0.6
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2-desktop.5
    Path:     C:\Program Files\Docker\cli-plugins\docker-buildx.exe
  compose: Docker Compose (Docker Inc.)
    Version:  v2.23.0-desktop.1
    Path:     C:\Program Files\Docker\cli-plugins\docker-compose.exe
  dev: Docker Dev Environments (Docker Inc.)
    Version:  v0.1.0
    Path:     C:\Program Files\Docker\cli-plugins\docker-dev.exe
  extension: Manages Docker extensions (Docker Inc.)
    Version:  v0.2.20
    Path:     C:\Program Files\Docker\cli-plugins\docker-extension.exe
  init: Creates Docker-related starter files for your project (Docker Inc.)
    Version:  v0.1.0-beta.9
    Path:     C:\Program Files\Docker\cli-plugins\docker-init.exe
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
    Version:  0.6.0
    Path:     C:\Program Files\Docker\cli-plugins\docker-sbom.exe
  scan: Docker Scan (Docker Inc.)
    Version:  v0.26.0
    Path:     C:\Program Files\Docker\cli-plugins\docker-scan.exe
  scout: Docker Scout (Docker Inc.)
    Version:  v1.0.9
    Path:     C:\Program Files\Docker\cli-plugins\docker-scout.exe

Server:
 Containers: 3
  Running: 0
  Paused: 0
  Stopped: 3
 Images: 8
 Server Version: 24.0.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
 runc version: v1.1.8-0-g82f18fe
 init version: de40ad0
 Security Options:
  seccomp
   Profile: unconfined
 Kernel Version: 5.15.133.1-microsoft-standard-WSL2
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.76GiB
 Name: KraytDragon
 ID: c7775911-df14-4266-80e5-7513018d14f8
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5555
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
WARNING: daemon is not using the default seccomp profile

What happened?

Trying to use Couchbase testcontainers in an NUnit test. Here's the intialization:

var couchbaseContainer = new CouchbaseBuilder()
      .WithImage("couchbase:enterprise-7.2.2").Build();

wait couchbaseContainer.StartAsync();

This works fine around 50% of the time. The other 50% results in:

System.Net.Http.HttpRequestException : An error occurred while sending the request. ----> System.Net.Http.HttpIOException : The response ended prematurely. (ResponseEnded)

I've traced this to CouchbaseBuilder.cs -> ConfigureCouchbaseAsync method. The first httpClient.SendAsync is where the exception is happening. Maybe the earlier WaitUntilNodeIsReady in the method isn't enough, and the rename node endpoint isn't quite ready?

So, I wrapped it in a crude retry:

            var maxRetryAttempts = 3;
            var retryDelay = TimeSpan.FromSeconds(5);

            for (int attempt = 1; attempt <= maxRetryAttempts; attempt++)
            {
                try
                {
                    using (var request = new RenameNodeRequest(container))
                    {
                        using (var response = await httpClient.SendAsync(request, ct).ConfigureAwait(false))
                        {
                            await EnsureSuccessStatusCodeAsync(response).ConfigureAwait(false);
                            break; // Success, exit the loop
                        }
                    }
                }
                catch (Exception ex) when (attempt < maxRetryAttempts)
                {
                    // TODO: Log the exception

                    // Wait before retrying
                    await Task.Delay(retryDelay, ct);
                }
            }

And that seems to have stopped the exceptions. However, not being an expert in testcontainers, there might be something else I can do instead? Or maybe there's a better retry mechanism already in testcontainers that I can use instead of this? There are a bunch more HTTP requests in this method, maybe they could all benefit from a retry?

Any thoughts on this, or should I just submit a PR similar to this and go on with my life?

Relevant log output

Relevant log when successful:

[testcontainers.org 00:00:00.13] Connected to Docker:
  Host: npipe://./pipe/docker_engine
  Server Version: 24.0.6
  Kernel Version: 5.15.133.1-microsoft-standard-WSL2
  API Version: 1.43
  Operating System: Docker Desktop
  Total Memory: 7.76 GB
[testcontainers.org 00:00:00.32] Docker container a88bde86f517 created
[testcontainers.org 00:00:00.39] Start Docker container a88bde86f517
[testcontainers.org 00:00:00.76] Wait for Docker container a88bde86f517 to complete readiness checks
[testcontainers.org 00:00:00.77] Docker container a88bde86f517 ready
[testcontainers.org 00:00:00.89] Docker container 17237d653426 created
[testcontainers.org 00:00:00.91] Start Docker container 17237d653426
[testcontainers.org 00:00:02.52] Couchbase container is starting, performing configuration.
[testcontainers.org 00:00:25.54] Wait for Docker container 17237d653426 to complete readiness checks
[testcontainers.org 00:00:27.65] Docker container 17237d653426 ready
[testcontainers.org 00:00:27.65] Couchbase container is ready! UI available at http://127.0.0.1:37798/

Relevant log when failing:

System.Net.Http.HttpRequestException : An error occurred while sending the request.
  ----> System.Net.Http.HttpIOException : The response ended prematurely. (ResponseEnded)

Additional information

No response

HofmeisterAn commented 7 months ago

The Couchbase module has not been tested with the enterprise version or the minor version 7.2. I do not know if any APIs have changed. IIRC, Java's implementation distinguish between the community and enterprise edition. This is not the case for .NET (the enterprise version is not specifically supported). There might be even more configurations necessary to support the enterprise version.

Any thoughts on this, or should I just submit a PR similar to this and go on with my life?

As you may notice, the implementation of the Couchbase builder is one of the most complex ones we have in Testcontainers. Setting up Couchbase (getting it running) requires quite a few steps.

I do not believe this is specifically a Testcontainers issue. Have you seen this comment? You might be more familiar with Couchbase, its REST API, and various configuration methods, but this does not seem like a proper fix. I favor getting Couchbase into a proper state and not calling the REST APIs over and over again.

brantburnett commented 7 months ago

@HofmeisterAn

I generally agree adding retries is non-ideal. However, in my tangentially related project https://github.com/brantburnett/couchbasefakeit I found it was basically required. Couchbase tends to startup in pieces as various components come online and, unfortunately, there is no good way to know when some of these components are actually online. There is even an issue to track the limitation in general: https://issues.couchbase.com/browse/MB-11484

mgroves commented 7 months ago

Thank you @HofmeisterAn and @brantburnett for taking a look!

The Couchbase module has not been tested with the enterprise version or the minor version 7.2. I do not know if any APIs have changed.

Yes, you're right, sorry, I should have mentioned. I think this a red herring; I'm getting the same problem with 7.0.2. I don't think any of the relevant APIs have changed between 7.0.2 and 7.2.

Have you seen this comment?

For whatever reason, I didn't notice that, even though I was browsing around in that code area. I have a blind spot for comments sometimes 😓

I do not believe this is specifically a Testcontainers issue.

I didn't expect it to be a testcontainers issue, but I concur with @brantburnett in that the various parts of Couchbase come online in no particular order. I don't think there's any proper ordering of all the HTTP requests to the API that won't eventually lead to a similar condition.

Also, I should mention that I've tried all kinds of wait strategies before I even considered looking at the testcontainers code, to no avail. I believe those wait strategies get applied after the HTTP requests in question.

I'm open to other approaches, but after running integration tests all day today with my hack in place, I haven't had a single testcontainers exception. Certainly my hacky try/catch/loop can be improved upon with something like a Polly policy that can be reused for each HTTP request, or, based on my half-day of empirical data, just the first HTTP request.

Have a great weekend!

HofmeisterAn commented 7 months ago

Yes, you're right, sorry, I should have mentioned. I think this a red herring; I'm getting the same problem with 7.0.2. I don't think any of the relevant APIs have changed between 7.0.2 and 7.2.

No worries. It was just my first assumption that the version/edition is simply not compatible. I did some tests with both versions and editions and cannot reproduce it. But since I initially ran into this problem too, it looks like the comment I mentioned is likely a race condition (and it just makes it less likely to run into it).

I still try to understand and fix the root cause, especially because Java does not run into this issue (I guess it does not).

I believe those wait strategies get applied after the HTTP requests in question.

Yes, the wait strategies are executed after the start-up callback. The start-up callback initially configures the Couchbase node.

Could you check the actual HTTP response code (EnsureSuccessStatusCode)? Maybe that will help us, together with the Couchbase API docs, to understand what goes wrong. Perhaps it requires the retry logic only for certain HTTP status codes.

Of course, we can add the retry logic if necessary, but as mentioned, I prefer to fix the root cause and not fight symptoms.

mgroves commented 7 months ago

Seems like turning Fiddler on greatly reduces the amount of times that this bug happens! I can't really use .NET/debugging as far as I know, since I believe the request fails and throws an exception before even reaching EnsureSuccessStatusCodeAsync.

I tried about a dozen times, and wasn't able to catch the RenameNodeRequest (Naming a node) portion failing, but I did catch the next one in the sequence, SetupNodeServicesRequest (Assigning Services endpoint).

POST http://127.0.0.1:56042/node/controller/setupServices HTTP/1.1 Host: 127.0.0.1:56042 Content-Type: application/x-www-form-urlencoded Content-Length: 32

services=kv%2Cindex%2Cn1ql%2Cfts HTTP/1.1 504 Fiddler - Receive Failure Date: Mon, 04 Dec 2023 19:21:21 GMT Content-Type: text/html; charset=UTF-8 Connection: close Cache-Control: no-cache, must-revalidate Timestamp: 14:21:21.462

[Fiddler] ReadResponse() failed: The server did not return a complete response for this request. Server returned 0 bytes.

Full Fiddler report

It seems to be a "refused connection" situation. I'm not sure if there's any other way to handle this than a retry.

The REST API docs only show responses for 200, 401, 404, and 400.

Also, it doesn't seem like it's a timeout issue, since default is 100 seconds, I don't see that value being set anywhere in testcontainers code, and this 504 is failing well before 100 seconds.

HofmeisterAn commented 7 months ago

It seems to be a "refused connection" situation. I'm not sure if there's any other way to handle this than a retry.

I did not find any further information that helps with that issue either. I was thinking about the retry logic. What do you think about the following implementation below? Perhaps we can create an upstream issue too. I imagine we are not the only ones who run into this issue. However, I am still wondering if Java is affected too (but of course, there are a lot of timings involved).

private sealed class RetryHandler : DelegatingHandler
{
    private const int MaxRetries = 5;

    public RetryHandler()
        : base(new HttpClientHandler())
    {
    }

    protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        for (var _ = 0; _ < MaxRetries; _++)
        {
            try
            {
                return await base.SendAsync(request, cancellationToken)
                    .ConfigureAwait(false);
            }
            catch (HttpRequestException)
            {
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken)
                    .ConfigureAwait(false);
            }
        }

        throw new Exception($"Unable to configure Couchbase. The HTTP request '{request.RequestUri}' did not complete successfully.");
    }
}

This allows us to pass the RetryHandler to the HttpClient instance.

Edit: I think I forgot to mention that I would like to avoid adding additional dependencies such as Polly. I prefer to include as few dependencies as possible to minimize dependency clashes (version conflicts).

mgroves commented 7 months ago

That seems good to me, especially since it would just be one simple change to the code in ConfigureCouchbaseAsync. Would you be okay with putting Polly in there, and perhaps some kind of backoff, exponential or otherwise? (I understand if you don't want to add another dependency; I'm happy to create the PR either way). Just saw the edit about Polly :)

HofmeisterAn commented 7 months ago

Would you be okay with putting Polly in there

I have just updated my comment after you replied 😃. See my edit. If there is not a real need, I would prefer to avoid adding Polly.

mgroves commented 7 months ago

However, I am still wondering if Java is affected too

It appears to be the same issue in Java, and here's a PR to deal with it: https://github.com/testcontainers/testcontainers-java/pull/7049

testcontainers / testcontainers-dotnet