Closed mgroves closed 7 months ago
The Couchbase module has not been tested with the enterprise version or the minor version 7.2. I do not know if any APIs have changed. IIRC, Java's implementation distinguish between the community and enterprise edition. This is not the case for .NET (the enterprise version is not specifically supported). There might be even more configurations necessary to support the enterprise version.
Any thoughts on this, or should I just submit a PR similar to this and go on with my life?
As you may notice, the implementation of the Couchbase builder is one of the most complex ones we have in Testcontainers. Setting up Couchbase (getting it running) requires quite a few steps.
I do not believe this is specifically a Testcontainers issue. Have you seen this comment? You might be more familiar with Couchbase, its REST API, and various configuration methods, but this does not seem like a proper fix. I favor getting Couchbase into a proper state and not calling the REST APIs over and over again.
@HofmeisterAn
I generally agree adding retries is non-ideal. However, in my tangentially related project https://github.com/brantburnett/couchbasefakeit I found it was basically required. Couchbase tends to startup in pieces as various components come online and, unfortunately, there is no good way to know when some of these components are actually online. There is even an issue to track the limitation in general: https://issues.couchbase.com/browse/MB-11484
Thank you @HofmeisterAn and @brantburnett for taking a look!
The Couchbase module has not been tested with the enterprise version or the minor version 7.2. I do not know if any APIs have changed.
Yes, you're right, sorry, I should have mentioned. I think this a red herring; I'm getting the same problem with 7.0.2. I don't think any of the relevant APIs have changed between 7.0.2 and 7.2.
Have you seen this comment?
For whatever reason, I didn't notice that, even though I was browsing around in that code area. I have a blind spot for comments sometimes 😓
I do not believe this is specifically a Testcontainers issue.
I didn't expect it to be a testcontainers issue, but I concur with @brantburnett in that the various parts of Couchbase come online in no particular order. I don't think there's any proper ordering of all the HTTP requests to the API that won't eventually lead to a similar condition.
Also, I should mention that I've tried all kinds of wait strategies before I even considered looking at the testcontainers code, to no avail. I believe those wait strategies get applied after the HTTP requests in question.
I'm open to other approaches, but after running integration tests all day today with my hack in place, I haven't had a single testcontainers exception. Certainly my hacky try/catch/loop can be improved upon with something like a Polly policy that can be reused for each HTTP request, or, based on my half-day of empirical data, just the first HTTP request.
Have a great weekend!
Yes, you're right, sorry, I should have mentioned. I think this a red herring; I'm getting the same problem with 7.0.2. I don't think any of the relevant APIs have changed between 7.0.2 and 7.2.
No worries. It was just my first assumption that the version/edition is simply not compatible. I did some tests with both versions and editions and cannot reproduce it. But since I initially ran into this problem too, it looks like the comment I mentioned is likely a race condition (and it just makes it less likely to run into it).
I still try to understand and fix the root cause, especially because Java does not run into this issue (I guess it does not).
I believe those wait strategies get applied after the HTTP requests in question.
Yes, the wait strategies are executed after the start-up callback. The start-up callback initially configures the Couchbase node.
Could you check the actual HTTP response code (EnsureSuccessStatusCode)? Maybe that will help us, together with the Couchbase API docs, to understand what goes wrong. Perhaps it requires the retry logic only for certain HTTP status codes.
Of course, we can add the retry logic if necessary, but as mentioned, I prefer to fix the root cause and not fight symptoms.
Seems like turning Fiddler on greatly reduces the amount of times that this bug happens! I can't really use .NET/debugging as far as I know, since I believe the request fails and throws an exception before even reaching EnsureSuccessStatusCodeAsync
.
I tried about a dozen times, and wasn't able to catch the RenameNodeRequest
(Naming a node) portion failing, but I did catch the next one in the sequence, SetupNodeServicesRequest
(Assigning Services endpoint).
POST http://127.0.0.1:56042/node/controller/setupServices HTTP/1.1 Host: 127.0.0.1:56042 Content-Type: application/x-www-form-urlencoded Content-Length: 32
services=kv%2Cindex%2Cn1ql%2Cfts HTTP/1.1 504 Fiddler - Receive Failure Date: Mon, 04 Dec 2023 19:21:21 GMT Content-Type: text/html; charset=UTF-8 Connection: close Cache-Control: no-cache, must-revalidate Timestamp: 14:21:21.462
[Fiddler] ReadResponse() failed: The server did not return a complete response for this request. Server returned 0 bytes.
It seems to be a "refused connection" situation. I'm not sure if there's any other way to handle this than a retry.
The REST API docs only show responses for 200, 401, 404, and 400.
Also, it doesn't seem like it's a timeout issue, since default is 100 seconds, I don't see that value being set anywhere in testcontainers code, and this 504 is failing well before 100 seconds.
It seems to be a "refused connection" situation. I'm not sure if there's any other way to handle this than a retry.
I did not find any further information that helps with that issue either. I was thinking about the retry logic. What do you think about the following implementation below? Perhaps we can create an upstream issue too. I imagine we are not the only ones who run into this issue. However, I am still wondering if Java is affected too (but of course, there are a lot of timings involved).
private sealed class RetryHandler : DelegatingHandler
{
private const int MaxRetries = 5;
public RetryHandler()
: base(new HttpClientHandler())
{
}
protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
{
for (var _ = 0; _ < MaxRetries; _++)
{
try
{
return await base.SendAsync(request, cancellationToken)
.ConfigureAwait(false);
}
catch (HttpRequestException)
{
await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken)
.ConfigureAwait(false);
}
}
throw new Exception($"Unable to configure Couchbase. The HTTP request '{request.RequestUri}' did not complete successfully.");
}
}
This allows us to pass the RetryHandler
to the HttpClient instance.
Edit: I think I forgot to mention that I would like to avoid adding additional dependencies such as Polly. I prefer to include as few dependencies as possible to minimize dependency clashes (version conflicts).
That seems good to me, especially since it would just be one simple change to the code in ConfigureCouchbaseAsync
. Would you be okay with putting Polly in there, and perhaps some kind of backoff, exponential or otherwise? (I understand if you don't want to add another dependency; I'm happy to create the PR either way). Just saw the edit about Polly :)
Would you be okay with putting Polly in there
I have just updated my comment after you replied 😃. See my edit. If there is not a real need, I would prefer to avoid adding Polly.
However, I am still wondering if Java is affected too
It appears to be the same issue in Java, and here's a PR to deal with it: https://github.com/testcontainers/testcontainers-java/pull/7049
Testcontainers version
3.6.0
Using the latest Testcontainers version?
Yes
Host OS
Windows
Host arch
x64
.NET version
8.0
Docker version
Docker info
What happened?
Trying to use Couchbase testcontainers in an NUnit test. Here's the intialization:
This works fine around 50% of the time. The other 50% results in:
I've traced this to CouchbaseBuilder.cs ->
ConfigureCouchbaseAsync
method. The firsthttpClient.SendAsync
is where the exception is happening. Maybe the earlierWaitUntilNodeIsReady
in the method isn't enough, and the rename node endpoint isn't quite ready?So, I wrapped it in a crude retry:
And that seems to have stopped the exceptions. However, not being an expert in testcontainers, there might be something else I can do instead? Or maybe there's a better retry mechanism already in testcontainers that I can use instead of this? There are a bunch more HTTP requests in this method, maybe they could all benefit from a retry?
Any thoughts on this, or should I just submit a PR similar to this and go on with my life?
Relevant log output
Additional information
No response