Closed rnorth closed 2 years ago
Example build failures (not sure if these will be overwritten when we re-run tests): https://circleci.com/gh/testcontainers/testcontainers-java/12930 https://circleci.com/gh/testcontainers/testcontainers-java/12796 https://circleci.com/gh/testcontainers/testcontainers-java/12791
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you believe this is a mistake, please reply to this comment to keep it open. If there isn't one already, a PR to fix or at least reproduce the problem in a test case will always help us get back on track to tackle this.
This continues to be a problem; our tests fail even if retried. This gives me some doubt that the module is actually stable enough for real use.
I think we're going to have to reach out to Couchbase/the Couchbase community for help. In the worst-case scenario, if we can't fix it, I'm inclined to remove this module.
@rnorth : Just for the record, we've observed problems with couchbase testcontainers as well; https://github.com/jhipster/generator-jhipster/issues/10594
@SudharakaP thanks for confirming, and I’m sorry to hear that.
It sounds like a couple of Couchbase experts may be willing to lend a hand - hopefully there’s a straightforward solution so that we can make this module more reliable. Fingers crossed!
Couchbase employee here, we will investigate this issue and come back to you as soon as possible.
@rnorth : You are welcome. :smile:
@deniswsrosa : Just for the record, I was wondering if this issue is similar to https://github.com/testcontainers/testcontainers-java/issues/715, where the 120-second startup timeout in Couchbase is inadequate. I was trying to create a custom wait strategy to test it out, but couldn't get it working yet (probably due to my lack of knowledge about Couchbase) :smile:
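For reference, the direction I was attempting looks roughly like this. This is a minimal sketch using a plain GenericContainer with an HTTP wait strategy and a longer startup timeout; the image tag, path and timeout value are just illustrative, not the module's actual wait logic:

```java
import java.time.Duration;

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;

public class CouchbaseStartupTimeoutSketch {

    public static void main(String[] args) {
        // Plain GenericContainer used purely for illustration; the real Couchbase
        // module layers bucket/credential setup on top of something like this.
        try (GenericContainer<?> couchbase = new GenericContainer<>("couchbase:6.0.0")
                .withExposedPorts(8091)
                .waitingFor(
                        Wait.forHttp("/ui/index.html")
                                .forPort(8091)
                                .forStatusCode(200)
                                // bump the default wait to rule out slow CI machines
                                .withStartupTimeout(Duration.ofMinutes(5)))) {
            couchbase.start();
            System.out.println("Management UI mapped to port " + couchbase.getMappedPort(8091));
        }
    }
}
```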
thanks for the heads up, I will ping this back to the team.
Some additional information, in case it helps. In the JHipster project, we encounter a similar issue to the one mentioned by @SudharakaP.
In our CI, it works perfectly well with OpenJDK 11:
echo '*** Using OpenJDK 11'
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install -y openjdk-11-jdk
sudo update-java-alternatives -s java-1.11.0-openjdk-amd64
java -version
But it fails randomly when using OpenJDK 8:
echo '*** Using OpenJDK 8'
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
sudo update-java-alternatives -s java-1.8.0-openjdk-amd64
java -version
Are there any alternatives while we wait for a solution?
Trying to triage this a bit. Am I correct that the initial exception reported in this ticket is different from the failing runs linked? In the linked run I see:
Caused by:
org.testcontainers.containers.ContainerLaunchException: Timed out waiting for URL to be accessible (http://localhost:32934/ui/index.html should return HTTP 200)
while this ticket reports
shouldInsertDocument - org.testcontainers.couchbase.Couchbase4_6Test
com.couchbase.client.core.RequestCancelledException: Request cancelled in-flight.
at com.couchbase.client.core.endpoint.AbstractGenericHandler.handleOutstandingOperations(AbstractGenericHandler.java:686)
at com.couchbase.client.core.endpoint.AbstractGenericHandler.handlerRemoved(AbstractGenericHandler.java:667)
at
The first one looks like the server is not coming up before the wait strategy stops waiting; the second one is completely different: the SDK sent a request into the cluster, but while the request was in-flight on the socket the socket got closed - and as a result the client has no other option than to cancel the request.
We get a mix of failures, but com.couchbase.client.core.RequestCancelledException: Request cancelled in-flight is by far the most prevalent. If we could only fix one issue, it would be this one.
It occurs randomly for some % of CI builds and local builds, but unlike other random failures it is not helped at all by retries within a build (we have a JUnit rule to retry specific failed tests that we think are 'flaky').
This makes me think that there's some statefulness at play that's causing one failure to cascade and pollute subsequent tests. I was unable to work out why when I investigated this back in May.
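For context, the retry rule mentioned above is roughly of this shape. This is a simplified, hypothetical sketch rather than the actual rule in this repository (our real rule keys off a marker annotation; the retry count here is made up):

```java
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Simplified, hypothetical retry rule: re-runs a failing test a fixed number of
// times and rethrows the last failure. Not the actual rule from this repository.
public class RetryRule implements TestRule {

    private final int maxAttempts;

    public RetryRule(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                Throwable lastFailure = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        base.evaluate();
                        return; // test passed on this attempt
                    } catch (Throwable t) {
                        lastFailure = t;
                        System.err.println(description + ": attempt " + attempt + " of " + maxAttempts + " failed");
                    }
                }
                throw lastFailure;
            }
        };
    }
}
```

Applied as a @Rule on the affected test classes, it masks ordinary flakiness - which is why it's telling that these Couchbase failures survive the retries.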
@rnorth one thing I was thinking, since it only happens for some % of runs - maybe it would work if we increase the poll wait time from one minute? It might at least be worth a try.
The request cancellations are very likely a side effect of another problem; if the network is healthy, then I suspect we are connecting to that server port before it's ready (which also suggests that polling longer, until it actually comes online, could help).
Are you able to reproduce this locally too or do you only see it in CI?
@daschl : In our case we only see this on CI, but I am starting to think the error on our end might not be exactly the same as the one in this issue. We see:
{"app_name":"travisCouchbaseNoCache","app_port":"-1","level":"WARN","logger_name":"o.t.u.RegistryAuthLocator","message":"Failure when attempting to lookup auth config (dockerImageName: couchbase:6.0.0, configFile: /home/runner/.docker/config.json. Falling back to docker-java default behaviour. Exception message: /home/runner/.docker/config.json (No such file or directory)","thread_name":"main","timestamp":"2019-10-09T18:12:52.726Z"} {"app_name":"travisCouchbaseNoCache","app_port":"-1","level":"INFO","logger_name":"o.t.c.w.s.HttpWaitStrategy","message":"/lucid_mcnulty: Waiting for 60 seconds for URL: http://localhost:32782/ui/index.html","thread_name":"main","timestamp":"2019-10-09T18:12:53.812Z"} {"app_name":"travisCouchbaseNoCache","app_port":"-1","level":"INFO","logger_name":"o.t.c.w.s.HttpWaitStrategy","message":"/lucid_mcnulty: Waiting for 60 seconds for URL: http://localhost:32782/pools/default/","thread_name":"main","timestamp":"2019-10-09T18:13:17.845Z"} {"app_name":"travisCouchbaseNoCache","app_port":"-1","level":"WARN","logger_name":"c.c.c.c.e.Endpoint","message":"[localhost:32775][KeyValueEndpoint]: Authentication Failure: Bucket not found on Select Bucket command","thread_name":"cb-io-1-2","timestamp":"2019-10-09T18:13:21.245Z"} {"app_name":"travisCouchbaseNoCache","app_port":"-1","level":"INFO","logger_name":"o.t.c.CouchbaseQueryServiceWaitStrategy","message":"Waiting for 120 seconds for QUERY service","thread_name":"main","timestamp":"2019-10-09T18:13:21.387Z"}
before the timeout; there are no other exceptions. Here's a typical run: https://github.com/jhipster/generator-jhipster/pull/10498/checks?check_run_id=254018117
I was also wondering whether our issue is similar to https://github.com/testcontainers/testcontainers-java/issues/715, where the 120-second startup timeout in Couchbase is inadequate.
Let me know if you need me to create a new issue for this one since I don't want to pollute this thread. 😄
~I might have a fix for this in #2076. This involved an incredibly painful amount of trial and error 😭.~
Sorry, I was wrong - #2076 might improve some aspects of resilience, but there are still numerous failures when running on slower machines.
With the couchbase-resilience branch on an old machine, I'm getting the following logs (long snippet): https://gist.github.com/rnorth/f61637b4be934d3745ecb35d28961f5b
It's interesting that these failures occur waiting for a (supposedly created) bucket to appear. This failure looks different to other failures we're seeing on CI.
One of the important aspects is that the Couchbase Server HTTP API is inherently asynchronous, so even if you get a response, it doesn't mean the change is immediately applied. I wonder if it makes sense to define the explicit endpoints / "stages" the container code triggers, and then I can go hunting for the right polling logic to check whether the server is actually at a stage where we can move on and declare it "done".
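To make that concrete, the kind of polling I have in mind looks roughly like this. It's a sketch only; the endpoint path, the placeholder credentials and the "healthy" marker in the response body are assumptions, not a confirmed contract:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Scanner;

public class BucketReadinessProbe {

    // Polls the (assumed) bucket endpoint until it answers 200 and the body reports
    // a healthy status, or the deadline passes. Illustrative only.
    static boolean waitForBucket(String baseUrl, String bucket, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        String auth = Base64.getEncoder()
                .encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8)); // placeholder credentials
        while (System.currentTimeMillis() < deadline) {
            try {
                HttpURLConnection conn = (HttpURLConnection)
                        new URL(baseUrl + "/pools/default/buckets/" + bucket).openConnection();
                conn.setConnectTimeout(2_000);
                conn.setReadTimeout(2_000);
                conn.setRequestProperty("Authorization", "Basic " + auth);
                if (conn.getResponseCode() == 200) {
                    try (InputStream in = conn.getInputStream();
                         Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
                        String body = scanner.hasNext() ? scanner.next() : "";
                        // "healthy" marker is an assumption about the response body
                        if (body.contains("\"status\":\"healthy\"")) {
                            return true;
                        }
                    }
                }
            } catch (IOException notReadyYet) {
                // server not reachable yet - fall through and retry
            }
            Thread.sleep(500);
        }
        return false;
    }
}
```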
FYI I submitted #2081; it seems that there was a faulty execStart that did not wait for completion. While working on it, I also discovered that if we remove execStart (which means that the ports will not be proxied), then the client fails with "request cancelled inflight".
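For readers following along, the essence of the fix is simply making sure the exec'd command has actually finished before the container is considered started. As a rough illustration (not the code from #2081), Testcontainers' execInContainer helper blocks until the command exits; the image and command here are placeholders:

```java
import org.testcontainers.containers.Container;
import org.testcontainers.containers.GenericContainer;

public class ExecCompletionSketch {

    public static void main(String[] args) throws Exception {
        try (GenericContainer<?> container = new GenericContainer<>("couchbase:6.0.0")
                .withExposedPorts(8091)) {
            container.start();

            // execInContainer only returns once the command has exited, so the exit
            // code and output below reflect a completed run rather than a fire-and-forget.
            Container.ExecResult result = container.execInContainer("sh", "-c", "echo proxy-setup-done");
            System.out.println("exit=" + result.getExitCode() + ", stdout=" + result.getStdout());
        }
    }
}
```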
@daschl I have a feeling that the error message is wrong, because if the destination port cannot be opened, it should clearly state that, since there wasn't any request to send
@bsideup the way the SDK logic is implemented is that if you get a "request cancelled inflight" it actually has been written into netty already, so the socket must've been opened at some point. Otherwise the op would time out because we couldn't find a socket to write it to.
We are either cancelling because the handler was removed or because we caught an exception on the channel (https://github.com/couchbase/couchbase-jvm-core/blob/master/src/main/java/com/couchbase/client/core/endpoint/AbstractGenericHandler.java#L675); in both cases, the netty channel must've been open at some point. That said, there should be netty warn logs around this timeframe which should tell us more about the explicit reason.
😬 reopening - we still don't have a fix yet.
I've been running a fairly stupid series of scripted tests to try to at least narrow down where the race is occurring, injecting sleep calls into various locations to see if they reduce the failure rate. So far (400+ runs exercising 14 different sleep locations, plus @bsideup's change), none has been flawless. The most promising, a 10s sleep just before bucket creation (containerIsStarted), reduced the failure rate to a mere 4%, but it's obviously not a fix either. We seem to have a concurrency bug that we're not seeing, or more than one bug conspiring together.
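For reference, the sleep-injection experiments were done roughly like this. It's a hypothetical sketch that assumes bucket creation happens in CouchbaseContainer's containerIsStarted hook (as mentioned above); the constructor signature and the 10s value are just the ones from the experiment, and this is a probe, not a fix:

```java
import com.github.dockerjava.api.command.InspectContainerResponse;
import org.testcontainers.couchbase.CouchbaseContainer;

// Hypothetical probe: delay whatever containerIsStarted does (assumed to include
// bucket creation, per the experiment above) by sleeping first.
public class SleepBeforeBucketCreationContainer extends CouchbaseContainer {

    public SleepBeforeBucketCreationContainer(String imageName) {
        super(imageName);
    }

    @Override
    protected void containerIsStarted(InspectContainerResponse containerInfo) {
        try {
            Thread.sleep(10_000); // the 10s pause from the experiment
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        super.containerIsStarted(containerInfo);
    }
}
```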
I'd be happy to share the 100s of MBs of logs I've gathered during this exercise 😂. One of the more pertinent sets of logs, from the most promising test case, is below:
At this point I'm afraid I'm at a bit of a loss - we're obviously missing something vital, and we're using up a lot of time and energy to make little progress!
I'd be grateful for any and all suggestions, no matter how radical!
@rnorth I've been experimenting with a container that does not use socat, and I think I'm getting somewhere. I'm not fully done with it yet, but everything so far seems to work fine. I've got it working in a standalone lib right now, but as soon as I get my POC completely working I'll do a PR to this repository - if you are curious: https://gist.github.com/daschl/e32e05e6abc31e450f67c23fe30c3826#file-couchbasecontainer-java
Compared to the current version:
I hope I can make more progress in December and will report back.
@daschl thanks so much! I've been having another unsuccessful attempt to fix things this afternoon, and was just about ready to throw in the towel. Your message is extremely good timing!
For what it's worth, the current implementation has numerous places that deserve improvement - polling for creation of buckets, proper handling of non-200 HTTP response codes, longer connect timeouts, etc. The list is long. However many patches I've made, though, there are still an extraordinary number of different kinds of failures.
This is what makes me very worried about our chances of incrementally improving the reliability of this module, so a more drastic rework seems sensible.
I'll really look forward to your next updates - thanks for your efforts.
Update: it seems that https://github.com/testcontainers/testcontainers-java/pull/2106 stabilized our tests and now @Flaky works as intended.
Just to note, although our Couchbase daily builds seem fairly stable, the intermittent Couchbase failures seem to still be there after upgrading to version 1.14.0.
https://github.com/hipster-labs/jhipster-daily-builds/runs/597246279?check_suite_focus=true
@SudharakaP thanks for updating! Do you have more logs of such failures? /cc @daschl
@bsideup : No, we run that build once a day and that's the only time it failed (out of the 6 times it ran after upgrading to 1.14.0). But if and when there are more, I'll post them here.
@pascalgrimaud : Could we schedule this build to run several times a day to speed up the process of finding more logs? :smile:
@SudharakaP : sure, feel free to change the cron and merge it :)
Done; I'll post back tomorrow with hopefully more logs. :smile:
So, here are five failures, two of which involve Maven and the other three Gradle;
Gradle
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/81289930
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/82411492
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/82524483
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/82802953
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/82935879
Maven
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/82443392
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/82567070
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/82761556
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/82866603
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/83078352
https://github.com/hipster-labs/jhipster-daily-builds/actions/runs/83033289
@bsideup : Do you think these are enough samples, or do you guys want more? :smile:
EDIT: I've added some more samples above. :smile:
@SudharakaP looking at the logs, I do not see any actual errors (i.e. https://github.com/hipster-labs/jhipster-daily-builds/runs/600774837?check_suite_focus=true) .. I only see
2020-04-20 05:38:23.395 INFO 7589 --- [ main] o.t.c.wait.strategy.HttpWaitStrategy : /reverent_keldysh: Waiting for 60 seconds for URL: http://localhost:32783/pools
Is that suggesting it times out on the wait?
edit: oh, one other had an actual error in it:
Caused by: java.lang.IllegalStateException: Could not create primary index for bucket jhipster: Response{protocol=http/1.1, code=500, message=Internal Server Error, url=http://localhost:32781/query/service}
@daschl : Thanks for looking into it. Feel free to let us know if you need any further information or logs. :smile:
@SudharakaP btw the new release that got out today has a change in it which will make it more intelligent to wait until the query engine is up - that maybe fixes the second error from above.. can you try upgrading to the latest bugfix version and report if it still happens?
@SudharakaP also I saw in one of the gradle failures it would just print the exception type, but not the associated message. is it possible in gradle to enable that so that we also see the exception message and cause, potentially?
@daschl : Thanks for the input. We have upgraded to 1.14.3, the latest version. There haven't been any Couchbase failures since last month, so I am wondering if this problem is gone now. :thinking: Let's give it some more time.
I'll look into enabling the Gradle logs to get the detailed error messages.
Just to let you know, for the past couple of months there has been only one Couchbase failure: https://github.com/hipster-labs/jhipster-daily-builds/runs/764200230?check_suite_focus=true. So I think the latest version has improved things a lot. Thanks a lot for your work. :smile:
@SudharakaP I've been trying to spot the error in the log you pointed out, but I could only see build warnings - can you point me to the test failure with the stack trace/logs? Then I can take a look :) thanks!
@daschl : I thought this one was a timeout. The relevant line is:
2020-06-12 04:15:57.190 INFO 6779 --- [ main] o.t.c.wait.strategy.HttpWaitStrategy : /optimistic_lehmann: Waiting for 60 seconds for URL: http://localhost:32783/pools/default/buckets/jhipster
Notice that until two months ago (before we upgraded to the latest version) we had this kind of log, with minor differences; such as this one:
https://github.com/hipster-labs/jhipster-daily-builds/runs/610843718?check_suite_focus=true
which has the line,
2020-04-23 04:11:17.675 INFO 7633 --- [ main] o.t.c.wait.strategy.HttpWaitStrategy : /amazing_kirch: Waiting for 60 seconds for URL: http://localhost:32783/pools
before the timeout. Although I think this has greatly improved and failures are rare, both issues seem related, so I just thought of letting you know. I'll report more if I see this in the future. :thinking:
edit: oh, one other had an actual error in it:
Caused by: java.lang.IllegalStateException: Could not create primary index for bucket jhipster: Response{protocol=http/1.1, code=500, message=Internal Server Error, url=http://localhost:32781/query/service}
Hi, I intermittently get a similar error:
This error occurs when trying to create the primary index.
@jorgerod thanks for sharing - that's a transient error that we should transparently retry. I'll work on it.
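In the meantime, a workaround on the test side is to wrap index creation in a small retry loop. This sketch posts the N1QL statement straight to the query service and retries on a non-200 response such as the HTTP 500 above; the endpoint URL, credentials and bucket name are placeholders:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PrimaryIndexRetry {

    // Posts CREATE PRIMARY INDEX to the query service and retries on a non-200
    // response. Illustrative only; values passed in are placeholders.
    static void createPrimaryIndexWithRetry(String queryServiceUrl, String bucket,
                                            String user, String password, int maxAttempts) throws Exception {
        String form = "statement=" + URLEncoder.encode(
                "CREATE PRIMARY INDEX ON `" + bucket + "`", StandardCharsets.UTF_8.name());
        String auth = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));

        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(queryServiceUrl).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(form.getBytes(StandardCharsets.UTF_8));
            }
            int code = conn.getResponseCode();
            if (code == 200) {
                return; // index creation accepted
            }
            lastFailure = new IOException("Index creation returned HTTP " + code + " on attempt " + attempt);
            Thread.sleep(1_000L * attempt); // simple linear backoff before retrying
        }
        throw lastFailure;
    }
}
```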
@daschl Is this about the issue in the initial post, or something else?
@kiview I think the initial post has been solved since then (that was before the rewrite). I was referring to @jorgerod's last comment, which seems to be that since we create a bucket and then immediately the index, the index service might not have gotten the info about the bucket yet. The error says that it will retry the index creation in the background automatically, so we can safely ignore it and move on - it will be there eventually. I'm working on a PR for it right now.
@daschl Ok cool. I also feel that the original issue is solved. I'd propose we close this issue and since you are aware of the other issue, we just keep track of it once you submit the PR, ok?
agreed. PR filed at https://github.com/testcontainers/testcontainers-java/pull/4681
For a long time we've been getting intermittent failures in CI, such as this:
Unfortunately this is a nuisance for us and for contributors. I've tried suggestions such as this but it hasn't improved reliability.
Can anyone with more Couchbase expertise help investigate?
To reproduce, run the test suites under modules/couchbase repeatedly.