spring-projects / spring-boot

Spring Boot
https://spring.io/projects/spring-boot
Apache License 2.0

Investigate flaky tests #25410

Closed dreis2211 closed 3 years ago

dreis2211 commented 3 years ago

Hi,

both locally and on CI I encounter relatively frequent flaky tests. While I can usually tell that they're flaky and ignore them, it happens often enough that I spend (and waste) time identifying whether those failures are caused by my changes.

This is probably not a complete list, but here are some I've noticed lately:

ReactiveElasticsearchRepositoriesAutoConfigurationTests > doesNotTriggerDefaultRepositoryDetectionIfCustomized()
ReactiveElasticsearchRepositoriesAutoConfigurationTests > testDefaultRepositoryConfiguration()
DataCassandraTestIntegrationTests > didNotInjectExampleService()
Jetty10ServletWebServerFactoryTests > whenServerIsShuttingDownGracefullyThenResponseToRequestOnIdleConnectionWillHaveAConnectionCloseHeader()
CouchbaseAutoConfigurationIntegrationTests > defaultConfiguration()

Subjectively, the JDK 15 pipeline is a bit flakier, but that might be a false lead.

Anyhow - I wonder if we can do anything about those. I remember that you did an awesome job of increasing timeouts here and there already and tweaked the testcontainer startup attempts, but I think we're past the testcontainers stage in most of the cases mentioned above.

Cheers, Christoph

dreis2211 commented 3 years ago

Maybe a stupid question, but is there the possibility to "grep" over the failed build scans on ge.spring.io to get a more complete list of flaky tests?

wilkinsona commented 3 years ago

There is indeed and it's really useful. Here are all the test failures over the last 7 days sorted with flaky tests first.

I thought I'd stabilized the Cassandra tests with a timeout increase, but one [failed again today with a 10s timeout](https://ge.spring.io/s/53c6qrchhmhny/tests/:spring-boot-project:spring-boot-test-autoconfigure:test/org.springframework.boot.test.autoconfigure.data.cassandra.DataCassandraTestIntegrationTests/didNotInjectExampleService()#1).
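For context, these integration tests typically poll for a condition until a deadline (the real Spring Boot tests use Awaitility for this); a minimal stdlib-only sketch of the pattern, with hypothetical names, shows why bumping the `Duration` is the usual fix:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

public class PollUntil {

    // Polls the condition at a fixed interval until it holds or the timeout
    // elapses. Returns whether the condition ever became true.
    static boolean pollUntil(BooleanSupplier condition, Duration timeout, Duration interval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(interval.toMillis());
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // The condition becomes true after ~200ms, well within a 10s timeout.
        // Under heavy parallel load, though, a slow container start can push
        // the real condition past the deadline and the test flakes.
        boolean ok = pollUntil(() -> System.currentTimeMillis() - start > 200,
                Duration.ofSeconds(10), Duration.ofMillis(50));
        System.out.println(ok); // prints "true"
    }
}
```

The trade-off discussed in this thread is that a longer timeout hides slowness rather than removing it, which is why the conversation moves on to worker counts and image versions.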

dreis2211 commented 3 years ago

Oh, that is lovely - I was hoping Gradle Enterprise had such a feature. Thanks for sharing that view, @wilkinsona.

From a gut feeling - and this might be wrong - most of the failures seem related to some sort of timeout, right? I wonder if the parallelism - as much as it helps - puts more pressure on the system overall and leads to more timeouts. Given that you did an amazing job of tweaking the task caches, is this maybe something to play around with?

wilkinsona commented 3 years ago

I think another common theme among the flaky tests is that many of them use Docker. Of the five listed above, four of them use Docker and I think parallelism could be part of the cause.

When I was working on the build migration, allowing Gradle to create one worker per core made things really unstable with many Docker-related failures. One worker per two cores seems to work well on our development machines at least. My MacBook Pro has 16 cores so I have the following in ~/.gradle/gradle.properties:

org.gradle.workers.max=8

We configure the max workers to 4 on CI as they have, IIRC, 8 cores. We could try tuning this down, but I'd prefer not to slow everything down to avoid a problem that's at least somewhat Docker specific. I'm tempted to go through another round of timeout increases and see how it goes.
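Putting the two settings mentioned above side by side (values as stated; `~/.gradle/gradle.properties` is Gradle's standard per-user location):

```properties
# ~/.gradle/gradle.properties on a 16-core development machine
# (one worker per two cores)
org.gradle.workers.max=8

# the equivalent value used on the 8-core CI agents
org.gradle.workers.max=4
```

The same limit can be applied for a one-off run with Gradle's `--max-workers` command-line option.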

dreis2211 commented 3 years ago

The Docker theme reminded me of something. I wonder if it would help to use the newer versions of the respective container images as well.

I saw that for almost every image there are newer (patch) versions available. (There are also newer major and minor versions available here and there, but that might be too aggressive)

| Image | Current | Latest |
| --- | --- | --- |
| Cassandra | 3.11.2 | 3.11.10 |
| Mongo | 4.0.10 | 4.0.23 |
| Redis | 4.0.6 | 4.0.14 |

Neo4j and Couchbase should already be on the latest patch versions.

Let me know if I should give this a test.

wilkinsona commented 3 years ago

This is probably a better test failures link. It adds the CI tag so it filters out failures on our development machines where things may be failing as we're iterating on a new feature.

wilkinsona commented 3 years ago

Yes please, @dreis2211. Upgrading those 3 sounds like a good idea to me.

dreis2211 commented 3 years ago

I also noticed that the libraries in spring-boot-parent apparently haven't had a Bomr run lately. There is a Testcontainers update to 1.15.2. Let me know if I should create a PR for the update or if you want to run Bomr.

wilkinsona commented 3 years ago

I'll run Bomr on all three maintained branches.

wilkinsona commented 3 years ago

I've made a couple of changes today related to flaky tests:

wilkinsona commented 3 years ago

Things seem to have settled down quite a bit recently so I'll close this one now. We can take a look again in the future if we start noticing a rise in flakiness again.

snicoll commented 3 years ago

CouchbaseAutoConfigurationIntegrationTests is flaky again. I've seen it fail several times in the recent past. Reopening to look at it again.

snicoll commented 3 years ago

@daschl suggests that we enable debug logging for com.couchbase. That'll help identify why the bucket isn't ready.
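A sketch of what that could look like using Spring Boot's standard `logging.level.*` mechanism (where exactly it belongs depends on how the test is wired, e.g. test properties vs. a logback config):

```properties
# Enable verbose Couchbase client logs to see why the bucket isn't ready
logging.level.com.couchbase=DEBUG
```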

snicoll commented 3 years ago

@daschl also suggested that upgrading to the latest Couchbase driver could help. I haven't seen a single flaky test since the upgrade, so I am going to close this one again.