vmware-tanzu / vm-operator

Self-service manage your virtual infrastructure...

Enable test parallelism to identify and correct flaky tests #27

Open akutz opened 1 year ago

akutz commented 1 year ago

I've noticed a bunch of new, flaky tests in the integration tests for the content source and vmpubreq controllers, and they seem related to poorly implemented Ginkgo tests. For example, from the first run of the integration test job for PR #26:

------------------------------
• Failure [10.293 seconds]
Integration tests
/home/runner/work/vm-operator/vm-operator/test/builder/test_suite.go:251
  Reconcile ContentSource
  /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:165
    when ContentSource and ContentLibraryProvider exists
    /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:166
      when a new ContentSource with duplicate vm images is created
      /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:279
        should reconcile and generate a new VirtualMachineImage object [It]
        /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:319

        Timed out after 10.001s.
        Expected
            <int>: 3
        to equal
            <int>: 2

        /home/runner/work/vm-operator/vm-operator/controllers/contentlibrary/contentsource/contentsource_controller_intg_test.go:326

Re-running just the integration test (IT) job usually clears up flakes like the one above. I believe these occur because our tests are racy, and once people start creating PRs in this project, we will see these errors more frequently. When that happens:

  1. Click on the job that failed, e.g. integration-test
  2. Click the icon to re-run just the failed job
  3. Click on the Re-run jobs button to re-run the failed job and its dependents

This is usually enough to get a passing run. However, I want to set a goal that we run Ginkgo with the -p flag, which runs the specs in a suite in parallel across multiple processes. This would very quickly identify all of the issues we have related to the way we've constructed our tests.

This issue tracks the need to enable parallelism for our test suites.

akutz commented 1 year ago

Hi @yi0909 and @dilyar85,

Maybe we should file one or two more issues to at least try addressing the two flakes of which we are already aware? It's been four runs, and this job keeps hitting these flakes:

I am currently on the fourth attempt; fingers crossed!