hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

Show new VCR failures separately from preexisting ones in VCR status reports #12365

Open rileykarson opened 2 years ago

rileykarson commented 2 years ago

After a run of our presubmit "VCR" tests, we display a summary of the tests that failed after punching through to the live APIs. We'd expected the number of failures to be near-zero, but in practice we tend to accumulate a couple of failures a week that we only burn through intermittently. We haven't built a process to systematically address these failures, and we struggle to find the cycles to do so. These failures generally start for reasons unrelated to any individual PR.

As a result, when new contributors interact with the repo, the Magician (un)helpfully reports a large number of test failures to them. In cases like https://github.com/GoogleCloudPlatform/magic-modules/pull/6412#issuecomment-1220718384, most of the failures are completely unrelated to the user's change.

We should display more pertinent information to users in the VCR status report, highlighting tests that are newly failing as of their change. For example:

Tests passed during RECORDING mode:
TestAccCloudfunctions2function_cloudfunctions2BasicAuditlogsExample
TestAccFirebaserulesRelease_BasicRelease

Tests newly failing during RECORDING mode:
TestAccContainerCluster_withNodeConfigReservationAffinitySpecific
Please fix these to complete your PR

<details>
<summary>Already-failing tests</summary>

Tests already failing:
TestAccComputeInstance_networkPerformanceConfig
TestAccComputeInstance_soleTenantNodeAffinities
TestAccComputeGlobalForwardingRule_internalLoadBalancing
TestAccCloudRunService_cloudRunServiceStaticOutboundExample
TestAccPrivatecaCertificateAuthority_privatecaCertificateAuthoritySubordinateExample
TestAccSqlDatabaseInstance_withPrivateNetwork_withAllocatedIpRange
</details>

View the [build log](https://storage.cloud.google.com/ci-vcr-logs/beta/refs/heads/auto-pr-6412/artifacts/6bfc7900-b1fa-417b-b6cb-9838383c26e1/build-log/recording_test.log) or the [debug log](https://console.cloud.google.com/storage/browser/ci-vcr-logs/beta/refs/heads/auto-pr-6412/artifacts/6bfc7900-b1fa-417b-b6cb-9838383c26e1/recording) for each test
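
For illustration only, here's a rough sketch (in Go, which is not necessarily how the Magician actually renders comments) of how a grouped report like the one above could be templated once the test results are partitioned; all names here are hypothetical:

```go
// Hypothetical sketch: render the grouped VCR report mocked up above from
// partitioned test lists. Illustration only, not the Magician's real template.
package vcr

import (
	"io"
	"text/template"
)

const reportTmpl = `Tests passed during RECORDING mode:
{{range .Passed}}{{.}}
{{end}}
Tests newly failing during RECORDING mode:
{{range .NewlyFailing}}{{.}}
{{end}}Please fix these to complete your PR

<details>
<summary>Already-failing tests</summary>

{{range .AlreadyFailing}}{{.}}
{{end}}</details>
`

// report holds the three groups of tests the status comment distinguishes.
type report struct {
	Passed, NewlyFailing, AlreadyFailing []string
}

// writeReport renders the status comment body to w.
func writeReport(w io.Writer, r report) error {
	return template.Must(template.New("vcr").Parse(reportTmpl)).Execute(w, r)
}
```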

We can likely use a pretty simple heuristic here: compare the PR's failures against the set of tests already failing on main.
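
As a sketch of that comparison (function and package names are hypothetical), the partition is just a set difference between the PR's failing tests and the tests recorded as failing on main:

```go
// Hypothetical sketch: split this PR's failing tests into "newly failing"
// and "already failing", given the baseline list recorded for main.
package vcr

// partitionFailures returns the PR failures not present in the main baseline
// (newly failing) and those that are (already failing on main).
func partitionFailures(prFailures, mainFailures []string) (newlyFailing, alreadyFailing []string) {
	baseline := make(map[string]bool, len(mainFailures))
	for _, t := range mainFailures {
		baseline[t] = true
	}
	for _, t := range prFailures {
		if baseline[t] {
			alreadyFailing = append(alreadyFailing, t)
		} else {
			newlyFailing = append(newlyFailing, t)
		}
	}
	return newlyFailing, alreadyFailing
}
```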

We'd add a REPLAY run step that runs on each commit merge, costing around half an hour of machine time per commit, and stores the list of failing tests in a GCS bucket. Each PR submitted against the repo has a branch point from main (its merge base), and we can look up the results for that commit in the GCS bucket to determine which tests were already failing.
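
The lookup could go roughly like this; the bucket name (`ci-vcr-baselines`) and object layout (`baselines/<sha>.json`) are assumptions for illustration, not the actual CI setup:

```go
// Hypothetical sketch of the merge-base lookup; the bucket name and object
// layout are assumptions for illustration only.
package vcr

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"os/exec"
	"strings"

	"cloud.google.com/go/storage"
)

// baselineFailures returns the tests recorded as failing on main at the
// PR's merge base, or an error if no baseline object exists yet.
func baselineFailures(ctx context.Context, prBranch string) ([]string, error) {
	// Merge base between the PR branch and main.
	out, err := exec.Command("git", "merge-base", "main", prBranch).Output()
	if err != nil {
		return nil, fmt.Errorf("finding merge base: %w", err)
	}
	sha := strings.TrimSpace(string(out))

	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	defer client.Close()

	// Read the failing-test list the post-submit REPLAY run stored for this commit.
	r, err := client.Bucket("ci-vcr-baselines").Object("baselines/" + sha + ".json").NewReader(ctx)
	if err != nil {
		return nil, fmt.Errorf("no baseline for %s: %w", sha, err)
	}
	defer r.Close()

	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	var failing []string
	if err := json.Unmarshal(data, &failing); err != nil {
		return nil, err
	}
	return failing, nil
}
```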

There is a slight timing issue, since folks could open a PR before the post-submit replay for its merge base has finished. That's generally unlikely, as half an hour is pretty short, but we could choose a strategy to handle it: return a warning to the user and skip filtering for that PR, wait for the commit's results (with some timeout, say an hour), or step back through earlier commits until we find a match.
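
The "step back in commits" option could be as simple as walking the merge base's first-parent ancestors until a stored baseline turns up; `baselineExists` below is a stand-in for a GCS existence check like the reader sketched above, and the limit of 10 commits is arbitrary:

```go
// Hypothetical sketch of the fallback: if the merge base has no baseline yet
// (its post-submit REPLAY run hasn't finished), walk back through its
// ancestors until we find a commit that does, up to a small limit.
package vcr

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

func nearestBaselineCommit(ctx context.Context, mergeBase string, baselineExists func(context.Context, string) bool) (string, error) {
	// List the merge base and its most recent ancestors, first-parent only.
	out, err := exec.Command("git", "rev-list", "--first-parent", "--max-count=10", mergeBase).Output()
	if err != nil {
		return "", fmt.Errorf("listing ancestors of %s: %w", mergeBase, err)
	}
	for _, sha := range strings.Fields(string(out)) {
		if baselineExists(ctx, sha) {
			return sha, nil
		}
	}
	// No baseline found: the caller should warn the user and skip filtering for this PR.
	return "", fmt.Errorf("no baseline found within 10 commits of %s", mergeBase)
}
```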

slevenick commented 2 years ago

This could likely be done by calculating the currently failing tests for each PR during the VCR run step, then copying that file to the storage bucket that holds the main cassettes during the merge step. This may have confusing merge problems, though.
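
A sketch of what that merge-step copy might look like; the bucket and object names below are illustrative, not the real cassette bucket layout:

```go
// Hypothetical sketch of the merge-step copy described above: the VCR run
// step has already written the failing-test list for the PR, and on merge we
// copy it next to the main cassettes so later PRs can read it as the baseline.
package vcr

import (
	"context"
	"fmt"

	"cloud.google.com/go/storage"
)

// promoteFailureList copies the per-PR failing-test object into the bucket
// that holds the main cassettes. Bucket and object names are illustrative.
func promoteFailureList(ctx context.Context, client *storage.Client, prNumber int) error {
	src := client.Bucket("ci-vcr-cassettes-pr").Object(fmt.Sprintf("refs/heads/auto-pr-%d/failing_tests.json", prNumber))
	dst := client.Bucket("ci-vcr-cassettes").Object("main/failing_tests.json")
	// Server-side copy within GCS; no local download needed.
	if _, err := dst.CopierFrom(src).Run(ctx); err != nil {
		return fmt.Errorf("promoting failing-test list for PR %d: %w", prNumber, err)
	}
	return nil
}
```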

rileykarson commented 2 years ago

Yeah, an approach keyed on commits and merge bases gets around the merge problems, at the cost of needing to run on every merge.