hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0
2.33k stars 1.73k forks source link

VCR test(s) fail during replay: Framework converted resources #14158

Open roaks3 opened 1 year ago

roaks3 commented 1 year ago

Failure rates

Impacted tests

Affected Resource(s)

List of resources migrated to the plugin framework

List of data sources migrated to the plugin framework

Affected PR(s)

https://github.com/GoogleCloudPlatform/magic-modules/pull/7580

Message(s)

2023-03-30T19:27:18.895Z [WARN]  sdk.helper_resource: Error running Terraform CLI command: test_name=TestAccFrameworkProviderBasePath_setBasePath test_step_number=3 test_terraform_path=/bin/terraform test_working_directory=/tmp/plugintest3335870807/work2314434417
  error=
  | exit status 1
  | 
  | Error: Error when reading or editing DNSManagedZone "projects/ci-test-project-188019/managedZones/tf-test-zone-n36lc0kstv": Get "https://www.googleapis.com/dns/v1beta2/projects/ci-test-project-188019/managedZones/tf-test-zone-n36lc0kstv?alt=json": Requested interaction not found
  | 
  |   with google_dns_managed_zone.foo,
  |   on terraform_plugin_test.tf line 7, in resource "google_dns_managed_zone" "foo":
  |    7: resource "google_dns_managed_zone" "foo" {
  | 

2023-03-30T19:27:22.187Z [WARN]  sdk.helper_resource: Error running Terraform CLI command: test_working_directory=/tmp/plugintest3335870807/work2314434417 test_name=TestAccFrameworkProviderBasePath_setBasePath test_step_number=3
  error=
  | exit status 1
  | 
  | Error: Error when reading or editing ManagedZone: Delete "https://www.googleapis.com/dns/v1beta2/projects/ci-test-project-188019/managedZones/tf-test-zone-n36lc0kstv?alt=json": Requested interaction not found
  |
roaks3 commented 1 year ago

It looked to me like none of the requests for these tests were being found during replay, so I wonder if the new framework is doing something different with requests, or possibly recording/replaying them differently.

zachberger commented 11 months ago

I see the same issue: "Requested interaction not found" with google_kms_crypto_key in https://github.com/GoogleCloudPlatform/magic-modules/pull/9409

SarahFrench commented 11 months ago

I'm going to update the issue description with a list of data sources that have been migrated to the plugin framework, and acceptance tests including those data sources are likely to be affected. It is possible for the VCR system to not have recorded interactions for other reasons than issues with the plugin framework migration.

Edit: Done, the failing acceptance tests listed on this issue map pretty well to the migrated data sources.

SarahFrench commented 10 months ago

I'm not free enough to do a deep-dive on this issue, but it occurred to me that a bunch of acceptance tests for the plugin-framework migrated data sources include tests that use past versions of the Google provider as ExternalProviders. This is to compare behaviour of old provider versions vs the newer versions that include migrations. Using ExternalProviders pulls providers from the public Terraform Registry, so I can understand how VCR would be affected.

(GA) Tests that do this are:

For other tests that do this type of comparison of old provider code vs the current code (for example here's a test comparing pre- and post-labels rework) the tests are left ignored in VCR. I think the solution to this GH issue is to leave these tests ignored in VCR and create simple acc tests that don't include the comparison to old provider versions.


Note: The Beta, Firebase-related tests listed in this issue don't do the same thing with ExternalProviders, so there's more investigation needed there.

Also, TestAccFrameworkProviderMeta_setModuleName is performing a niche action with setting the module name, which I think is passed through in HTTP requests? I'll need to investigate that more.

SarahFrench commented 10 months ago

From what I've seen in https://github.com/GoogleCloudPlatform/magic-modules/pull/9742#issuecomment-1876041100 when implementing a new data source using the plugin-framework I'm more convinced that there is a VCR issue that's related to the plugin-framework, and it isn't due to external providers like I said above. 🤦

SarahFrench commented 8 months ago

Investigation

Stuff from running TestAccDataSourceDnsManagedZone_basic in VCR_MODE=RECORDING locally

The test passes, there are files made but they're 'incomplete':

fixtures/TestAccDataSourceDnsManagedZone_basic.seed

2815386111364297813

fixtures/TestAccDataSourceDnsManagedZone_basic.yaml

---
version: 1
interactions:
- request:
    body: ""
    form: {}
    headers:
      Content-Type:
      - application/json
      User-Agent:
      - Terraform/1.8.0-dev (+https://www.terraform.io) Terraform-Plugin-SDK/terraform-plugin-framework
        terraform-provider-google/dev
    url: https://dns.googleapis.com/dns/v1/projects/PROJECT_ID/managedZones/tf-test-zone-0icnli2epb?alt=json
    method: GET
  response:
    body: |
      {
        "error": {
          "code": 404,
          "message": "The 'parameters.managedZone' resource named 'tf-test-zone-0icnli2epb' does not exist.",
          "errors": [
            {
              "message": "The 'parameters.managedZone' resource named 'tf-test-zone-0icnli2epb' does not exist.",
              "domain": "global",
              "reason": "notFound"
            }
          ]
        }
      }
    headers:
      Alt-Svc:
      - h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
      Cache-Control:
      - private
      Content-Type:
      - application/json; charset=UTF-8
      Date:
      - Fri, 16 Feb 2024 17:03:48 GMT
      Server:
      - ESF
      Vary:
      - Origin
      - X-Origin
      - Referer
      X-Content-Type-Options:
      - nosniff
      X-Frame-Options:
      - SAMEORIGIN
      X-Xss-Protection:
      - "0"
    status: 404 Not Found
    code: 404
    duration: ""

There should be more than 1 interaction listed for the test that provisions and deletes resources.

The single interaction recorded there is from the plugin framework provider, because "terraform-plugin-framework" is in the user agent. Where are the other API interactions for the PF provider?

Stuff from running the above with a debugger in the VCR-related code

After investigating this a bit I've found the problem shows up in the acctest package's closeRecorder function. Hopefully that's the only place that will need a fix.

That function has a block that's repeated once per SDK/PF provider (SDK provider block is first, PF provider block is second). Those blocks roughly perform these actions:

When you run a test in VCR mode there will be an SDK and PF client present. When the first block (SDK) runs it has access to the vcr source and as a result is able to save VCR stuff to file. That block ends by deleting the vcr source from the global map variable. The PF block runs immediately after and cannot find the vcr source, as it was just deleted, and doesn't write anything to file. Edit: this only impacts saving the seed, not the YAML file.

This suggests that only interactions that the SDK client has are ever written to file, and the PF client stuff is never recorded. However in the example I show above there is 1 interaction recorded from the PF provider as "terraform-plugin-framework" is in the user agent. I'll need to look at this more.

Thinking forward, this suggests there are problems with tests that will use the 2 separate clients of the SDK and PF provider configs. Especially if you look at the comment above getCachedConfig:

// Returns a cached config if VCR testing is enabled. This enables us to use a single HTTP transport // for a given test, allowing for recording of HTTP interactions. // Why this exists: schema.Provider.ConfigureFunc is called multiple times for a given test // ConfigureFunc on our provider creates a new HTTP client and sets base paths (config.go LoadAndValidate) // VCR requires a single HTTP client to handle all interactions so it can record and replay responses so // this caches HTTP clients per test by replacing ConfigureFunc

By having different clients for the SDK and PF code we break the assumption of "a single HTTP transport for a given test, allowing for recording of HTTP interactions"

SarahFrench commented 8 months ago

More notes:

SarahFrench commented 8 months ago

In addition to what I said before, there is another incompatibility between VCR and the PF-implemented data source:

When performing Read actions (no C/U/D on data source) a new client is used which is separate to the client within the provider's config. When you enable TF_LOG = DEBUG or higher you don't see the request/response from when the data source is being read into state because this new client is separate to that logging logic somehow. The 'central' client in the PF provider is referenced here via option.WithHTTPClient when setting up the new client though.

However when the 'check destroy' function for the acc test does an HTTP request using the plugin-framework provider's client, instead of a new client made in the Read function, that request is reflected in the debug logs and is the single API interaction shown in the VCR recordings from my investigation above.


Summary

Solutions?

Looks like making the muxed provider use a single central client would be a good idea to ensure VCR compatibility during migrations in future. And to avoid using client libraries within migrated datasources. For now we should avoid any new PF-migrated data sources or resources until these problems are addressed.

I'm going to be working on planning proposals for re-starting the muxing/migration project this quarter so I won't be acting on any of the ideas above any time soon. But doing this investigation was very useful for figuring out previously unidentified risks with the migration project.

SarahFrench commented 6 months ago

I'm currently writing up an RFC that includes discussion about this issue and possible solutions.

While I was taking a look in the code I noticed that the SDK provider has some caching, due to how the provider will be re-configured multiple times during a test:

https://github.com/hashicorp/terraform-provider-google/blob/f35814917edb95f75f893dcc74275c4fd5ef84ca/google/acctest/vcr_utils.go#L452-L458

It doesn't look like there's similar caching for the PF provider. Depending on what solution we go with we may need to look into that.

SarahFrench commented 2 months ago

Update: I've been working on fixing issues with the muxing in the Google provider(s) and those changes fix the issue with VCR tests. I won't close this issue until those changes are merged, but this issue is 'solved' in that I know why it happens and I have an early version of a fix implemented