segmentio / terraform-provider-segment

Terraform provider for Segment, using the Public API
https://registry.terraform.io/providers/segmentio/segment/latest
MIT License

Segment object created on Tracking Plan provider failure #94

Closed · Oracen closed 4 months ago

Oracen commented 4 months ago

Description

I've found an issue where Segment resources are created even when terraform apply fails and the state is not updated. I noticed this in the context of tracking plans, but it's likely duplicated in any "compound" resource that invokes more than one Segment endpoint.

To Reproduce

Example Terraform code that triggers the issue:

A minimal example that will fail but create a resource:

provider "segment" {
  token = var.segment_public_api_token
}

resource "segment_tracking_plan" "api" {
  name        = "Plan"
  type        = "LIVE"
  description = "desc"
  rules = [
    {
      key     = "Add Rule"
      type    = "TRACK"
      version = 1
      json_schema = jsonencode({
        "$schema": "http://json-schema.org/draft-07/schema#",
        "properties" : {
          "context" : {"type": ["object", "any"]}, // Triggers error
          "traits" : {},
          "properties" : {}
        }
      })
    }
  ]
}

This code will result in a 400 response (screenshot below), which I assume indicates that the tracking plan rule is invalid. However, the tracking plan itself will still be created in the console, without any rules attached.

[Screenshot from 2024-03-05 10-18-14]

My guess is that this happens because the tracking plan provider hits the Create Tracking Plan endpoint followed by Update Tracking Plan Rules, without any preflight checks. This operation is non-atomic, allowing the remote system to fall into a "halfway" state. The 400 response marks the terraform apply operation as failed, so no resource is added to the state. When terraform apply is run again, a new plan is created and the process repeats.
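To illustrate the failure mode described above, here is a minimal Go sketch of the two-call sequence. The client and its methods are purely hypothetical stand-ins for the Segment Public API, not the provider's actual code; the point is only that the plan ID is discarded when the second call fails.

```go
package main

import (
	"errors"
	"fmt"
)

// client is an illustrative stand-in for the Segment API client.
type client struct{ plans map[string][]string }

// createPlan mimics the Create Tracking Plan endpoint; it always succeeds.
func (c *client) createPlan(name string) string {
	id := fmt.Sprintf("tp_%d", len(c.plans)+1)
	c.plans[id] = nil
	return id
}

// updateRules mimics Update Tracking Plan Rules and rejects an invalid
// rule, like the 400 response in the report.
func (c *client) updateRules(id string, rules []string) error {
	for _, r := range rules {
		if r == "invalid" {
			return errors.New("400: invalid rule")
		}
	}
	c.plans[id] = rules
	return nil
}

// createTrackingPlan reproduces the non-atomic sequence: create the plan,
// then attach rules in a second call. On failure the ID is never returned
// to the caller, so nothing is recorded in Terraform state, but the plan
// already exists remotely.
func createTrackingPlan(c *client, name string, rules []string) (string, error) {
	id := c.createPlan(name)
	if err := c.updateRules(id, rules); err != nil {
		return "", err // id is lost: orphaned plan in Segment
	}
	return id, nil
}

func main() {
	c := &client{plans: map[string][]string{}}
	_, err := createTrackingPlan(c, "Plan", []string{"invalid"})
	fmt.Println("error:", err, "| plans left in Segment:", len(c.plans))
}
```

Each failed apply leaves one more orphaned plan behind, which matches the duplicate resources observed in the console.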

A recommended fix would be to align the TF resources with atomic operations in the way the AWS provider does (yes, the code will get a little ugly), or to add some saga-like functionality that rolls back a partially successful operation. The latter is probably much more complicated.
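For concreteness, the saga-like option could look something like the sketch below: if attaching rules fails, a compensating delete removes the freshly created plan so the remote system is left clean. All names here are hypothetical; this is one possible shape, not the provider's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the Segment API; not the provider's real types.
type client struct{ plans map[string][]string }

func (c *client) createPlan(name string) string {
	id := fmt.Sprintf("tp_%d", len(c.plans)+1)
	c.plans[id] = nil
	return id
}

func (c *client) updateRules(id string, rules []string) error {
	for _, r := range rules {
		if r == "invalid" {
			return errors.New("400: invalid rule")
		}
	}
	c.plans[id] = rules
	return nil
}

func (c *client) deletePlan(id string) { delete(c.plans, id) }

// createWithRollback is the saga-like variant: when the second step
// fails, the first step is compensated by deleting the plan, so no
// orphan is left behind.
func createWithRollback(c *client, name string, rules []string) (string, error) {
	id := c.createPlan(name)
	if err := c.updateRules(id, rules); err != nil {
		c.deletePlan(id) // compensating action
		return "", err
	}
	return id, nil
}

func main() {
	c := &client{plans: map[string][]string{}}
	_, err := createWithRollback(c, "Plan", []string{"invalid"})
	fmt.Println("error:", err, "| plans left in Segment:", len(c.plans))
}
```

The trade-off is that the rollback itself can fail (e.g. a network error mid-compensation), which is part of why this approach is more complicated than it first looks.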

As a follow-up, it may also be helpful to pass back better debugging information in the error message. While there are security concerns about exposing too much information from the API, it would improve the DX to know the source of validation errors. Particularly where there are duplicate resources being created, guess-and-check is no longer a valid debugging strategy! :laughing:

cv commented 4 months ago

Hi @Oracen!

Thank you for the thorough issue report and steps to reproduce. @deanhuynh has just shipped a new version that hopefully addresses this. Can you try it out and let us know if it solves your problem?

deanhuynh commented 4 months ago

To be clear, the change does not roll back the creation upon failure. Instead, it ensures that the resource is tainted, so on the next apply it is deleted and recreated, and the resulting state is as expected. This follows the same convention as the AWS provider.

deanhuynh commented 4 months ago

We will also look into improving the error messages separately 🙂 thanks for raising!

Oracen commented 4 months ago

Just upgraded, seems to be working as intended! Solid LGTM.

Instead, it ensures that the resource is tainted so upon the next apply, it is deleted then recreated and the resulting state is as expected.

Thanks for the insight... I haven't touched the depths of Terraform since ~v0.11 so I appreciate the clarification on the mechanics.

Love the work team