hashicorp / consul-terraform-sync

Consul Terraform Sync is a service-oriented tool for managing network infrastructure in near real time.

Allow CTS more granularity on service activity in Consul #1056

Open · FranksRobins opened this issue 1 year ago

Description

Allow CTS to act with more granularity on service activity in Consul, so that one can configure a specific task for registration and another one for deregistration if needed. Something like:

# This task is triggered both when services register and when they deregister, as usual
task {
  name           = "taskA"
  enabled        = true
  providers      = ["my-provider"]
  module         = "path/to/my/module"
  condition "services" {
    names = ["web", "api"]
  }
}

# This task is only triggered when services register
task {
  name           = "taskB"
  enabled        = true
  providers      = ["my-provider"]
  module         = "path/to/my/module"
  condition "services" {
    names = ["web", "api"]
    event = "register"
  }
}

# This task is only triggered when services deregister
task {
  name           = "taskC"
  enabled        = true
  providers      = ["my-provider"]
  module         = "path/to/my/module"
  condition "services" {
    names = ["web", "api"]
    event = "deregister"
  }
}

Use Cases

This would allow asymmetrical deployments in more complex infrastructures that require this granularity. Sometimes the method for removing a resource is not the mirror image of the method for integrating it. A resource can also become a dependency of other services, in which case those services require a graceful shutdown or a reconfiguration/restart before the Terraform-managed resource is removed, which could otherwise break the system. Some infrastructure components likewise need a graceful shutdown. In this regard, the feature would add more fault tolerance in general.

Some operations, like sharding or manipulations on a blockchain system, are hard to reverse or simply not meant to be reversed: they are not elastic but plastic. Such systems need to manage change, not necessarily just grow and shrink.

I rely a lot on the self-service network paradigm when developing infrastructures. For example, one Consul cluster is used for service networking in the conventional way (with services registered by Nomad), while another, smaller cluster is used only for service requests and subscriptions: requests are used for calling operations, while subscriptions are used for long-running processes. In this architecture CTS is not just used to update the infrastructure with network rules when a new app or service is deployed. Rather, it is used as a "central manager" that administers the infrastructure. Administration tasks are not always reversible; sometimes they are one-off or periodic batch operations.
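As a sketch of that pattern, assuming the proposed event attribute existed, an administrative task watching the smaller request-only cluster could fire once per request and never attempt a reversal (the address, service name and module path below are all illustrative):

consul {
  # Hypothetical address of the smaller, request-only cluster
  address = "requests.consul.example.com:8500"
}

task {
  name    = "admin-batch"
  enabled = true
  module  = "path/to/my/admin-module"
  condition "services" {
    names = ["batch-request"]
    event = "register"  # proposed attribute: fire only when the request service appears
  }
}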

As another example, I use GitLab for ephemeral pipelines that process a manifest file. The creation of the repo, the registration of the runner and the addition of the files are all managed by CTS. Once the pipeline is done with its work, everything is destroyed, but a copy of the manifest is also pushed to GitHub, which serves as the "main" or "reference" repo. One repo is ephemeral (elastic) while the other is persistent (plastic). Both could be managed by CTS, but since both are triggered by the same "service request", CTS would remove the file resource on both GitLab and GitHub.

Of course, I could configure two tasks in CTS, one of them based on, say, a Consul KV key holding the file contents. But if any of those keys were deleted, CTS would either delete the file in the repo, or I would need prevent_destroy = true, which would make CTS crash and stall; I would then have to run terraform state rm manually, which is not best practice. Even when that does not happen, I would end up with as many keys on Consul as there are manifests, each holding the contents of a file, which defeats the whole purpose of having a repo.

This management would be much easier if I could configure CTS with a taskA on both registration and deregistration, as usual, plus a second taskB on registration only, so that CTS could manage the infrastructure asymmetrically, as in the sketch below.
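Concretely, with the proposed event attribute, the pair of tasks might look like this (the service name and module paths are illustrative):

# taskA: the ephemeral GitLab side, created on registration and torn down on deregistration
task {
  name    = "taskA"
  enabled = true
  module  = "path/to/my/gitlab-module"
  condition "services" {
    names = ["manifest-request"]
  }
}

# taskB: the persistent GitHub copy; with the proposed attribute it would ignore
# deregistration, so the reference copy is never destroyed by CTS
task {
  name    = "taskB"
  enabled = true
  module  = "path/to/my/github-module"
  condition "services" {
    names = ["manifest-request"]
    event = "register"  # proposed attribute
  }
}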

Alternative Solutions

An alternative would be to use prevent_destroy = true in the Terraform file, but this causes an error that makes CTS crash and stop performing any other operation, which breaks the infrastructure management. Some folks from the Terraform team have proposed using terraform state rm, but this is simply not scalable and will most likely add entropy to the infrastructure, making it more error-prone as it grows. It would also be preferable not to orphan a resource.
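For reference, the workaround in question is Terraform's lifecycle meta-argument; a minimal sketch, with an illustrative github_repository_file resource from the use case above:

resource "github_repository_file" "manifest" {
  repository = "reference-repo"  # illustrative name
  file       = "manifest.yaml"
  content    = file("manifest.yaml")

  lifecycle {
    # Any plan that would destroy this resource fails with an error;
    # in CTS that means the task errors out and stalls.
    prevent_destroy = true
  }
}

The manual escape hatch is then something like terraform state rm github_repository_file.manifest against the task's state, which is exactly the kind of out-of-band step that does not scale.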

Additional context

At large, CTS is a great tool that I intend to use more and more, which is why I believe it's worth submitting this request. I think it is currently somewhat too oriented towards enabling self-service from the perspective of developers who need to push new apps and features, and that by adding this simple flexibility CTS could be used to design much more complex infrastructures and take a leap beyond network automation into the realm of autonomous networks. While many brag about asynchronicity and elasticity, there is an enormous demand for asymmetry and plasticity, which can sound alike but offer their own spectrum of possibilities, especially for BYOD/mixed environments with on-premise/(multi-)cloud/IoT resources.

I hope this finds you well and that you see potential in this request!

Regards