pulumi / pulumi

Pulumi - Infrastructure as Code in any programming language 🚀
https://www.pulumi.com
Apache License 2.0

Support custom logic in resource lifecycle #1691

Open lukehoban opened 5 years ago

lukehoban commented 5 years ago

Today, for a given resource, its behaviour during Create, Update and Delete operations is fully specified by the underlying resource provider.

However, often there is additional custom logic that a program wants to apply as part of one of those operations.

For example, perhaps there is a custom notion of health check for a given EC2 Instance (or other compute resource), and the Create or Update steps should not register completion until those health checks complete (and should fail the update if the health check fails).

This could be supported in Pulumi by offering lifecycle hooks as callbacks. Since the deployment program is a general purpose programming language, these custom lifecycle hooks can be written using arbitrary logic in JavaScript/Python/etc.

let instance = new aws.ec2.Instance("instance", {
  ami: "myami",
}, {
  postCreate: async () => {
    // make HTTP requests against the health check endpoint on the instance's public IP
    // until they succeed, throwing an Error if this process times out.
  },
});

Related to https://github.com/pulumi/pulumi/issues/99 and https://github.com/pulumi/pulumi/issues/127.

hausdorff commented 5 years ago

This is the medium-term fix for https://github.com/pulumi/pulumi-kubernetes/issues/261, we should try to make this in M19 if we can.

lukehoban commented 5 years ago

@pgavlin Could you share your initial design thoughts on this? I assume we'll tackle in M20.

mgasner commented 5 years ago

:+1: My use case is that I'm having issues using pulumi to destroy Amazon EMR clusters when I specify custom managed security groups. EMR adds rules to those security groups that create a circular dependency, and without a pre-destroy hook there's no way for me to tear down the resources cleanly.

lukehoban commented 5 years ago

Let's see if we can flesh out a design for this in M20.

jen20 commented 5 years ago

@mgasner The use case of making security groups destroyable after EMR has added additional rules to them is one that can actually be resolved today, by setting revokeRulesOnDelete: true on the security group itself. This comes from the underlying Terraform provider - the doc for that property:

Revoke all of the Security Groups attached ingress and egress rules before deleting the rule itself. This is normally not needed, however certain AWS services such as Elastic Map Reduce may automatically add required rules to security groups used with the service, and those rules may contain a cyclic dependency that prevent the security groups from being destroyed without removing the dependency first.

lukehoban commented 5 years ago

@jen20 @pgavlin I believe you guys fleshed out a design proposal here last week, could you update this issue with the current plan?

pgavlin commented 5 years ago

After lengthy discussion, we believe that we have a plan for post-create provisioning in the short term.

If a resource's onCreate callback fails, the resource will be tainted, its outputs' underlying promises will be rejected, and the program will fail.

We will not address pre-delete callbacks at this time. The fundamental problem with such callbacks is that they require the callback's program to be available at the time the resource is destroyed. A naive implementation such as running the user's program before performing the delete is problematic: the removal of a resource from the user's program is a common cause of deletion, so the callback's program is not available. In essence, a pre-delete callback's program must become part of its resource's state (or must otherwise be made available independently of the resource's definition in the user's program).

atistler commented 4 years ago

Any update on this? CRUD event callbacks would make pulumi super powerful.

geekofalltrades commented 4 years ago

We're trying to deploy a private HashiCorp Vault cluster in Kubernetes, and having these lifecycle hooks would solve two different aspects of it.

First, we deploy Vault in a Kubernetes Deployment with three replicas for HA. We use its /sys/health endpoint for Kubernetes healthchecking, which causes the one leader Vault pod to become fully ready, while the two standby Vault pods do not become ready. This is desirable because it means that the standby Vault pods are not added to the Service load balancer; only the leader pod is. However, this means that, from Pulumi's perspective, this Deployment never becomes ready. We end up having to turn off the await logic for it. This reduces our ability to orchestrate the remaining steps of our deployment based on Vault's actual readiness.

If we had these hooks, we could add custom await logic saying "this Deployment is ready once any one of its Pods reports as ready."
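
For illustration only, a sketch of what that could look like if the postCreate hook proposed at the top of this issue existed (the option and its semantics are hypothetical):

import * as k8s from "@pulumi/kubernetes";

const vault = new k8s.apps.v1.Deployment("vault", {
  // metadata would still carry the pulumi.com/skipAwait annotation so the
  // built-in await logic doesn't wait for all three replicas to become ready.
  spec: { /* three replicas, /sys/health readiness probe, ... */ },
}, {
  // Hypothetical resource option (does not exist today).
  postCreate: async () => {
    // Poll the cluster (e.g. with a Kubernetes client) until at least one Pod
    // of this Deployment reports Ready; throw an Error on timeout.
  },
});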

Second, this Vault instance is internal to our Kubernetes cluster, and we can't reach it from outside. In our current custom deploy pipeline that we're hoping to replace with Pulumi, we start a kubectl port-forward to the leader Vault pod so that we can reach it to configure it.

With these hooks, I could start a port-forward during resource creation when using the pulumi_vault provider.

lukehoban commented 4 years ago

For folks looking for solutions here - please do check out https://github.com/pulumi/examples/tree/master/aws-ts-ec2-provisioners, which shows how to use a simple library on top of "dynamic providers" to enable this. We're still looking at adding more first-class support for this in the near future - but the library used in that example should work for many use-cases in the meantime.
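
For readers landing here, a minimal TypeScript sketch of that dynamic-provider approach (resource and endpoint names are made up; the provisioners library in the linked example is more complete):

import * as pulumi from "@pulumi/pulumi";

// Runs a health check as its "create" step; creation fails if the check never passes.
class HealthCheckProvider implements pulumi.dynamic.ResourceProvider {
  async create(inputs: { url: string }): Promise<pulumi.dynamic.CreateResult> {
    for (let i = 0; i < 30; i++) {
      try {
        const res = await fetch(inputs.url); // Node 18+ global fetch
        if (res.ok) return { id: inputs.url, outs: inputs };
      } catch { /* endpoint not reachable yet */ }
      await new Promise((r) => setTimeout(r, 10_000)); // wait 10s between attempts
    }
    throw new Error(`health check never passed: ${inputs.url}`);
  }
}

class HealthCheck extends pulumi.dynamic.Resource {
  constructor(name: string, args: { url: pulumi.Input<string> }, opts?: pulumi.CustomResourceOptions) {
    super(new HealthCheckProvider(), name, args, opts);
  }
}

// Usage sketch: the check depends on the instance, and downstream resources depend on the check.
// const check = new HealthCheck("instance-healthy",
//   { url: pulumi.interpolate`http://${instance.publicIp}/healthz` }, { dependsOn: [instance] });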

nesl247 commented 4 years ago

Any plans for this anytime soon? This is really important to be able to communicate to systems such as GitHub for the deployments.

riongull commented 4 years ago

@lukehoban , has Pulumi considered incorporating a model similar to React hooks for managing Pulumi state, side effects, and component lifecycle? It enables cross component state/effects management with a simple functional API for users. The best part is the model facilitates authoring custom hooks as well (which might be the analogue to pulumi.dynamic.Provider in this paradigm).

I'm not smart enough to put this all together, but it seems like this issue (and related issues) is bumping up against the same type of problems the React team was trying to solve with React hooks (a simple API that enables cross component immutable state management and reusable/sharable custom logic as well).

rjshrjndrn commented 3 years ago

If we had these lifecycle hooks available, it would be easier to build more abstracted layers, like operators in Kubernetes. My use case is creating a Cassandra or Elasticsearch cluster: when a node is added, it would then go through a set of post-processing steps for provisioning, configuring, and finally joining the node to the cluster. That would be a huge boost for the community too - a kind of Cassandra module or ES module.

thomshib commented 3 years ago

As per the Pulumi roadmap, this feature was slated for Q4 2020. Have there been any changes to the roadmap, and will this item still be addressed in Q4 2020?

cyclingwithelephants commented 3 years ago

Given that it's Q1 2021 and this ticket isn't closed, has this been re-prioritised or is something like this implementable?

My case is enabling GCP APIs: the enable call returns once the API is enabled but before it's actually available for use, so a simple dependsOn won't work in this situation. In Terraform, the workaround is a sleep 120 local-exec, and it would be great to be able to do this (or, even better, poll an endpoint as the original poster describes) in Pulumi.

leezen commented 3 years ago

This has been re-prioritized and the Pulumi team won't be tackling this this quarter.

@cyclingwithelephants in your specific use-case, you might consider using a dynamic provider (https://www.pulumi.com/docs/intro/concepts/programming-model/#dynamicproviders) or, depending on what you need to do, a simple apply with an API call to check for readiness within it.
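
A rough sketch of the apply-based variant for the GCP case, assuming @pulumi/gcp and a simple fixed wait (a real readiness check could poll the API instead):

import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

const compute = new gcp.projects.Service("enable-compute", {
  service: "compute.googleapis.com",
});

// Gate: resolves only after a delay, and only during actual updates (not previews).
const computeReady = compute.service.apply(async (name) => {
  if (!pulumi.runtime.isDryRun()) {
    await new Promise((r) => setTimeout(r, 120_000)); // crude stand-in for polling the API
  }
  return name;
});

// Downstream resources can consume computeReady (e.g. via pulumi.all) so they only resolve after the wait.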

almson commented 3 years ago

I noticed that @pulumi/eks uses an apply trick to make sure the EKS cluster is available before returning. Can someone explain a bit more on how it works? Does it run on every up/preview or only when the resource is updated?

gitfool commented 3 years ago

@almson it effectively waits until the api server health check passes or times out, and only runs during updates, so skips previews.

lkt82 commented 3 years ago

Any news on this? I need to run an action before a resource is destroyed. I'm using C#, so I can't use a dynamic provider.

irl-segfault commented 2 years ago

An option for achieving this is to allow arbitrary Outputs to be injected into the DAG. You could use the output of the instance creation to run the custom logic, the result of which could be modeled as an Output that subsequent resources could depend on via dependsOn.

An example of this in pseudocode would be

instance := ec2.NewInstance(...)

healthCheckOutput := instance.Arn.ApplyT(func(arn string) bool {
    // make API calls to do the health check
    return true
})

nextResource := pkg.NewResource(..., pulumi.DependsOn(healthCheckOutput))

Something along those lines. Not sure how you'd handle failure here.

See https://github.com/pulumi/pulumi/issues/2545#issuecomment-875916375 where I reiterate this.

djsd123 commented 2 years ago

Concur with @cyclingwithelephants. I have a module/custom component to provision aws accounts within an organisation. However, resources that are to exist within the account fail on first run for this very reason.

Terraform:

resource "aws_organizations_account" "account" {
  provider = aws.account
  name     = var.name
  email    = var.root_account_email

  provisioner "local-exec" {
    # AWS accounts aren't quite ready on creation, arbitrary pause before we provision resources inside it
    command = "sleep 120"
  }
}

Pulumi:

const account = new organizations.Account(`${args.name}-account`, {
    name: args.name,
    email: args.rootAccountEmail,
    tags: args.tags
}, { provider: args.orgAccountProvider, provisioner: { /* SOME WAY TO SLEEP */ } });
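
One way to approximate that provisioner today is the Pulumi Command package mentioned later in this thread; a sketch (account values are placeholders) might look like:

import * as aws from "@pulumi/aws";
import * as command from "@pulumi/command";

const account = new aws.organizations.Account("account", {
  name: "dev",                  // placeholder values
  email: "root@example.com",
});

// AWS accounts aren't quite ready on creation; pause before provisioning inside them.
const accountSettle = new command.local.Command("account-settle", {
  create: "sleep 120",
}, { dependsOn: [account] });

// Resources created inside the new account then take { dependsOn: [accountSettle] }.
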
jonsherrard commented 2 years ago

Ooooh mercy, I am confuddled - it'd be neat if there were some comprehensive docs on workarounds for these issues.

I've a very basic program where I need to wait for an AWS SSL certificate to complete before I can go and start the rest of the infrastructure build. I don't want to do argument-drilling and pass the certificate instance down through all of my ComponentResources and nested ComponentResources. And it just seems impossible unless I do that? I suppose there's an argument to say: "If this component needs a certificate instance to be built in order to run, it should be passed in, or be part of the component." - But... it's a wild-card SSL certificate shared by lots of sub-programs, so I'm using getCertificate deeper into the graph. Hmmm. Not sure what I think really.

I just want a simple API to say: do not instantiate these custom components until new aws.acm.Certificate() is complete. Or simple directions in the docs on how to achieve this effect.

Yours confusedly,

Jon x

dustinboss commented 2 years ago

@jonsherrard Not sure if this will fit your problem, but there is a CertificateValidation class that might be what you're looking for. I believe it represents the stage where the certificate is being provisioned by AWS.

I think that the issue with certificates is that AWS actually creates the Certificate record before it's ready to be used. It gets created, and then AWS still has to provision/validate it. That's the part that you actually want to wait on.

Here's how I do it -- but I wrote this code almost two years ago now, so there may be updates to the API since then:


const DOMAIN_NAME = 'sub.domainname.com'
const ROOT_DOMAIN = 'domainname.com'

// Route53 Zone for base domain
const route53Zone = pulumi.output(aws.route53.getZone({
  name: ROOT_DOMAIN, // <-- the root domain that matches the route53 zone
  privateZone: false
}));

// Provision an SSL certificate to enable SSL
const certificateRegionName = 'us-east-1'; // must be in us-east-1 for the API Gateway
const certificateRegion = new aws.Provider("ssl-cert-region", { region: certificateRegionName });
const sslCert = new aws.acm.Certificate("ssl-cert", {
  domainName: DOMAIN_NAME, // <-- something here probably needs to change for the wildcard domain
  validationMethod: "DNS"
}, { provider: certificateRegion });

// Create the necessary DNS records for ACM to validate ownership, and wait for it.
const sslCertValidationRecord = new aws.route53.Record("ssl-cert-validation-record", {
  zoneId: route53Zone.id,
  name: sslCert.domainValidationOptions[0].resourceRecordName,
  type: sslCert.domainValidationOptions[0].resourceRecordType,
  records: [ sslCert.domainValidationOptions[0].resourceRecordValue ],
  ttl: 10 * 60 /* 10 minutes */
});

// Create the Certificate Validation object
const certificateValidation = new aws.acm.CertificateValidation("ssl-cert-validation-issued", {
  certificateArn: sslCert.arn,
  validationRecordFqdns: [ sslCertValidationRecord.fqdn ]
}, { provider: certificateRegion });

Then, you can use the output from the Certificate Validation instead of from the Certificate directly. Like this:

new aws.apigateway.DomainName('domain-web-cdn', {
  certificateArn: certificateValidation.certificateArn,
  domainName: DOMAIN_NAME
}) 
jonsherrard commented 2 years ago

Thanks for the help @dustinboss, I was pseudo-coding a bit there; my certificate-containing ComponentResource is a bit more sophisticated and does include a certificateValidation child.

I tried dismantling my component and putting the certificateValidation instance in the top-level, but still no dice.

ModernRonin commented 2 years ago

I arrived here from the link in this stackoverflow thread.

Like the original poster there, I am interested only in lifecycle hooks at the project/stack level, not at the resource level. As in: get called when an actual update is starting (after the preview/calculation phase) and when it's finished (with info about whether it succeeded or failed). This is simply to be able to send out notifications of the form "the environment XY is going down for maintenance" and "the environment XY is back up again".

Of course I can work around by wrapping the pulumi call into a script and sending these notifications before and after that call. But that gives a much longer time window between the two notifications than is actually necessary.

So the question is: do you have support for this (supposedly) much simpler scenario?

AmitArie commented 2 years ago

any updates?

rjshrjndrn commented 2 years ago

Folks, I know you guys are busy with so many other priorities. But if you can at least mention whether it's in the pipeline, that would be worth waiting for.

jonsherrard commented 2 years ago

@rjshrjndrn [screenshot: the issue's status label shown in the Projects area]

It's labelled as such in the Projects area of this very issue.

rjshrjndrn commented 2 years ago

Aha... Sorry @jonsherrard.. :sweat_smile: Obviously seeing is not observing.

lukehoban commented 2 years ago

Earlier this week we released a new Pulumi Command package in the Pulumi registry which addresses the core use cases tracked here, running scripts locally or remotely on a target VM as part of the Pulumi resource lifecycle. This new package is supported in all Pulumi languages.

Notably, it is focused only on running shell scripts and copying files. It does not (yet) support running code inside the Pulumi language your deployment is authored in from within the resource lifecycle.

We've closed #99 as part of releasing this new package, and there are some more details on how it can be used for script and command provisioning use cases in https://github.com/pulumi/pulumi/issues/99#issuecomment-1003445058.

We'll keep this issue open to track deeper support for language-level hooks that can run within the resource lifecycle as discussed in the original issue description above.

But for many users who have come across this issue, their use case is quite likely supported via the new Command package.
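
For the original EC2 health-check example, a sketch using the Command package (the instance, key, and endpoint are assumptions) could look like:

import * as command from "@pulumi/command";

// `instance` is assumed to be an aws.ec2.Instance and `privateKey` its SSH key.
const waitHealthy = new command.remote.Command("wait-healthy", {
  connection: {
    host: instance.publicIp,
    user: "ec2-user",
    privateKey: privateKey,
  },
  // Retry the health endpoint until it answers, failing the deployment otherwise.
  create: "curl --retry 30 --retry-delay 10 --retry-connrefused -fsS http://localhost:8080/healthz",
}, { dependsOn: [instance] });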

ghostsquad commented 2 years ago

I'd like to also note, that if you have K8s available, you can implement at least "post-success" with a K8s Job. I believe Pulumi deploys and runs a job to completion. The success of the job indicates success of the resource object deployment. There's still the "on-failure" that I'd like to see. If this resource fails to update/create successfully, then here's a fallback branch.

jnovick commented 2 years ago

I agree regarding the on-failure hook. For me, the use case is that I would like to log more info around the failure to better empower individuals to see the source of the problem. For example, a Helm release sometimes reports failure due to a health check failing or a secret not being found. It is easy to run a kubectl command to find out why it failed instead of just reporting "failed". I just want to be able to print that within the pipeline to provide more context to my developers.

pawelprazak commented 1 year ago

I'd like to add a use case for this feature:

  • open a port-forward tunnel to a Kubernetes pod (e.g. using kubernetes client)
  • connecting to the workload's API using the port-forward (e.g. a database) to manage resources (e.g. using a dynamic provider)
  • closing the port-forward (to close the port and allow the Pulumi program to finish)

ghostsquad commented 1 year ago

I'd like to add a use case for this feature:

  • open a port-forward tunnel to a Kubernetes pod (e.g. using kubernetes client)
  • connecting to the workload's API using the port-forward (e.g. a database) to manage resources (e.g. using a dynamic provider)
  • closing the port-forward (to close the port and allow the Pulumi program to finish)

This is the perfect use case for a job.

pawelprazak commented 1 year ago

@ghostsquad could you elaborate a bit?

ghostsquad commented 1 year ago

@ghostsquad could you elaborate a bit?

Of course! So the basic idea is to encapsulate the work you want to do into a docker image + env vars/config map and deploy that to the cluster. The job must run and complete successfully for Pulumi to report success.

You would have to double check the definition of "job health". At least this is how it works with other systems that do health checking for Kubernetes.

This is the method I used to use to do database migrations, and this is also natively supported in ArgoRollouts as a way to run a custom check as part of a canary deployment.
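
A sketch of that Job-based approach with @pulumi/kubernetes (the image name is an assumption); as noted above, Pulumi waits for the Job to run to completion:

import * as k8s from "@pulumi/kubernetes";

const migrate = new k8s.batch.v1.Job("db-migrate", {
  spec: {
    backoffLimit: 2,
    template: {
      spec: {
        restartPolicy: "Never",
        containers: [{
          name: "migrate",
          image: "registry.example.com/app-migrations:v42", // image with the migration logic baked in
        }],
      },
    },
  },
});

// Downstream workloads take { dependsOn: [migrate] } so they only roll out after a successful run.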

pawelprazak commented 1 year ago

Oh, so a Kubernetes Job is what is being suggested; unfortunately, this is not what I had in mind.

I've wrapped Typesense API client into a dynamic provider that I would have many interactions with, both using Inputs and Outputs.

Your idea could probably be a viable workaround, but I'd rather split my automation into two Pulumi stacks and maybe wrap them with the Automation API to do what I have in mind.

Having said that, I strongly believe Pulumi needs to be able to express this type of "transactional" temporary states to be able to express complex automations that interact with many layers at the same time and require specific conditions for the deployment to succeed.

Another example use case would be just-in-time IAM assumptions and revocation, where a specific role needs to be assumed to accomplish various parts of the deployment and then revoked after completion.

RobbieMcKinstry commented 1 year ago

I think the Vault cluster example from the beginning of this thread is quite illustrative for me. A few months ago I also tried to spin up a Vault cluster with Pulumi, but quickly realized I was going to have to connect to a Vault instance living in K8s to unseal it. It can be done in only a few steps with the Command provider, but that requires setting up kubectl, finding the right pod, and dumping the unseal key somewhere. It would be easier to facilitate programmatically via the Vault API instead of Bash scripting.

RobbieMcKinstry commented 1 year ago

I tend to conceptualize lifecycle work as a form of policy, myself.

ghostsquad commented 1 year ago

Another example use case would be just-in-time IAM assumptions

You can do this by simply using a different provider (configured with a different role) and passing that as a resource option
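
A sketch of that pattern with an explicit AWS provider (the role ARN and region are placeholders):

import * as aws from "@pulumi/aws";

// Provider that assumes a different role only for the resources that need it.
const deployRole = new aws.Provider("deploy-role", {
  region: "us-east-1",
  assumeRole: {
    roleArn: "arn:aws:iam::123456789012:role/DeployRole",
    sessionName: "pulumi-deploy",
  },
});

const bucket = new aws.s3.Bucket("restricted-bucket", {}, { provider: deployRole });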

pawelprazak commented 1 year ago

Another example use case would be just-in-time IAM assumptions

You can do this by simply using a different provider (configured with a different role) and passing that as a resource option

Yes, if I want to let the assumption just expire, but there is no way to self-revoke after it's no longer needed. (Also, I'm not sure that re-running the program would work, because the resource would already exist.)

If there were a way to create a temporary resource that lives only until all of its dependencies are deployed and then removes itself, that would be a way to seamlessly model this behaviour within the Pulumi model.

Frassle commented 1 year ago

Having said that, I strongly believe Pulumi needs to be able to express this type of "transactional" temporary states to be able to express complex automations that interact with many layers at the same time and require specific conditions for the deployment to succeed.

I think those probably don't come under the same work as adding lifecycle hooks, but I agree that support for that would unlock a lot of workflows.

DanielRailean commented 1 year ago

hey everyone, any status updates on this?

PatMyron commented 1 year ago

snapshotting pre-deletion is common functionality: https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/224

same-id commented 1 year ago

Isn't ApplyT in a way a hook on an output of an object?

If only we could detect whether it's a post-create run, rather than always running the logic in the ApplyT function...

IreshMM commented 7 months ago

I would also love to have this feature. Any progress on this?

Frassle commented 7 months ago

I would also love to have this feature. Any progress on this?

There has been some progress towards this, in the form of foundation work that other upcoming features are putting in place.

ysaakpr commented 4 weeks ago

I see that this ticket has been open for the last 6 years. How are you handling this as of now? In my case, we run apps using GCP Cloud Run services, and before every create/update of the service resource I want to run a GCP Cloud Run Job for database migrations. I was unable to get this working reliably. I could get it working for the first-time create by having the service resource depend on a local Command, but that doesn't take care of future updates or a pre-update step.

Does anyone have a reliable way to handle this?

UnstoppableMango commented 4 weeks ago

If by "localCommand" you're referring to the "pulumi-command" provider, there should be inputs for an update command as well as triggers to control when things run. Are these not meeting your needs in some way?

If you were referring to something else, then I would recommend checking out https://github.com/pulumi/pulumi-command. It may not be a perfect solution but I've been able to leverage it to cover a variety of use cases.
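
A sketch of that shape with the pulumi-command local.Command (the gcloud invocation and image reference are assumptions about the setup described above):

import * as command from "@pulumi/command";

const migrate = new command.local.Command("run-migrations", {
  // Runs on first deployment; a change to any trigger replaces the resource and re-runs it.
  create: "gcloud run jobs execute db-migrate --region us-central1 --wait",
  // Runs when other inputs change without replacement.
  update: "gcloud run jobs execute db-migrate --region us-central1 --wait",
  triggers: [image.repoDigest], // e.g. a docker.Image output; any change re-runs the command
});

// The Cloud Run service then takes { dependsOn: [migrate] } so migrations finish first.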

tomdavidson commented 4 weeks ago

@ysaakpr use a separate CD task rather than Pulumi? A thing only needs to be a Pulumi resource if you want it on the Pulumi dependency DAG, and it does not quite make sense to have resources that you can't delete. IMHO, if any of the CRUD verbs don't make sense for the use case, then keep it outside of Pulumi's world. With that said, there are several TF migration-related providers, but that doesn't quite jibe with me. I'd much rather use an actual DB migration tool than an infra-as-code tool.

As for DB migrations, are you changing the schema when provisioning infrastructure or when updating your software - using Pulumi to build and use a new Docker image? You could have the Docker service perform the migration before advertising it's ready. In k8s you can use Jobs or init containers...