Design: Reconsider the semantics of `cancel.Wrap`

iwahbe commented 5 months ago

Hello!

Vote on this issue by adding a 👍 reaction
If you want to implement this feature, comment to let us know (we'll work with you on design, scheduling, etc.)

Issue details

Following up on the discussion from https://github.com/pulumi/pulumi-go-provider/pull/207, it seems like the as-designed behavior of the cancel middleware is unintuitive/unappealing to some.

The current behavior as described by cancel's doc comment (PR pending) is:

// The `cancel` package provides a middle-ware that ties the Cancel gRPC call from Pulumi
// to Go's `context.Context` cancellation system.
//
// Wrapping a provider in `cancel.Wrap` ensures 2 things:
//
// 1. When a resource operation times out, the associated context is canceled.
//
// 2. When `Cancel` is called, all outstanding gRPC methods have their associated contexts
// canceled.
//
// A `cancel.Wrap`ed provider will still call the `Cancel` method on the underlying
// provider. If NotImplemented is returned, it will be swallowed.

This issue is a great place to comment on why the existing design should change.

Affected area/feature

iwahbe commented 5 months ago

For context, the bridge completely ignores the Cancel method:

SDK:

// Cancel requests that the provider cancel all ongoing RPCs. For TF, this is a no-op.
func (p *Provider) Cancel(ctx context.Context, req *pbempty.Empty) (*pbempty.Empty, error) {
    return &pbempty.Empty{}, nil
}

PF:

// SignalCancellation asks all resource providers to gracefully shut down and abort any ongoing operations. Operation
// aborted in this way will return an error (e.g., `Update` and `Create` will either a creation error or an
// initialization error. SignalCancellation is advisory and non-blocking; it is up to the host to decide how long to
// wait after SignalCancellation is called before (e.g.) hard-closing any gRPC connection.
func (p *provider) SignalCancellationWithContext(_ context.Context) error {
    // Some improvements are possible here to gracefully shut down.
    return nil
}

t0yv0 commented 5 months ago

The only point of implementing cancel would be to make some operation atomic with respect to cancellation to reduce the likelihood of exposing "partial states" exposed by interrupting such an operation in flight. Can we think of any? The stateful operation I'm thinking about is creating or updating a resource. It can be non-atomic if for example the provider creates a bucket first and then applies tags. Does this apply to the command provider, does it benefit from graceful cancel of a command? I'm apriori doubtful if any of the providers can meaningfully "gracefully cancel" these and would appreciate concrete evidence that it's helpful before throwing effort at implementing Cancel better at the generic level.

iwahbe commented 5 months ago

The only point of implementing cancel would be to make some operation atomic with respect to cancellation to reduce the likelihood of exposing "partial states" exposed by interrupting such an operation in flight.

I don't agree here. Implementing Cancel at a resource level can be used in at least 2 ways. To either:

Make a resource operation atomic by reverting a partial change.
Record a partial change into the state so a subsequent pulumi up knows about the partial state.

I can't think of how either of these would apply to pulumi-command, but I do think they would apply to a traditional cloud provider. They do apply in the bridge.

does it benefit from graceful cancel of a command? I'm apriori doubtful if any of the providers can meaningfully "gracefully cancel" these and would appreciate concrete evidence that it's helpful before throwing effort at implementing Cancel better at the generic level.

Yes. The best way to illustrate this is to compare the behavior of pulumi-command and an alternate build with the cancel middleware removed:

The command provider benefits from handing Cancel at the generic level by shutting down quickly with correct error messages. When Cancel is sent the the command provider while executing a lengthy command (via C-c), the response is instant:

Do you want to perform this update? yes
Updating (dev)

View in Browser (Ctrl+O): https://app.pulumi.com/pulumi/dev-yaml/dev/updates/134

Loading policy packs...

     Type                      Name          Status                  Info
     pulumi:pulumi:Stack       dev-yaml-dev  **failed**              1 error
 +   └─ command:local:Command  sleep         **creating failed**     1 error

Diagnostics:
  command:local:Command (sleep):
    error: signal: killed: running "sleep 10":

  pulumi:pulumi:Stack (dev-yaml-dev):
    error: update failed

Resources:
    1 unchanged

Duration: 7s

The provider shuts down cleanly and quickly. Pulumi feels responsive.

When running against a build of pulumi-command that excludes the cancel middleware, the provider ignores the first C-c sent to it. This feels unresponsive. On the second C-c, the provider is killed:

Updating (dev)

View in Browser (Ctrl+O): https://app.pulumi.com/pulumi/dev-yaml/dev/updates/137

Loading policy packs...

     Type                      Name          Status      
     pulumi:pulumi:Stack       dev-yaml-dev              
 +   └─ command:local:Command  sleep         created     

Resources:
    1 unchanged

Duration: 8s

error: update cancelled

Pulumi feels less responsive because the first C-c was ignored. Because the resource can't communicate back any error, the user can't see that Cancel was called while sleep 10 was running. This feedback might be significant if the command being run was aws s3 cp src dst, where partial states are possible.

In addition, the pulumi up display is wrong: sleep wasn't created and pulumi up will correctly try to create it again when pulumi up is run again.

I think the command provider experience is a pretty strong argument that providers need to exit when Cancel is called, and the best way we can get that "for free" IMO is with the cancel middleware's current semantics.

If a provider wants to customize the way it handles Cancel, it can do so at a resource by resource level simply by handling the cancelation or timeout of its associated context.Context. I expect most providers will not handle Cancel at the resource level, and canceling a request will simply return with an exit, a significant improvement in responsiveness over default behavior.

t0yv0 commented 5 months ago

Pulumi experience is specifically designed so the first Ctrl+C is not killing instantly so that providers can gracefully terminate. The second Ctrl+C does kill immediately. I think I can find refs for this in the p/p interrupt handler somewhere.

So for a provider, instead of ignoring Ctrl-C you can:

a) kill the process in response to Cancel.
b) if Cancel can block, then block until there're no gRPC requests outstanding and return; these protects gRPC requests as atomic; I'd expect this is available on gRPC server level

Either strategy feels like something with very few LOC and no concurrent code to worry about. To justify something sophisticated I'd love to see some worked out examples. I don't know of the bridge capability to gracefully cancel partial creates. I'm not sure what you mean by partial state in the engine, you mean "create failed" state? Any references there?

iwahbe commented 5 months ago

Pulumi experience is specifically designed so the first Ctrl+C is not killing instantly so that providers can gracefully terminate. The second Ctrl+C does kill immediately. I think I can find refs for this in the p/p interrupt handler somewhere.

That is my understanding. That means that the provider gets exactly one Cancel to respond to, since it can't respond to the second Ctrl+C.

a) kill the process in response to Cancel.

b) if Cancel can block, then block until there're no gRPC requests outstanding and return; these protects gRPC requests as atomic; I'd expect this is available on gRPC server level

Killing the process yourself doesn't help much, and Cancel can't block. Instead providers need to shut down gracefully as fast as possible (before a user sends the second Ctrl+C).

Coincidentally, a motivating example came up during hack week: https://github.com/pulumi/pulumi-go-provider/pull/210. pulumi-pulumiservice has a non-atomic create operation for Team, so it needs to handle created but uninitialized resources. Not responding instantly to Cancel increases the odds that we will leak a Team in the service, which we will always do if the provider is killed after the team has been created but before the Create call returns.

t0yv0 commented 5 months ago

Right, right I'm with you that killing the provider is not always great answer, but if you're saying cancel cannot block, what are you saying will happen in the service provider in that scenario? Are you saying that Cancel handler will issue cancellation over context.Context object which will make Team.Create abort work but respond to the engine with "InitFailed" so that can be saved in state? I mean this:

// 2. When `Cancel` is called, all outstanding gRPC methods have their associated contexts
// canceled.

I haven't seen the code referenced that actually responds to the cancelled context, but skimming the docs just now it might be implicit in lower level networking code... So any kind of request handlers will stop waiting.

So this makes sense if we have some form of error recovery, like saving errors into state with &rpc.ErrorResourceInitFailed (which I still would love a quick page on what that does precisely). If you want to speed up this error recovery this makes total sense. If it was just injecting unrecoverable errors it didn't make sense to me as it seemed to be swapping one way to crash for another way to crash.

iwahbe commented 5 months ago

Are you saying that Cancel handler will issue cancellation over context.Context object which will make Team.Create abort work but respond to the engine with "InitFailed" so that can be saved in state?

Yes! That's exactly what I'm saying.

I haven't seen the code referenced that actually responds to the cancelled context, but skimming the docs just now it might be implicit in lower level networking code... So any kind of request handlers will stop waiting.

Yes. Any high quality library with ongoing operations should cancel immediately. This includes go's standard library, for both network requests and subprocesses. Most providers (pulumi-command and pulumi-pulumiservice included) will get instant cancelation for free.

pulumi / pulumi-go-provider