add support for a Revision pre-start hook

duglin commented 4 years ago

In what area(s)?

/area API

Other classifications: /kind spec

Describe the feature

The notion of an application needing to connect to an external service is a common one. In the past Kube had the ServiceBroker project for this, and today there are other projects out there that are working on a similar set of features. That basic idea is that the environment tries to help the app author by creating a new instance of a service and then provides the connection info (url, creds, ...) via a secret - which is then mounted as a volume, or env vars, into the app.

When the app is already running, the user (or the service broker type of logic) can modify the config of the app (or ksvc) to include the pointer to the secret. However, the problem becomes harder when the user doesn't want the app to come up at all until the secret is attached to it because then it might generate failures while it's waiting for the secret to be attached.

This feature request is NOT to ask for ServiceBroker type of support in Knative. This feature request is for the ability for an admin to config some kind of pre-launch hook for a Revision so that they can add whatever logic they want before the KnServing controller tries to deploy the Revision.

This extensibility point could then be used for other purposes, but our immediate pain point is around Serving Bindings.

Some additional comments/thoughts:

admission web-hooks (that do some logic, like add the secret) won't work because we need the logic of these hooks to be done asynchronously at times - due to them potentially be long-running tasks
it's possible that the right place for this might be in Kube, but that's not clear since this might need to be part of some "higher level" logic that delays the creation of the Deployment until the hook's work is done
the reason Knative might be the right spot for this is because the user-flow that's being considered is that the user adds something like an annotation to the Ksvc to indicate they want some special logic to be done - in this case "connect to my instance of mySQL", e.g. --annotation serving-binding=dugsMySQL. And since we don't want to block the update/creation of the ksvc in etcd, we need some way to pause the serving controller so it waits until this work is done before it generates the Deployment for the Revision. In some ways this is sort of like the opposite of a Kube Finalizer
doing this above Knative could work, but then it might introduce some interesting challenges. For example, does it then make the "service binding" the top-level object on which the user is working on, rather than making it as an "extra config thing on the ksvc". Or would people to need to create a new CRD to represent ksvc+bindingRequest and then the ksvc is only created when the binding is ready? That, of course, forces a split in the user's workflow when they want to use serviceBindings.
it'll be interesting to see how this impacts the flow of the controller to introduce a 'pause', but in some ways it might not be that different from someone providing an invalid image name. Except in this case rather than treating it as an error, it just delays the next step in the reconciliation process.

Just looking for some brainstorming on this topic for now...

markusthoemmes commented 4 years ago

You can specify a ConfigMap/Secret mount with a specific path to an item in the ConfigMap. Pods will only start, once that item becomes available respectively. I don't quite recall if we're supporting that kind of a mount (yet).

That'd be a mechanism that lives purely in K8s, which I'd prefer over a custom mechanism on our side. There's also a concept called Initializers in K8s, see https://ahmet.im/blog/initializers/ for example. They read like the inverse of a Finalizer which would be exactly what you want for arbitrary initialization logic, that might not be surfaceable through ConfigMaps/Secrets.

duglin commented 4 years ago

re: configMap - yep I've thought about that and for quick demos it does seem to work. But at some point something gives up waiting. Not sure if its Knative or Kube or both, but I just tried letting it sit for a while and the Ksvc ends up in a "Failed" state eventually, and creating the CM after that time doesn't resurrect it. So, perhaps KnServing just needs to keep retrying forever? It might also need to not go into a "Failed" state, rather just sit in a "Pending" state.

re: Initializers - for some reason I thought people were moving away from those, but I suspect those might run into the same problem as configMaps w.r.t. something giving up after a while since the probes will probably fail

markusthoemmes commented 4 years ago

The exact error messages would be helpful, but I guess this will be the deadline we set on a deployment to become healthy. We only wait for an arbitrary amount of time today until we call failure.

duglin commented 4 years ago

Status:
  Conditions:
    Last Transition Time:        2020-04-14T12:34:10Z
    Message:                     Revision "echo-m759q" failed with message: configmap "mycm" not found.
    Reason:                      RevisionFailed
    Status:                      False
    Type:                        ConfigurationsReady

Which makes sense.

markusthoemmes commented 4 years ago

Oh yeah, my point above wasn't about the entire ConfigMap not being there but a key of it missing that explicitly state in the mount config.

duglin commented 4 years ago

ah, I don't think that would work because the app would still come up and then it would be up to it to watch for the CM entry (and only by vol mount, not env vars, which some people want to use) and I'd like to make this transparent for the user.

knative-housekeeping-robot commented 4 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh by adding the comment /remove-lifecycle stale. Stale issues rot after an additional 30 days of inactivity and eventually close. If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

duglin commented 4 years ago

/remove-lifecycle stale

mattmoor commented 4 years ago

/lifecycle frozen

@duglin I haven't read through this in detail, but is this something you want to discuss at an upcoming API WG meeting?

duglin commented 4 years ago

Yep I think that would be a good idea - just let me know which week you think would good for it.

duglin commented 4 years ago

@mattmoor gonna add this to this week's agenda... ran into another situation where this might be useful

duglin commented 4 years ago

We're running into a situation where we need to disable an account. This means that we don't want to delete the user's CRs, but we do want to delete the running pods associated with it. We also don't want to modify the user's CRs (eg set things to max-scale=0) because then we'd need to restore the old values once the account is re-enabled. Possible to do, but not ideal.

While brainstorming on ways to achieve this, this issue/feature came up and if this feature were viewed more as a "pause" state then it might be able to work more multiple scenarios. For the "pre-start hook" scenario, it would prevent the pods from even starting until something moved the ksvc out of "paused" state. For the "disable the account" scenario, the controller would view the "paused" state as a signal to not only not bring up new pods but also shutdown existing ones.

Just adding another aspect to the discussion.

Not sure yet what disabling the account would do for eventing resources. The obvious answer is to delete secondary resources associated with the CRs (e.g. unsubscribe from github), but more thinking is needed to ensure that won't cause unintended harm.

mattmoor commented 4 years ago

cc @n3wscott as this topic of suspecting sources has come up before 🤔

n3wscott commented 4 years ago

*Suspending

Yes, a pause feature to allow you to build up a "hot" cluster and be able to switch over to it and turn the sources on and have it ready to go. A disaster recovery type cluster.

duglin commented 4 years ago

That would be awesome - we have a similar need - the domain mapping stuff was part of that need too :-)

n3wscott commented 4 years ago

Pause is on the roadmap for sources in the "soon" category, like 6 months.

duglin commented 4 years ago

Got any kind of design doc yet?

scothis commented 4 years ago

We're running into a situation where we need to disable an account. This means that we don't want to delete the user's CRs, but we do want to delete the running pods associated with it.

If you set a ResourceQuota to allow zero Pods, will it terminate running Pods?

julz commented 4 years ago

@scothis looks like no according to the docs ("Neither contention nor changes to quota will affect already created resources."), but you could probably set one and then stop all pods in a namespace to achieve a ~similar effect.

duglin commented 4 years ago

Another possible need for a pre-start hook... build. While I suspect that many people think of "build" and "create ksvc" as two separate steps, and they're done sequentially, it actually makes it kind of awkward in a UI where from the user's POV they've asked for a ksvc to be created based on a yet-to-be-built image.

Abstractly, if there's a UX that presents something like:

service create echo --image myco/echo --source github.com/myco/echo

which means create an echo service from the source code in github/myco/echo. Now, it's very possible that the service command could orchestrate everything and only create the underlying ksvc after the build is done, however, since the ksvc won't be visible until late in the process it means the user doesn't know that the echo ksvc is "pending" if they list the ksvcs in the system. It would be nice to create the ksvc but don't let it actually deploy the revision until it was "unpaused" (meaning the image is ready).

Another interesting option here is to allow for kServices to not have any revisions - while it's not the same thing as "pausing a revision", it does solve the "pausing a ksvc" side of it.

duglin commented 3 years ago

Related to the previous comment... not creating the ksvc until the build is done means that we can't "reserve" the ksvc name until that time either. So, someone else could grab our ksvc name while our build is happening - even though it was available when the first use initiated the "service create" command.

A mechanism like this would allow for any kind of "pausing" regardless of the reason (building, service bindings, async checks, simply reserving the name for later user....). This issue isn't about supporting just "build" but rather supporting a "pausing" mechanism.

evankanderson commented 3 years ago

https://github.com/k8s-service-bindings/spec seems to have an example of the service-binding type pattern that Doug was talking about.

It seems like this issue has changed from "run some code before creating a Revision" to "be able to pause resource reconciliation for an arbitrary amount of time for specific resources". (I agree with julz and scothis that using ResourceQuota in combination with killing Pods seems like the better way to restrict running containers for an account.)

This seems like a pretty big feature space -- a concrete proposal for specific use cases would probably help narrow it down if we want to take action (vs hold open a feature request).

/kind proposal /triage needs-user-input

duglin commented 3 years ago

Not sure the issue has changed, it's about being able to do some work before the revision is reconciled. I'm not sure Quota can help here since it's not about pausing ALL Revisions, it's about pausing ONE Revision. In terms of usecases, the two main ones we have are:

do something similar to service bindings where there's processing needed to setup things (like secrets for creds to remote services), and add the links to those secrets into the revision's podSpec
do something more drastic like actually build the image used by the Revision before the system complains with a 404

But obviously, the mechanism should support any logic/reason. As I mentioned, I view this similar to a Finalizer, which pauses the delete until code is run. Except this is pausing the creation until some code is run.

See https://github.com/knative/serving/issues/7575#issuecomment-717902765 for why doing this above ksvc/revision is problematic.

julz commented 3 years ago

@duglin in concrete terms, are you proposing something like a spec.Paused field that causes the reconciler to ignore Ksvc changes if it's true? (I think a Paused field would be consistent with regular k8s Deployments which can be paused, so that seems reasonable to me at least, but I'm really not clear from this issue if that's what you're proposing or if there's something more/different being asked for)

duglin commented 3 years ago

I think it might need to be more than a single field. In the first comment I talk about how this is similar to a Finalizer, but on create. So my prototype is along the lines of:

add a check to the revision.reconcile to see if there's a label of the form: serving.knative.dev/pause-XXX: YYY, and if so then "return"

And that's pretty much it. Then some other component can add labels like service.knative.dev/pause-service-binding: myservice during the creation of the ksvc yaml. And there could be more than one since there may be multiple reasons for the pause. And some other component and look for these labels on the revision and remove it when it's processed whatever logic is associated with that label.

And, I made it a label instead of an annotation so that the controller watching for these labels can easily find/watch for them.

duglin commented 3 years ago

Feature proposal doc: https://docs.google.com/document/d/1GkVceSS_v1TCkG_gccl4w3DwgmKt_to3VsbBXyqRe28/edit# for our chat during next week's call.

evankanderson commented 3 years ago

/remove-triage needs-user-input

evankanderson commented 3 years ago

/triage accepted

knative / serving

add support for a Revision pre-start hook #7575

In what area(s)?

Describe the feature