kubernetes / cloud-provider-openstack


[cinder-csi-plugin] Cloud Configs are Cached #2284

Closed spjmurray closed 5 months ago

spjmurray commented 1 year ago

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug
/kind feature

What happened:

When you change a password in OpenStack, any associated application credentials stop working. So sometimes it's necessary (and, to be fair, good practice) to rotate the cloud config. Sadly the CSI plugin (and, I'll hazard a guess, the rest of them) caches the old value rather than going to the file system each time to check for an update, so it ceases working.

What you expected to happen:

Everything should expect the cloud config to change, pick up new versions and keep working.

How to reproduce it:

Replace the cloud config with a new one, deleting the old application credential. Provision a PVC and watch it complain that the application credential is gone. Kick the pod and it will suddenly start working again.

Anything else we need to know?:

I guess we could trick things into a reload by modifying the daemonset in some benign way, but that's suboptimal and requires every client to remember to do it.

Also, I'm up for doing the work if the discussion goes well; it seems easy enough on the surface. Famous last words!

Environment:

spjmurray commented 1 year ago

Of note, for once CAPO just worked; I'm guessing it must just read the Secret on each reconcile.

spjmurray commented 1 year ago

Note, the PoC code above is a little naive in that it creates a new client for each request. We could be more clever by doing a reflect.DeepEqual on the config (which, as we all know, is flaky as hell given Go's nil vs empty semantics), or we could marshal and do a string compare or something. I really don't see much of a performance benefit from this over-engineering though 🤷🏻 Prove me wrong!
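For illustration only, a minimal sketch of the "marshal and do a string compare" idea; the `Config` struct, its fields and the `fingerprint` helper are hypothetical stand-ins rather than the plugin's real types:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Config is a hypothetical stand-in for the parsed cloud config; the real
// cinder-csi-plugin type differs.
type Config struct {
	AuthURL                     string `json:"auth-url"`
	ApplicationCredentialID     string `json:"application-credential-id"`
	ApplicationCredentialSecret string `json:"application-credential-secret"`
}

// fingerprint marshals the config and hashes it, turning "has the config
// changed?" into a cheap string comparison instead of a reflect.DeepEqual
// with its nil-vs-empty pitfalls.
func fingerprint(c Config) (string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256(raw)), nil
}

func main() {
	cached := Config{AuthURL: "https://keystone.example.com/v3"}
	onDisk := Config{AuthURL: "https://keystone.example.com/v3"}

	a, _ := fingerprint(cached)
	b, _ := fingerprint(onDisk)
	fmt.Println("config changed:", a != b) // false: the configs are identical
}
```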

kayrus commented 1 year ago

> change a password in OpenStack, any associated application credentials stop working

Are you sure about this? Can you find any evidence that OpenStack changes the app creds hash according to the owner's password?

spjmurray commented 1 year ago

No, I'm not; after a test, it seems my boss lied to/misled me 😞 Both CSI and CAPO autoscaling appear to work.

That said, we have had a number of OpenStack upgrades over the past few months; perhaps it was a bug that got fixed between Xena and Zed 🤷🏻

Irrespective, it would still be nice to have short-lived application credentials so that any exposure of the secret is time-limited. Obviously the situation is far more serious if you are using a username/password, but that's your own fault!

spjmurray commented 1 year ago

Or perhaps, with hindsight, he was using username/password at the time...

mdbooth commented 1 year ago

Incidentally, CAPO is different in this regard: we can't cache these credentials as they can be different for each cluster. CPO is only running against a single OpenStack, so caching is a reasonable and sensible thing to do here. If it wasn't too complicated we would probably cache them in CAPO too, tbh.

spjmurray commented 1 year ago

So one prime example I'm thinking of for why not to cache, or at least to poll for changes:

It's not unreasonable to have those changes picked up, rather than me -- as a service provider -- having to restart a bunch of services manually or with a boatload of custom logic, right?

Anecdotally @mdbooth, caching would allow you to delete things in any order 😸

spjmurray commented 1 year ago

Actually, now I've thought about it, CAPO will use caching, in the sense that that's how controller-runtime works; it just keeps things up to date with a watch under the covers.

I did consider using a similar approach here, where you could reinitialize the client with an informer on secret change. However... I see that -- undoubtedly for security purposes -- the secret can also come from a hostPath mount, which pretty much kills off that idea.

But not all is lost; as I said, the PoC code is very naive. Perhaps a goroutine to poll the files on disk: when it sees a change it can atomically update the client. Best of both worlds?
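Something like the following is a rough sketch of that poll-and-swap idea; `cloudClient`, `newCloudClient`, the config path and the interval are placeholders, not what the plugin actually uses:

```go
package main

import (
	"bytes"
	"log"
	"os"
	"sync/atomic"
	"time"
)

// cloudClient stands in for whatever client the plugin builds from the
// cloud config; newCloudClient is a hypothetical constructor.
type cloudClient struct{ /* ... */ }

func newCloudClient(raw []byte) (*cloudClient, error) { return &cloudClient{}, nil }

// watchCloudConfig polls path and, when the bytes on disk change, builds a
// new client and swaps it in atomically. Callers always go through
// current.Load(), so in-flight requests keep the client they started with.
func watchCloudConfig(path string, interval time.Duration, current *atomic.Pointer[cloudClient]) {
	var last []byte
	for {
		raw, err := os.ReadFile(path)
		if err == nil && !bytes.Equal(raw, last) {
			if client, err := newCloudClient(raw); err == nil {
				current.Store(client)
				last = raw
				log.Printf("cloud config %s changed, client reloaded", path)
			} else {
				log.Printf("cloud config %s changed but is invalid: %v", path, err)
			}
		}
		time.Sleep(interval)
	}
}

func main() {
	var current atomic.Pointer[cloudClient]
	current.Store(&cloudClient{})
	go watchCloudConfig("/etc/config/cloud.conf", time.Minute, &current)
	// ... serve CSI requests, always fetching the client via current.Load() ...
	select {}
}
```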

kayrus commented 1 year ago

How do you deploy CPO? IMO using a secret hash to trigger a new helm deployment should be enough. If you're using hostPath, where do the secrets in the hostPath come from?

spjmurray commented 1 year ago

I see where you are going with this, and it's not that easy...

So we're creating the cloud config with https://github.com/kubernetes/cloud-provider-openstack/blob/master/charts/openstack-cloud-controller-manager/templates/_helpers.tpl#L51, then we reuse the secret defined there with the CSI plugin, so we never actually have a full cloud config available within the CSI module to actually hash.

I guess if push comes to shove I could recreate this generation logic in our provisioner and just pass in raw data to both the OCM and OCP; that would make it available where it's actually needed.

I do worry though, philosophically, whether using the annotation hack in 1000s of different projects around the world is the right thing to do, when you could just poll for changes and act accordingly. Opinionated, but 🤷🏻

mdbooth commented 1 year ago

I don't think that detecting changes in a file mounted in a pod is reliable, in general. The documentation says it will eventually be updated, but we'd probably have to poll it. I don't think we can rely on something like fsnotify because the mechanism used by the CRI to expose it to the pod is an opaque implementation detail, and may work, work differently, or not work at all depending on unknown factors.

If we were going to do this in the library, I believe a correct implementation could be:

We should still recommend using orchestration to restart the controller, but this could be a backup.

kayrus commented 1 year ago

Current gophercloud reauth implementation tries to reauth indefinitely. If we implement a provider.RetryFunc and catch 401, we can perform an action, e.g.:

- interrupt a process: k8s should restart a pod with new credentials
- dynamically re-read credentials and reauth a client
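As a very rough sketch of what reacting to the 401 could look like (this checks the error after a call rather than inside gophercloud's retry hook; `rebuildClientFromDisk`, the config path and the `exitOnFailure` switch are invented for illustration, and `ErrDefault401` assumes gophercloud v1):

```go
package main

import (
	"errors"
	"log"
	"os"

	"github.com/gophercloud/gophercloud"
)

// handleAuthFailure shows the two reactions from the list above once a 401
// has been detected; rebuildClientFromDisk is a hypothetical placeholder.
func handleAuthFailure(err error, exitOnFailure bool) {
	var unauthorized gophercloud.ErrDefault401
	if !errors.As(err, &unauthorized) {
		return // not an auth problem, nothing to do here
	}
	if exitOnFailure {
		// Option 1: interrupt the process; Kubernetes restarts the pod,
		// which re-reads the (hopefully rotated) credentials on startup.
		log.Print("credentials rejected (401), exiting so the pod restarts")
		os.Exit(1)
	}
	// Option 2: dynamically re-read credentials from disk and rebuild the
	// provider client, then let the caller retry the request.
	if rebuildErr := rebuildClientFromDisk("/etc/config/cloud.conf"); rebuildErr != nil {
		log.Printf("failed to reload credentials: %v", rebuildErr)
	}
}

// rebuildClientFromDisk is a placeholder for re-parsing the cloud config and
// re-authenticating a provider client.
func rebuildClientFromDisk(path string) error { return nil }

func main() {}
```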

mdbooth commented 1 year ago

> Current gophercloud reauth implementation tries to reauth indefinitely. If we implement a provider.RetryFunc and catch 401, we can perform an action, e.g.:
>
> - interrupt a process: k8s should restart a pod with new credentials
> - dynamically re-read credentials and reauth a client

The problem with this is the 'user surprise' I mentioned in the PR. As a user, if I rotate credentials I expect to see any problem with them immediately. It would be surprising, and make rollback harder, if the user didn't discover that the new credentials were invalid until the old credentials were removed.

kayrus commented 1 year ago

> The problem with this is the 'user surprise' I mentioned in the PR.

In order to eliminate the "surprise" factor, the deployment must be updated with new credentials hash.

> Create a goroutine to poll the file periodically and store a hash

It makes more sense to create a k8s client and watch for secrets update.
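For reference, a minimal client-go sketch of that "watch the secret" approach; the `kube-system` namespace, the `cloud-config` secret name, the `cloud.conf` key and the `onChange` callback are assumptions for illustration, not what CPO actually uses:

```go
package main

import (
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// watchCloudConfigSecret watches a single Secret and calls onChange with the
// new cloud.conf contents whenever the Secret is updated.
func watchCloudConfigSecret(onChange func([]byte), stop <-chan struct{}) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Scope the informer to the one secret we care about.
	factory := informers.NewSharedInformerFactoryWithOptions(clientset, 0,
		informers.WithNamespace("kube-system"),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.FieldSelector = "metadata.name=cloud-config"
		}))

	informer := factory.Core().V1().Secrets().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			secret, ok := newObj.(*corev1.Secret)
			if !ok {
				return
			}
			log.Printf("secret %s/%s updated, reloading cloud config", secret.Namespace, secret.Name)
			onChange(secret.Data["cloud.conf"])
		},
	})

	factory.Start(stop)
	return nil
}

func main() {
	stop := make(chan struct{})
	if err := watchCloudConfigSecret(func(conf []byte) {
		log.Printf("received %d bytes of updated cloud config", len(conf))
	}, stop); err != nil {
		log.Fatalf("failed to start secret watch: %v", err)
	}
	<-stop
}
```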

mdbooth commented 1 year ago

> > The problem with this is the 'user surprise' I mentioned in the PR.
>
> In order to eliminate the "surprise" factor, the deployment must be updated with new credentials hash.
>
> > Create a goroutine to poll the file periodically and store a hash
>
> It makes more sense to create a k8s client and watch for secrets update.

That's also what we do in OpenShift, and I think we'd be unlikely to use this feature if it merged. However, I also appreciate that just disabling the cache is simpler and may suit some use cases better.

spjmurray commented 1 year ago

I've updated https://github.com/kubernetes/cloud-provider-openstack/pull/2285 with another stab at PoC code. See what you think, state your mind, I know you will!

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 5 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/cloud-provider-openstack/issues/2284#issuecomment-2016768807):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.