kubernetes / cloud-provider-openstack


Possibly redundant calls to the Nova Metadata service #2224

Closed · pierreprinetti closed this 6 months ago

pierreprinetti commented 1 year ago

As discussed in #2217, calls to the Nova Metadata service could probably be reduced when the only goal is to retrieve the instance's ID.

Since the ID is never going to change, it could be fetched once and persisted in memory.

If there is consensus on a need for action on this one, I'd be glad to take it.

CC @zetaab @mdbooth @jichenjc

jichenjc commented 1 year ago

We try to read `const instanceIDFile = "/var/lib/cloud/data/instance-id"` in case the metadata service has a problem, and we think the local value might change: https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/openstack/instances.go#L514

But it seems we should honor the local value first, then try the metadata service and store the result as a cache (as proposed above)?
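
In code, that order could look something like the sketch below (purely illustrative; `fetchFromMetadataService` is a placeholder for the real metadata lookup, not an existing helper, and the package name is arbitrary):

```go
package openstack // illustrative package name

import (
	"os"
	"strings"
	"sync"
)

const instanceIDFile = "/var/lib/cloud/data/instance-id"

var (
	instanceIDMu     sync.Mutex
	cachedInstanceID string
)

// getInstanceID honors the local cloud-init file first, falls back to the
// Nova metadata service, and caches whichever value it found for later calls.
// fetchFromMetadataService is a placeholder, not an existing helper.
func getInstanceID(fetchFromMetadataService func() (string, error)) (string, error) {
	instanceIDMu.Lock()
	defer instanceIDMu.Unlock()

	if cachedInstanceID != "" {
		return cachedInstanceID, nil
	}

	// 1. Honor the local value written by cloud-init, if it exists.
	if data, err := os.ReadFile(instanceIDFile); err == nil {
		if id := strings.TrimSpace(string(data)); id != "" {
			cachedInstanceID = id
			return cachedInstanceID, nil
		}
	}

	// 2. Otherwise hit the metadata service once and cache the result.
	id, err := fetchFromMetadataService()
	if err != nil {
		return "", err
	}
	cachedInstanceID = id
	return cachedInstanceID, nil
}
```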

mdbooth commented 1 year ago

I think we can close this because we're already caching it: https://github.com/openshift/cloud-provider-openstack/blob/0eacab836b290551a6e058f7c6adb55747f1b591/pkg/util/metadata/metadata.go#L259-L260

So whether we're fetching from the config drive or the metadata service, both GetInstanceID and GetAvailabilityZone read the full metadata struct from the cache and therefore hit the metadata service at most once. As both of these values are static[^1], this seems fine.

[^1]: InstanceID can only be static. AZ can technically change, but only in a forced live migration by an OpenStack admin with an explicit destination host which violates the server's AZ constraint. This isn't a recommended way to do live migrations.

GetDevicePath doesn't use the cache, which is also correct. Fetching this every time is unavoidable.
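
For reference, the cached-getter shape described above is roughly the following (a simplified sketch of the pattern, not the actual contents of pkg/util/metadata/metadata.go; names and fields are illustrative, and `fetchMetadata` stands in for the config-drive/metadata-service lookup):

```go
package metadata // illustrative

// Metadata stands in for the struct parsed from the config drive or the
// metadata service.
type Metadata struct {
	UUID             string
	AvailabilityZone string
}

// metadataCache holds the result of the first successful fetch.
var metadataCache *Metadata

// getMetadata returns the cached struct when present; otherwise it fetches
// once and stores the result for every later call.
func getMetadata(fetchMetadata func() (*Metadata, error)) (*Metadata, error) {
	if metadataCache == nil {
		md, err := fetchMetadata()
		if err != nil {
			return nil, err
		}
		metadataCache = md
	}
	return metadataCache, nil
}
```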

There is a subtle problem with this code, though: setting the cache itself is racy. The cache is set in code which is called by the gRPC server, which is multi-threaded. 2 threads can obviously race to set the cache initially, but while that's a bit inefficient it's not actually a problem as they should both fetch and therefore cache the same data. However, setting a pointer in Go isn't an atomic operation, which means that reading the cache pointer and setting the cache pointer currently have undefined behaviour. This is probably the least problematic category of race conditions, but if we want to be 100% safe we should probably fix it with either a mutex or an atomic load and store.
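
A mutex would be the simplest fix; roughly like this (again a sketch against the simplified pattern above, not a patch for the real file; `sync/atomic`'s `atomic.Pointer[Metadata]` would be the lock-free alternative):

```go
package metadata // illustrative; reuses the Metadata type from the sketch above

import "sync"

var (
	metadataMu    sync.Mutex
	metadataCache *Metadata
)

// getMetadata serializes both the read and the write of the cache pointer,
// so concurrent gRPC handlers can no longer race on it. At worst a few early
// callers block on the lock, but only the first one actually fetches.
func getMetadata(fetchMetadata func() (*Metadata, error)) (*Metadata, error) {
	metadataMu.Lock()
	defer metadataMu.Unlock()

	if metadataCache == nil {
		md, err := fetchMetadata()
		if err != nil {
			return nil, err
		}
		metadataCache = md
	}
	return metadataCache, nil
}
```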

@pierreprinetti close this and create a new issue for the race?

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 6 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/cloud-provider-openstack/issues/2224#issuecomment-2007086976):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.