kubernetes-csi / external-provisioner

Sidecar container that watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume/DeleteVolume against a CSI endpoint
Apache License 2.0

CSI provisioner

The external-provisioner is a sidecar container that dynamically provisions volumes by calling the CreateVolume and DeleteVolume functions of CSI drivers. It is necessary because the internal persistent volume controller running in the Kubernetes controller-manager does not have any direct interface to CSI drivers.

Overview

The external-provisioner is an external controller that monitors PersistentVolumeClaim objects created by users and creates/deletes volumes for them. The Kubernetes Container Storage Interface (CSI) Documentation explains how to develop, deploy, and test a CSI driver on Kubernetes.

Compatibility

This information reflects the head of this branch.

| Compatible with CSI Version | Container Image | Min K8s Version | Recommended K8s Version |
| --- | --- | --- | --- |
| CSI Spec v1.9.0 | registry.k8s.io/sig-storage/csi-provisioner | 1.20 | 1.31 |

Feature status

Various external-provisioner releases come with different alpha / beta features. Check --help output for alpha/beta features in each release.

The following table reflects the head of this branch.

| Feature | Status | Default | Description | Provisioner Feature Gate Required |
| --- | --- | --- | --- | --- |
| Snapshots | GA | On | Snapshots and Restore. | No |
| CSIMigration | GA | On | Migrating in-tree volume plugins to CSI. | No |
| CSIStorageCapacity | GA | On | Publish capacity information for the Kubernetes scheduler. | No |
| ReadWriteOncePod | Beta | On | Single pod access mode for PersistentVolumes. | No |
| CSINodeExpandSecret | GA | On | CSI Node expansion secret | No |
| HonorPVReclaimPolicy | Beta | On | Honor the PV reclaim policy | No |
| PreventVolumeModeConversion | Beta | On | Prevent unauthorized conversion of source volume mode | --prevent-volume-mode-conversion (No in-tree feature gate) |
| VolumeAttributesClass | Beta | Off | Pass VolumeAttributesClass parameters during CreateVolume | --feature-gates=VolumeAttributesClass=true |
| CrossNamespaceVolumeDataSource | Alpha | Off | Cross-namespace volume data source | --feature-gates=CrossNamespaceVolumeDataSource=true |

All other external-provisioner features and the external-provisioner itself are considered GA and fully supported.

Usage

It is necessary to create a new service account and give it enough privileges to run the external-provisioner; see deploy/kubernetes/rbac.yaml. The provisioner is then deployed as a single Deployment, as illustrated below:

kubectl create -f deploy/kubernetes/deployment.yaml

The external-provisioner may run in the same pod with other external CSI controllers such as the external-attacher, external-snapshotter and/or external-resizer.

Note that the external-provisioner does not scale with more replicas. Only one external-provisioner instance is elected as leader and is active; the others wait for the leader to die. They elect a new active leader about 15 seconds after the death of the old leader.

Command line options

Recommended optional arguments

Storage capacity arguments

See the storage capacity section below for details.

Distributed provisioning

Other recognized arguments

Design

External-provisioner interacts with Kubernetes by watching PVCs and PVs and implementing the external provisioner protocol. The design document explains this in more detail.

Topology support

When the Topology feature is enabled* and the driver specifies VOLUME_ACCESSIBILITY_CONSTRAINTS in its plugin capabilities, external-provisioner prepares CreateVolumeRequest.AccessibilityRequirements while calling Controller.CreateVolume. The driver has to consider these topology constraints while creating the volume. The table below shows how these AccessibilityRequirements are prepared:

| Delayed binding | Strict topology | Allowed topologies | Immediate Topology | Resulting accessibility requirements |
| --- | --- | --- | --- | --- |
| Yes | Yes | Irrelevant | Irrelevant | Requisite = Preferred = Selected node topology |
| Yes | No | No | Irrelevant | Requisite = Aggregated cluster topology; Preferred = Requisite with selected node topology as first element |
| Yes | No | Yes | Irrelevant | Requisite = Allowed topologies; Preferred = Requisite with selected node topology as first element |
| No | Irrelevant | Yes | Irrelevant | Requisite = Allowed topologies; Preferred = Requisite with randomly selected node topology as first element |
| No | Irrelevant | No | Yes | Requisite = Aggregated cluster topology; Preferred = Requisite with randomly selected node topology as first element |
| No | Irrelevant | No | No | Requisite and Preferred both nil |

*) Topology feature gate is enabled by default since v5.0.
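
The table can be read as a small decision procedure. The following Python sketch only illustrates that table and is not the provisioner's actual code: the function name, its parameters, and the modelling of topology segments as plain strings are assumptions made for this example.

```python
import random


def accessibility_requirements(delayed_binding, strict_topology, allowed,
                               immediate_topology, cluster, selected=None,
                               rng=random):
    """Return (requisite, preferred) as described by the table above.

    allowed  -- allowed topologies from the storage class (empty if unset)
    cluster  -- aggregated cluster topology
    selected -- topology of the selected node (delayed binding only)
    All names here are hypothetical and chosen for illustration.
    """
    def with_first(segments, first):
        # Same segments, reordered so that `first` is the first element.
        # Assumes `first` is one of the requisite segments.
        return [first] + [s for s in segments if s != first]

    if delayed_binding:
        if strict_topology:
            return [selected], [selected]
        requisite = list(allowed) if allowed else list(cluster)
        return requisite, with_first(requisite, selected)

    if allowed:
        requisite = list(allowed)
    elif immediate_topology:
        requisite = list(cluster)
    else:
        return None, None
    # Without delayed binding there is no selected node, so one is picked
    # at random to lead the preferred list.
    return requisite, with_first(requisite, rng.choice(requisite))
```

For example, with delayed binding, no strict topology, and no allowed topologies, a cluster of `["a", "b", "c"]` with selected node topology `"b"` yields requisite `["a", "b", "c"]` and preferred `["b", "a", "c"]`.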

When enabling topology support in a CSI driver that had it disabled, please make sure the topology is first enabled in the driver's node DaemonSet and topology labels are populated on all nodes. The topology can then be updated in the driver's Deployment and its external-provisioner sidecar.

Capacity support

The external-provisioner can be used to create CSIStorageCapacity objects that hold information about the storage capacity available through the driver. The Kubernetes scheduler then uses that information when selecting nodes for pods with unbound volumes that wait for the first consumer.

All CSIStorageCapacity objects created by an instance of the external-provisioner have certain labels:

They get created in the namespace identified with the NAMESPACE environment variable.

Each external-provisioner instance manages exactly those objects with the labels that correspond to the instance.

Optionally, all CSIStorageCapacity objects created by an instance of the external-provisioner can have an owner. This ensures that the objects get removed automatically when uninstalling the CSI driver. The owner is determined with the POD_NAME/NAMESPACE environment variables and the --capacity-ownerref-level parameter. Setting an owner reference is highly recommended whenever possible (i.e. in the most common case that drivers are run inside containers).

If ownership is disabled the storage admin is responsible for removing orphaned CSIStorageCapacity objects, and the following command can be used to clean up orphaned objects of a driver:

kubectl delete csistoragecapacities -l csi.storage.k8s.io/drivername=my-csi.example.com

When switching from a deployment without ownership to one with ownership, managed objects get updated such that they have the configured owner. When switching in the other direction, the owner reference is not removed because the new deployment doesn't know what the old owner was.

To enable this feature in a driver deployment with a central controller (see also the deploy/kubernetes/storage-capacity.yaml example):

To determine how many different topology segments exist, external-provisioner uses the topology keys and labels that the CSI driver instance on each node reports to kubelet in the NodeGetInfoResponse.accessible_topology field. The keys are stored by kubelet in the CSINode objects and the actual values in Node annotations.

CSI drivers must report topology information that matches the storage pool(s) that they have access to, with a granularity that matches the most restrictive pool.

For example, if the driver runs in a node with region/rack topology and has access to per-region storage as well as per-rack storage, then the driver should report topology with region/rack as its keys. If it only has access to per-region storage, then it should just use region as key. If it uses region/rack, then redundant CSIStorageCapacity objects will be published, but the information is still correct. See the KEP for details.

For each segment and each storage class, CSI GetCapacity is called once with the topology of the segment and the parameters of the class. If there is no error and the capacity is non-zero, a CSIStorageCapacity object is created or updated (if it already exists from a prior call) with that information. Obsolete objects are removed.
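
The loop described above can be sketched as follows. This is a simplified illustration, not the actual implementation: `plan_capacity_updates`, `get_capacity`, and the `(segment, storage_class)` keys are hypothetical names, and the sketch assumes that "obsolete" means objects whose segment or storage class no longer exists. What happens to an existing object when a call fails or reports zero is not covered here.

```python
def plan_capacity_updates(segments, storage_classes, get_capacity, existing):
    """Plan one reconciliation pass over CSIStorageCapacity objects.

    get_capacity -- callable standing in for the CSI GetCapacity call
    existing     -- set of (segment, storage_class) keys that already
                    have an object
    Returns (to_upsert, to_delete).
    """
    to_upsert = {}
    current_pairs = set()
    for segment in segments:
        for storage_class in storage_classes:
            key = (segment, storage_class)
            current_pairs.add(key)
            try:
                # One call per segment and storage class.
                capacity = get_capacity(segment, storage_class)
            except Exception:
                continue  # error: nothing is created or updated for this pair
            if capacity != 0:
                to_upsert[key] = capacity
    # Assumption: objects for pairs that no longer exist are obsolete.
    to_delete = set(existing) - current_pairs
    return to_upsert, to_delete
```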

To ensure that CSIStorageCapacity objects get removed when the external-provisioner gets removed from the cluster, they all have an owner and therefore get garbage-collected when that owner disappears. The owner is not the external-provisioner pod itself but rather one of its parents as specified by --capacity-ownerref-level. This way, it is possible to switch between external-provisioner instances without losing the already gathered information.
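
As a rough illustration of choosing a parent as owner, the sketch below walks up an ownership chain starting at the pod. It assumes that the level counts the number of steps from the pod to the owning object; that interpretation, the function name, and the object names are assumptions for this example, so check the --help output for the exact semantics.

```python
def resolve_owner(pod, owners, level):
    """Walk `level` steps up the ownership chain, starting at the pod.

    owners -- dict mapping an object to its owner (hypothetical model of
              Kubernetes owner references)
    """
    obj = pod
    for _ in range(level):
        obj = owners[obj]
    return obj


# Hypothetical chain: Pod -> ReplicaSet -> Deployment
owners = {
    "pod/csi-provisioner-abc": "replicaset/csi-provisioner-5d",
    "replicaset/csi-provisioner-5d": "deployment/csi-provisioner",
}
owner = resolve_owner("pod/csi-provisioner-abc", owners, 2)
```

Because the owner in this example is the Deployment rather than the pod, replacing the pod leaves the objects in place, which matches the motivation given above.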

CSIStorageCapacity objects are namespaced and get created in the namespace of the external-provisioner. Only CSIStorageCapacity objects with the right owner are modified by external-provisioner and their name is generated, so it is possible to deploy different drivers in the same namespace. However, Kubernetes does not check who is creating CSIStorageCapacity objects, so in theory a malfunctioning or malicious driver deployment could also publish incorrect information about some other driver.

The deployment with distributed provisioning is almost the same as above, with some minor changes:

Deployments of external-provisioner outside the Kubernetes cluster are also possible, albeit only without an owner for the objects. NAMESPACE still needs to be set to some existing namespace also in this case.

CSI error and timeout handling

The external-provisioner invokes all gRPC calls to the CSI driver with the timeout provided by the --timeout command line argument (15 seconds by default).

The correct timeout value and number of worker threads depend on the storage backend and how quickly it is able to process ControllerCreateVolume and ControllerDeleteVolume calls. The value should be set to accommodate the majority of them. It is fine if some calls time out: such calls will be retried after exponential backoff (starting with 1s by default). However, this backoff will introduce a delay when the call times out several times for a single volume.

The frequency of ControllerCreateVolume and ControllerDeleteVolume retries can be configured with the --retry-interval-start and --retry-interval-max parameters. The external-provisioner starts retries with the retry-interval-start interval (1s by default) and doubles it with each failure until it reaches retry-interval-max (5 minutes by default). The external-provisioner stops increasing the retry interval when it reaches retry-interval-max; however, it still retries provisioning/deletion of a volume until it succeeds. The external-provisioner keeps its own count of provisioning/deletion failures for each volume.
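
To make the retry schedule concrete, the following sketch computes the successive retry intervals for one volume with the default values. The function name and parameters are made up for illustration; the real provisioner tracks this state internally.

```python
def retry_intervals(start=1.0, maximum=300.0, failures=10):
    """Return the wait (in seconds) before each of the first `failures` retries.

    start   -- corresponds to --retry-interval-start (1s by default)
    maximum -- corresponds to --retry-interval-max (5 minutes by default)
    """
    intervals = []
    interval = start
    for _ in range(failures):
        intervals.append(interval)
        # Double after each failure, but never exceed the maximum.
        interval = min(interval * 2, maximum)
    return intervals


print(retry_intervals())
```

With the defaults, the interval reaches the 5 minute cap at the tenth retry (1, 2, 4, ..., 256, 300 seconds) and stays there for all later retries.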

The external-provisioner can invoke up to --worker-threads (100 by default) ControllerCreateVolume and up to --worker-threads (100 by default) ControllerDeleteVolume calls in parallel, i.e. these two calls are counted separately. The external-provisioner assumes that the storage backend can cope with such a high number of parallel requests and that the requests are handled in a relatively short time (ideally sub-second). A lower value should be used for storage backends that process newly created or deleted volumes more slowly or can handle fewer parallel calls.

Details of error handling of individual CSI calls:

HTTP endpoint

The external-provisioner optionally exposes an HTTP endpoint at the address:port specified by the --http-endpoint argument. When set, these two paths are exposed:

Deployment on each node

Normally, external-provisioner is deployed once in a cluster and communicates with a control instance of the CSI driver which then provisions volumes via some kind of storage backend API. CSI drivers which manage local storage on a node don't have such an API that a central controller could use.

To support this case, external-provisioner can be deployed alongside each CSI driver on different nodes. The CSI driver deployment must:

Usage of --strict-topology and --immediate-topology=false is recommended because it makes the CreateVolume invocations simpler. Topology information is always derived exclusively from the information returned by the CSI driver that runs on the same node, without combining that with information stored for other nodes. This works as long as each node is in its own topology segment, i.e. usually with a single topology key and one unique value for each node.

Volume provisioning with late binding works as before, except that each external-provisioner instance checks the "selected node" annotation and only creates volumes if that node is the one it runs on. It also only deletes volumes on its own node.

Immediate binding is also supported, but not recommended. It is implemented by letting the external-provisioner instances assign a PVC to one of them: when they see a new PVC with immediate binding, they all attempt to set the "selected node" annotation with their own node name as value. Only one update request can succeed, all others get a "conflict" error and then know that some other instance was faster. To avoid the thundering herd problem, each instance waits for a random period before issuing an update request.
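
The "only one update can succeed" behaviour relies on optimistic concurrency. The sketch below models it with an in-memory object and a resource version check. `FakePVC`, `Conflict`, the annotation key, and `race_for_pvc` are all hypothetical stand-ins for the real API objects, and the model is a worst case in which every instance still issues its update after waiting.

```python
import random


class Conflict(Exception):
    """Raised when an update is based on a stale resource version."""


class FakePVC:
    """Minimal stand-in for a PVC stored in the API server."""

    def __init__(self):
        self.resource_version = 1
        self.annotations = {}

    def update(self, seen_version, key, value):
        # The write only succeeds if nobody changed the object in between.
        if seen_version != self.resource_version:
            raise Conflict()
        self.annotations[key] = value
        self.resource_version += 1


SELECTED_NODE = "example.com/selected-node"  # placeholder annotation key


def race_for_pvc(pvc, nodes, base_delay=20.0, rng=random):
    """Let one instance per node try to claim the PVC; return (winner, conflicts)."""
    seen = pvc.resource_version  # every instance observed the same new PVC
    # Each instance waits a random period, so the update order is random.
    order = sorted(nodes, key=lambda node: rng.uniform(0, base_delay))
    winner, conflicts = None, 0
    for node in order:
        try:
            pvc.update(seen, SELECTED_NODE, node)
            winner = node
        except Conflict:
            conflicts += 1  # some other instance was faster
    return winner, conflicts
```

In practice an instance may notice that the annotation is already set before its own wait period ends, which would presumably make real conflicts rarer than in this model.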

When a CreateVolume call fails with ResourcesExhausted, the normal recovery mechanism is used, i.e. the external-provisioner instance removes the "selected node" annotation and the process repeats. But this triggers events for the PVC and delays volume creation, in particular when storage is exhausted on most nodes. Therefore, before attempting to own a PVC, external-provisioner checks with GetCapacity whether the currently available capacity is sufficient for the volume. When it isn't, the PVC is ignored and some other instance can own it.

The --node-deployment-base-delay parameter determines the initial wait period. It also sets the jitter, so in practice the initial wait period will be in the range from zero to the base delay. If the value is high, volumes with immediate binding get created more slowly. If it is low, then the risk of conflicts while setting the "selected node" annotation increases and the apiserver load will be higher.

There is an exponential backoff per PVC which is used for unexpected problems. Normally, an owner for a PVC is chosen during the first attempt, so most PVCs will use the base delay. A maximum can nevertheless be set with --node-deployment-max-delay to avoid very long delays when something goes wrong repeatedly.
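
A minimal sketch of how the two delay parameters might combine is shown below. The source only states that the initial wait lies between zero and the base delay and that a maximum caps the backoff; the doubling factor and the use of jitter on later attempts are assumptions, as is the function name.

```python
import random


def claim_delay(failures, base_delay, max_delay, rng=random):
    """Return a wait period in seconds before trying to own a PVC.

    failures   -- number of earlier unexpected failures for this PVC
    base_delay -- corresponds to --node-deployment-base-delay
    max_delay  -- corresponds to --node-deployment-max-delay
    """
    # Assumption: the upper bound doubles per failure, capped at max_delay.
    upper = min(base_delay * (2 ** failures), max_delay)
    # Jitter: the actual wait is drawn uniformly from zero to the bound.
    return rng.uniform(0, upper)
```

With a base delay of 20 seconds, a first attempt (`failures=0`) waits somewhere between 0 and 20 seconds, so the average initial wait is about half the base delay.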

During scale testing with 100 external-provisioner instances, a base delay of 20 seconds worked well. When provisioning 3000 volumes, there were only 500 conflicts which the apiserver handled without getting overwhelmed. The average provisioning rate of around 40 volumes/second was the same as with a delay of 10 seconds. The worst-case latency per volume was probably higher, but that wasn't measured.

Note that the QPS settings of kube-controller-manager and external-provisioner have to be increased at the moment (Kubernetes 1.19) to provision volumes faster than around 4 volumes/second. Those settings will eventually get replaced with flow control in the API server itself.

Beware that if no node has sufficient storage available, then also no CreateVolume call is attempted and thus no events are generated for the PVC, i.e. some other means of tracking remaining storage capacity must be used to detect when the cluster runs out of storage.

Because PVCs with immediate binding get distributed randomly among nodes, they get spread evenly. If that is not desirable, then it is possible to disable support for immediate binding in distributed provisioning with --node-deployment-immediate-binding=false and instead implement a custom policy in a separate controller which sets the "selected node" annotation to trigger local provisioning on the desired node.

Deleting local volumes after a node failure or removal

When a node with local volumes gets removed from a cluster before deleting those volumes, the PV and PVC objects may still exist. It may be possible to remove the PVC normally if the volume was not in use by any pod on the node. However, normal deletion of the volume, and thus deletion of the PV, is no longer possible: the CSI driver instance on the node is not available or reachable anymore, so Kubernetes cannot be sure that it is okay to remove the PV.

When an administrator is sure that the node is never going to come back, then the local volumes can be removed manually:

It may also be necessary to scrub disks before reusing them because the CSI driver had no chance to do that.

If a PVC was still bound to that PV, it will then be moved to phase "Lost". It has to be deleted and re-created if still needed because no new volume will be created for it. Editing the PVC to revert it to phase "Unbound" is not allowed by the Kubernetes API server.

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page.

You can reach the maintainers of this project at:

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.