kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

Enable Simulation of automatically provisioned ReadWriteMany PVs #1487

Open joshatcaper opened 4 years ago

joshatcaper commented 4 years ago

What would you like to be added: A method to provide automatically provisioned ReadWriteMany PVs that are available on all workers.

Currently the storage provisioner that is being used can only provision ReadWriteOnce volumes.

Why is this needed: The current volume provisioner that is being used only supports creating ReadWriteOnce volumes. This is because kind uses the rancher local-path-provisioner, which hard-codes its provisioner to disallow any PVCs with an access mode other than ReadWriteOnce. Many managed Kubernetes providers supply some type of distributed file system. I'm currently using Azure Storage File (which is SMB/CIFS under the hood) for this use case in production. Google's Kubernetes Engine offers ReadOnlyMany out of the box.

Possible solutions: Could we have the control plane node start up an NFS container backed by a ReadWriteOnce volume?

Thanks for your time!

BenTheElder commented 4 years ago

NFS from an overlayfs requires a 4.15+ kernel IIRC. Currently kind imposes no additional requirements on kernel version beyond what kubernetes does upstream.

I don't think we want to start imposing any kernel requirement yet, or the overhead of running & managing NFS by default.

kind of course supports installing additional drivers, preferably with CSI.

IMHO it makes more sense to run this as an addon. cc @msau42 @pohly.

We can discuss other Read* modes upstream in the rancher project.

BenTheElder commented 4 years ago

Ah, I hadn't had a need for RWM. https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes

even ReadOnlyMany is going to require some kind of network attached storage or something, since the "many" is nodes not pods (my mistake)

I don't think rancher / local storage is going to do read across nodes :sweat_smile:

probably the best solution here is to document some yaml to apply for getting an NFS provisioner installed on top of a standard kind cluster.
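
As a very rough sketch of what such a doc might boil down to (the chart repository here is a placeholder, not a recommendation):

# Install some NFS server + dynamic provisioner into the cluster
helm repo add <some-repo> https://example.com/charts              # placeholder repo hosting an NFS provisioner chart
helm install nfs-provisioner <some-repo>/nfs-server-provisioner

# The chart should register an NFS-backed StorageClass usable for ReadWriteMany claims
kubectl get storageclass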

joshatcaper commented 4 years ago

@BenTheElder ah, didn't know NFS required a newer kernel in this instance. Would it be possible to do something similar with docker volumes instead? The following docker-compose example should back the containers with a shared volume that is consistent-ish:

version: "2.3"
services:
  control-plane0:
    image: k8s.gcr.io/pause
    volumes:
      - rwmpvc:/rwmpvc
  worker0:
    image: k8s.gcr.io/pause
    volumes:
      - rwmpvc:/rwmpvc
  worker1:
    image: k8s.gcr.io/pause
    volumes:
      - rwmpvc:/rwmpvc

volumes:
  rwmpvc:

The output of docker-compose up && docker container inspect <container> will show:

        ...
        "Mounts": [
            {
                "Type": "volume",
                "Name": "test_rwmpvc",
                "Source": "/var/lib/docker/volumes/test_rwmpvc/_data",
                "Destination": "/rwmpvc",
                "Driver": "local",
                "Mode": "rw",
                "RW": true,
                "Propagation": ""
            }
        ],
        ...

Using an approach like this would not require any NFS server to be run inside the containers. The PV provisioner just needs to consistently derive the host path in the same way, e.g. /rwmpvc/<uuid>, on each host.
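
To make that concrete, a statically provisioned volume on such a shared mount might look like the following (the class name and path are purely illustrative, and something still has to create the per-claim directories):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-pv-example
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: shared-hostpath       # hypothetical class name
  hostPath:
    path: /rwmpvc/example-claim-dir       # one directory per claim under the shared docker volume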

BenTheElder commented 4 years ago

This will work in backends where the nodes are all on a single machine (which we may not guarantee in the future) IF we write a custom provisioner.

IMHO it's better to just provide an opt-in NFS solution you can deploy and document it.

It should just be a kubectl apply away from installing an NFS provisioner as long as you have an updated kernel.

msau42 commented 4 years ago

Agree, I think an opt-in NFS tutorial would be the best option here for users that need it.

We don't have any great options from the sig-storage perspective; most solutions already assume you have an NFS server set up somewhere.

joshatcaper commented 4 years ago

I don't know if this is possible but would there be some way to abstract this from the end user using some method of packaging and enabling "addons" similar to minikube? I don't know about the long term goals of kind but from an outsider's perspective it seems like a wonderful way to deploy an ephemeral copy of software in a CI stage. I was investigating it as a method to run some end-to-end integration testing on my company's software. I'd really like it if the configurations I end up applying to the created cluster very closely match what I'd push to a real cluster; otherwise I'd be worried about running into the same issues you hit when you build a "dev" and "production" version of a binary and only test against your "dev" builds, never your production build.

I don't know if addons are a clean way of accomplishing this goal but I think the utility of kind for the in-CI-deployment workflow would be greatly helped by something that completely hides from the end user that this isn't a real managed kube cluster. Obviously, though, having some way to do this is better than having no way of doing this.

Interested in your thoughts.

BenTheElder commented 4 years ago

I don't know if this is possible but would there be some way to abstract this from the end user using some method of packaging and enabling "addons" similar to minikube?

Hi, regarding addons: we're not bundling addons at this time.

That approach tends to be problematic for users as it couples the lifecycle of the addons to the version of the cluster tool.

SIG Cluster Lifecycle seems to agree and the future of addon work there seems to be the cluster addons project, which involves a generic system on top of any cluster. We're tracking that work and happy to integrate when it's ready https://github.com/kubernetes-sigs/kind/issues/253

In the meantime, addons tend not to be any different from any other cluster workload; they can be managed with kubectl, helm, kustomize, kpt, etc.

For an example of a more involved "addon" that isn't actually bundled with kind but does have kind config dependencies, see https://kind.sigs.k8s.io/docs/user/ingress/

I don't know about the long term goals of kind but from an outsider's perspective it seems like a wonderful way to deploy an ephemeral copy of software in a CI stage.

This gives a rough idea where our priorities are at, which do include supporting this more or less https://kind.sigs.k8s.io/docs/contributing/project-scope/

I was investigating it as a method to run some end-to-end integration testing on my company's software. I'd really like it if the configurations I end up applying to the created cluster very closely match what I'd push to a real cluster; otherwise I'd be worried about running into the same issues you hit when you build a "dev" and "production" version of a binary and only test against your "dev" builds, never your production build.

We have a KubeCon talk about this: https://kind.sigs.k8s.io/docs/user/resources/#testing-your-k8s-apps-with-kind--benjamin-elder--james-munnelly

I don't know if addons are a clean way of accomplishing this goal but I think the utility of kind for the in-CI-deployment workflow would be greatly helped by something that completely hides from the end user that this isn't a real managed kube cluster. Obviously, though, having some way to do this is better than having no way of doing this.

Clusters have a standard API in KUBECONFIG and the API endpoint. Unfortunately, for portability reasons we can't quite hide that this isn't the same as your real cluster; a lot of extension points break down here, including but not limited to:

For these you'll want to provide your own wrapper of some sort to ensure that the kind cluster matches your prod more closely (e.g. mimicking the custom storage classes from your prod cluster, trying to run a similar or the same ingress..)
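
For example, a small CI wrapper along these lines (all file names and manifests are hypothetical stand-ins for whatever matches your prod cluster) keeps the kind-specific glue in one place:

#!/usr/bin/env bash
set -euo pipefail

kind create cluster --name ci --config kind-config.yaml   # kind-specific topology / port mappings
kubectl apply -f ci/storageclass.yaml                     # mimic the storage classes from prod
kubectl apply -f ci/ingress.yaml                          # run the same (or similar) ingress as prod
kubectl apply -f manifests/                               # the manifests you would ship to the real cluster
./run-e2e-tests.sh                                        # your test suite
kind delete cluster --name ci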

BenTheElder commented 4 years ago

nfs-common will be installed on the nodes going forward, which should enable NFS volumes. You still need to run an NFS server somehow.
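
For reference, once an NFS server is reachable from the nodes, an NFS-backed PV is just standard Kubernetes; a sketch (the server address and export path are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-example
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.10      # placeholder: address of your NFS server
    path: /exports/share   # placeholder: exported directory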

BenTheElder commented 4 years ago

(also confirmed that it works, the kubernetes NFS e2e tests pass)

BenTheElder commented 4 years ago

requires a 4.16 kernel: https://www.phoronix.com/scan.php?page=news_item&px=OverlayFS-NFS-Export-Linux-4.16

danquah commented 4 years ago

Just did a verification of this feature.

I first made sure kubernetes was cloned to ${GOPATH}/src/k8s.io/kubernetes as described in https://kind.sigs.k8s.io/docs/user/working-offline/#prepare-kubernetes-source-code

I then built my own node-image using the latest base-image with nfs-common via the following (takes a while!)

kind build node-image --image kindest/node:master --base-image kindest/base:v20200610-99eb0617 --kube-root "${GOPATH}/src/k8s.io/kubernetes"

Next I created a cluster using the new node-image via

kind create cluster --config kind-config.yaml

Using the following kind-config.yaml

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:master

I then pulled and loaded the nfs-provisioner image to prepare for installation

docker pull quay.io/kubernetes_incubator/nfs-provisioner
kind load docker-image quay.io/kubernetes_incubator/nfs-provisioner

The provisioner could then be installed via Helm (Helm was installed separately).

helm repo add stable https://kubernetes-charts.storage.googleapis.com/
helm install nfs-provisioner stable/nfs-server-provisioner 

And I was then finally able to provision an NFS volume via the following PVC:

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-dynamic-volume-claim
spec:
  storageClassName: "nfs"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Mi
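
A minimal pod that mounts the claim (names here are illustrative) can be used to check that the volume is actually writable:

apiVersion: v1
kind: Pod
metadata:
  name: rwx-test-pod
spec:
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo hello > /data/hello && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-dynamic-volume-claim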

Everything worked like a charm - looking forward to the next Kind release :)

LordNoteworthy commented 4 years ago

Nice! I am currently looking for this. When will this be released?

BenTheElder commented 4 years ago

0.9.0 delayed for various reasons. we'll re-evaluate and set a new target date soon.

danquah commented 4 years ago

@BenTheElder any updates on the new target date? Trying to determine whether to base some internal setup on our own build of kind or whether there will be a release in the near future we can use instead.

BenTheElder commented 4 years ago

Sorry I missed this comment (sweeping issues now). v0.9.0 was re-scheduled to match k8s v1.19, but some last-minute fixes are still pending, so we didn't cut the release today (k8s did). I expect to have those merged by tomorrow.

koxu1996 commented 3 years ago

This is a side note, but it might be useful for someone. When I updated the node image from 18.8 to 19.1, the NFS Helm chart no longer works properly: memory fills up in a few seconds. I investigated the problem and it seems rpc.statd from the nfs-utils package is outdated and leaking memory.

BenTheElder commented 3 years ago

That's unfortunate. We're shipping the latest version available in the distro at the moment (Ubuntu 20.10); if it's fixed in Ubuntu we'll pick it up in a future kind image.

koxu1996 commented 3 years ago

@BenTheElder Now I think it might be something different. Here is how I reproduce the issue:

$ kind create cluster --image [NODE_IMAGE]
$ helm install stable/nfs-server-provisioner --generate-name
# wait 30s until 100% memory is filled up

The issue is present when I use the most recent node images:

List of node images that work without problems:

Note: I tried building the latest node image from kind:v0.9.0 sources and it works fine :confused:

BenTheElder commented 3 years ago

1.19.0 isn't the latest image (please see the kind release notes as usual), and all of the images that are current were built with the same version; there were no changes to the base image or node image build process between those builds and tagging the release.

koxu1996 commented 3 years ago

@BenTheElder Sorry, I pasted the correct digest 98cf52888646 but the wrong (lower) version - it should be the latest, v1.19.1:

$ docker pull kindest/node:v1.19.1
v1.19.1: Pulling from kindest/node
Digest: sha256:98cf5288864662e37115e362b23e4369c8c4a408f99cbc06e58ac30ddc721600
Status: Image is up to date for kindest/node:v1.19.1
docker.io/kindest/node:v1.19.1

So the issue is present for the latest node image. I am trying to track down what was changed in the latest node image update.

aojea commented 3 years ago

I'm almost sure it is because of this: https://github.com/kubernetes-sigs/kind/pull/1799

but I keep thinking that it is an NFS bug :smile: https://github.com/kubernetes-sigs/kind/pull/760#issuecomment-519299299

@koxu1996 you should limit the file descriptors at the OS level

koxu1996 commented 3 years ago

@aojea Indeed, I bisected KinD commits and this is the culprit: https://github.com/kubernetes-sigs/kind/commit/2f17d2532084a11472bb464ccdc1285caa7c4583.

I use Arch BTW :laughing: and the kernel limit on file descriptors is really high:

$ sudo sysctl -a | grep "fs.nr_open"
fs.nr_open = 1073841816

To work around the NFS issue you can change the kernel-level limit, e.g.:

sudo sysctl -w fs.nr_open=1048576

or you could use a custom node image.
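
If you go the sysctl route, the usual way to make it persist across reboots is a drop-in file (the file name is just a convention):

echo "fs.nr_open = 1048576" | sudo tee /etc/sysctl.d/99-nfs-fd-limit.conf
sudo sysctl --system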

Edit:

I asked the nfs-utils maintainer about this bug and got the following reply:

This was fixed by the following libtirpc commit:

commit e7c34df8f57331063b9d795812c62cec3ddfbc17 (tag: libtirpc-1-2-7-rc3)
Author: Jaime Caamano Ruiz <jcaamano@suse.com>
Date:   Tue Jun 16 13:00:52 2020 -0400

libtirpc: replace array with list for per-fd locks

Which is in the latest RC release libtirpc-1-2-7-rc4

BenTheElder commented 3 years ago

Looks like that libtirpc fix is not packaged yet. I'm not sure how we want to proceed here.

BenTheElder commented 3 years ago

I think we should try to make sure libtirpc is updated and document how to set up an NFS provisioner. I'm not sure if this is in scope to have in the default setup, but it's certainly in scope to put a guide on the site.

backtrackshubham commented 3 years ago

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-dynamic-volume-claim
spec:
  storageClassName: "nfs"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Mi

This fails with an error saying:

Finished building Kubernetes
Building node image ...
Building in kind-build-1623416466-881282865
Image build Failed! Failed to pull Images: command "docker exec --privileged kind-build-1623416466-881282865 cat /etc/containerd/config.toml" failed with error: exit status 1
ERROR: error building node image: command "docker exec --privileged kind-build-1623416466-881282865 cat /etc/containerd/config.toml" failed with error: exit status 1
Command Output: cat: /etc/containerd/config.toml: No such file or directory

BenTheElder commented 3 years ago

Well, first of all you should not need to build new node images; we've had multiple releases since https://github.com/kubernetes-sigs/kind/issues/1487#issuecomment-644813581, and they already contain all of the changes.

... and the reason that's failing is that the base image specified in the command in that comment is very outdated versus current kind. You can skip all the image building steps; NFS should just work now, and we run NFS tests in CI. There are no changes to kind needed, just the cluster objects installed at runtime for your NFS service / PVs.

backtrackshubham commented 3 years ago

Hey @BenTheElder, thanks for the comment. When I tried using the storage class nfs, the PVC went into the Pending state, and describing the PVC showed that the cluster doesn't have the storage class "nfs". I understand that you have suggested running an NFS server somewhere, but my question is: in the current version of kind, can we create (after setting up my NFS server) PVCs with access mode ReadWriteMany? I went through the issues in order to find something on this but was not able to, so any help or suggestions would be wonderful.

BenTheElder commented 3 years ago

When I tried using the storage class nfs, the PVC went into the Pending state, and describing the PVC showed that the cluster doesn't have the storage class "nfs".

yes, we don't have the storage class because that has to refer to a specific NFS setup, and that's something you can choose and install at runtime

I understand that you have suggested running an NFS server somewhere,

yes, https://github.com/kubernetes-sigs/kind/issues/1487#issuecomment-644813581 starting from "I then pulled and loaded the nfs-provisioner image to prepare for installation" is still relevant as one approach. The part before that with the custom image is not.

but my question is: in the current version of kind, can we create (after setting up my NFS server) PVCs with access mode ReadWriteMany?

Yes, in any version NFS has ReadWriteMany; it's just that NFS could not work in a nested container environment when the project was started (issues in the Linux kernel actually, not in kind itself). It can now. (See also: https://github.com/kubernetes-sigs/kind/issues/1806)

I don't specifically work with this, but NFS in kind is not special (versus another cluster tool) anymore.

We just need someone to document doing this.
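
A rough outline of what that guide could look like, taking the runtime-only approach from the earlier comment (the Helm repository is left as a placeholder, since the old "stable" chart repository has since been retired):

# 1. Install an NFS server + dynamic provisioner into a stock kind cluster (no custom node image needed)
helm install nfs-provisioner <repo>/nfs-server-provisioner    # <repo> = wherever the chart is hosted today

# 2. Request ReadWriteMany storage from the "nfs" storage class the chart creates
kubectl apply -f - <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: rwx-example
spec:
  storageClassName: "nfs"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Mi
EOF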

backtrackshubham commented 3 years ago

Thanks, I am still in a phase of understanding and learning about K8s, and many thanks to the devs and contributors of kind. I will see how to do it. Thanks! 😃

backtrackshubham commented 3 years ago

Hi @BenTheElder, thanks for all the guidance and ideas. I was successfully able to deploy an NFS server with mode RWM using the steps that you and other devs indicated on a Linux system. But now that I am trying to move the same setup to a Mac (Docker Desktop), I can see that the pod for the nfs provisioner is failing with (upon describing):

Warning  FailedScheduling  33m (x2 over 33m)   default-scheduler  0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.

And then it eventually gets into a crash loop. I found this answer suggesting some change, but I would like to understand what exactly has changed between the two systems. Could it be because of the resources? On the Linux system the kind cluster was flying with 24G RAM, but here on the Mac it's 6 CPUs, 4 GB mem, 2 GB swap and a 200 GB HDD.

Thanks

BenTheElder commented 3 years ago

You should also consider running fewer nodes; kind tries to be as light as possible, but kubeadm recommends something like 2GB per node for a more typical cluster IIRC 😅

Kubernetes does not yet use swap effectively, and actually officially requires it to be off, though we set an option to make it run anyhow.

node.kubernetes.io/not-ready is not a taint you should have to remove, and kind in general should never require you to manually remove taints; this means the nodes are not healthy (which is a very general symptom).
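
To see why the nodes are not healthy, a few generic checks usually narrow it down (the node name below is a placeholder):

kubectl get nodes -o wide
kubectl describe node <node-name>    # check the Conditions and Events sections
kind export logs ./kind-logs         # collects kubelet/containerd logs from every node container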

EDIT: If you need more help with that please file a different issue for your case since it's not related to RWM PVs, so folks monitoring this can avoid being notified spuriously, and so we can track your issue directly. We can cross link them for reference. The new issue template also requests useful information for debugging.

meln5674 commented 3 months ago

On the off chance anyone is still watching this, local-path-provisioner has supported RWX volumes for a few releases now, and with v0.0.27 now supports multiple storage classes with a single deployment.

Unless I've overlooked something, I think it should be reasonable to automatically create a RWX storage class for single-node clusters. Supporting multi-node clusters could be accomplished by mounting the same host volume at the same location in each node container, which could be exposed by a new field in the configuration. This would even support future multi-host setups if the user is made responsible for mounting network storage at that location on each host out-of-band.
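
For reference, mounting the same host directory into every node container is already expressible with extraMounts; a sketch (paths here are illustrative):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /tmp/kind-shared      # any directory on the host
    containerPath: /var/shared      # same path inside every node container
- role: worker
  extraMounts:
  - hostPath: /tmp/kind-shared
    containerPath: /var/shared
- role: worker
  extraMounts:
  - hostPath: /tmp/kind-shared
    containerPath: /var/shared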

I would be happy to start work on a PR for this if the idea isn't rejected out of hand.

mosesdd commented 2 months ago

@meln5674 I created a workaround in my environment for this:

kubectl -n local-path-storage patch configmap local-path-config -p '{"data": {"config.json": "{\n\"sharedFileSystemPath\": \"/var/local-path-provisioner\"\n}"}}'

As long as you use a single-node configuration, this works totally fine.

See https://github.com/rancher/local-path-provisioner?tab=readme-ov-file#definition for details
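
With that patch in place (the provisioner pod may need a restart to pick up the new config), a ReadWriteMany claim against kind's default storage class should be able to bind on a single-node cluster. This is only a sketch, assuming a recent local-path-provisioner release:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: rwx-local-path
spec:
  storageClassName: standard        # kind's default local-path-backed storage class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Mi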