storageos / storageos.github.io

Public documentation for StorageOS, persistent storage for Docker and Kubernetes
https://docs.storageos.com

StorageOS can't create a volume in 2-node on-premise Kubernetes cluster #196

Closed zhukovsd closed 5 years ago

zhukovsd commented 6 years ago

Hello.

What happened:

  1. Create a volume with storageos volume create test-volume
  2. This volume fails to mount with storageos volume mount test-volume /test
  3. Kubernetes also fails to mount it via a PersistentVolume + PersistentVolumeClaim (I reproduced this by following the "Pre-provisioned Persistent Volumes" guide in the StorageOS docs - https://docs.storageos.com/docs/install/kubernetes/preprovisioned; see the sketch after this list)
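For reference, a minimal sketch of the pre-provisioned PV + PVC pair from that guide (the names and sizes here are hypothetical and the in-tree storageos volume source is assumed; see the linked docs for the exact manifests):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-test-volume
spec:
  capacity:
    storage: 15Gi
  accessModes:
    - ReadWriteOnce
  storageos:
    volumeName: test-volume
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-volume
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 15Gi
EOF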

What you expected to happen: I expected the StorageOS volume to operate normally, or the StorageOS CLI to report what is wrong.

How to reproduce it (as minimally and precisely as possible): I'm not sure, since the problem is probably caused by hardware, my Kubernetes cluster, or another external factor. Please see below for more details about the environment.

Anything else we need to know?: I run a 2-node on-premises Kubernetes cluster. The scheduler does not schedule any pods on the master node, so the only active node is the single worker node. Thus, the StorageOS DaemonSet contains one pod:

NAME        DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
storageos   1         1         1         1            1           <none>          2d

Also, when I run storageos version on the master node, it fails with Get http://storageos-cluster/version: failed to dial all known cluster members, (127.0.0.1:5705) because the master node itself does not run a StorageOS pod. So I run all storageos CLI commands from the worker node shell.
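For completeness, it should also be possible to run the CLI from the master by pointing it at the worker's API endpoint (a sketch, assuming the CLI honors the STORAGEOS_HOST environment variable; the address is node1's, from the cluster health output below):

export STORAGEOS_HOST=159.69.116.1:5705
storageos version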

storageos cluster health does not display any issues:

root@node1 ~ # storageos cluster health
NODE   ADDRESS       CP_STATUS  DP_STATUS
node1  159.69.116.1  Healthy    Healthy

storageos volume ls displays volume status as active:

root@node1 ~ # storageos volume ls
NAMESPACE/NAME       SIZE  MOUNT  SELECTOR  STATUS  REPLICAS  LOCATION
default/test1        15GB                   active  0/0       node1 (healthy)

While playing with the StorageOS interactive demos on Katacoda - http://play.storageos.com, I noticed that StorageOS creates a folder in /var/lib/storageos/volumes for every created volume.

On my machine, this folder is empty. Note the drwxrwxrwx permissions on the volumes folder - I set them to rule out permissions as the reason StorageOS is unable to create a volume.

root@node1 ~ # ls -la /var/lib/storageos/volumes/
total 8
drwxrwxrwx 2 root root 4096 Aug 24 00:20 .
drwxr-xr-x 8 root root 4096 Aug 24 19:49 ..
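One way to double-check whether this directory sits under a shared mount on the host is with findmnt (a sketch; findmnt is part of util-linux):

findmnt -T /var/lib/storageos/volumes -o TARGET,PROPAGATION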

Also, I found that after calling storageos volume create, /var/lib/storageos/logs/storageos.log contains a number of log entries, one of which has level=error:

time="2018-08-24T22:38:54Z" level=error msg="filesystem client: presentation create failed" action=create error="<nil>" module=statesync reason="Create refused by validator" volume_uuid=0a8f6232-c0f5-9ed1-a213-5669ed9533ae                                                                                      
time="2018-08-24T22:38:54Z" level=info msg="virtual bool FsConfig::PresentationEventSemantics::Validate(event_type): Not adding pr_filename '0a8f6232-c0f5-9ed1-a213-5669ed9533ae' for volume 238534 - already exists for volume 238534 category=fscfg level=warn" category=fscfg module=supervisor                 
time="2018-08-24T22:38:54Z" level=info msg="validator 'device_validator' rejected Event{type CREATE} category=libcfg level=warn" category=libcfg module=supervisor

I found a StackOverflow discussion that seems relevant - https://stackoverflow.com/questions/51292759/rancher-kubernetes-and-storageos-persistent-storage-volume-mount-issue However, this thread does not contain a solution.

Environment:

Please let me know if any other details would be useful. I also posted the same issue on the Kubernetes GitHub, in case this problem is caused by Kubernetes rather than StorageOS.

Thanks.

Arau commented 6 years ago

Hi @zhukovsd

Thank you for such a detailed report. What you are describing sounds very much like a mount propagation issue, as the empty /var/lib/storageos/volumes directory indicates. However, Kubernetes 1.10+ enables mount propagation by default, and you enabled it for Docker too. Another possibility depends on how the kubelet is running. Does the kubelet run as a container or as a process?

Can you run storageos volume inspect default/test1 and post the output, please? Also, did you try to create a second volume?

Finally, would you please check whether the kernel module tcm_loop is loaded? lsmod | grep tcm_loop
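If it turns out not to be loaded, loading it manually would look something like this (a sketch):

modprobe tcm_loop
lsmod | grep tcm_loop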

Could you also tell us whether you are using an installation with CSI or without?

If you feel that posting some outputs here might be inappropriate, you can open a support ticket by sending an email to support@storageos.com and I'll add the rest of your information there.

Kind regards

zhukovsd commented 6 years ago

Hi @Arau

Output of storageos volume inspect default/test1:

root@node1 ~ # storageos volume inspect default/test1            
[                                                                
    {                                                            
        "id": "d6cbc394-99ab-18be-ec77-2c3e811aaba4",            
        "inode": 7674,                                           
        "name": "test1",                                         
        "size": 15,                                              
        "pool": "default",                                       
        "fsType": "ext4",                                        
        "description": "",                                       
        "labels": {},                                            
        "namespace": "default",                                  
        "nodeSelector": "",                                      
        "master": {                                              
            "id": "06d43006-6a3f-9538-755a-250e1cdb5610",        
            "inode": 155187,                                     
            "node": "e3538fe5-580a-8c71-20a3-3cfea8d4af8d",      
            "nodeName": "node1",                                 
            "controller": "",                                    
            "controllerName": "",                                
            "health": "healthy",                                 
            "status": "active",                                  
            "createdAt": "2018-08-24T20:56:08.228946602Z"        
        },                                                       
        "mounted": false,                                        
        "mountDevice": "",                                       
        "mountpoint": "",                                        
        "mountedAt": "2018-08-24T21:04:08.263153845Z",           
        "replicas": [],                                          
        "health": "",                                            
        "status": "active",                                      
        "statusMessage": "",                                     
        "mkfsDone": true,                                        
        "mkfsDoneAt": "2018-08-24T21:01:06.630522068Z",          
        "createdAt": "2018-08-24T20:56:08.220145175Z",           
        "createdBy": "storageos"                                 
    }                                                            
]                                                                

Also, did you try to create a second volume? Yes: the problem I encountered seems to be consistent and occurs for every volume created with storageos volume create.

Output of lsmod | grep tcm_loop:

root@node1 ~ # lsmod | grep tcm_loop
tcm_loop               24576  9
target_core_mod       352256  32 target_core_iblock,tcm_loop,target_core_user,target_core_file,target_core_pscsi
scsi_mod              225280  8 sd_mod,virtio_scsi,tcm_loop,target_core_mod,libata,sr_mod,sg,target_core_pscsi

In case it helps with debugging or any further investigation, here is a way to reproduce the problem:

  1. Create an account in Hetzner cloud provider - https://console.hetzner.cloud
  2. Provision 2 machines of CX11 configuration (1 vCPU, 2 GB RAM, 20 GB SSD), for OS, choose Debian 9
  3. Set up a 2-node Kubernetes cluster with Kubespray (see the sketch after this list)
  4. Install StorageOS on top of that cluster as explained here - https://docs.storageos.com/docs/install/kubernetes/index
  5. Create a new volume with storageos volume create. Then, inspect the /var/lib/storageos/volumes directory and /var/lib/storageos/logs/storageos.log
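For step 3, a rough sketch of a typical Kubespray run (the repo layout and inventory paths are assumptions and may differ across versions; at the time the repo lived under kubernetes-incubator):

git clone https://github.com/kubernetes-incubator/kubespray.git
cd kubespray
pip install -r requirements.txt
cp -rfp inventory/sample inventory/mycluster
# list both Hetzner machines in inventory/mycluster/hosts.ini, then:
ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml -b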
Arau commented 5 years ago

Hi @zhukovsd,

While we try to reproduce this with Kubespray, would you mind checking one more thing that can tell us whether the issue is with the Kubernetes integration or internal?

Can you exec into the storageos container to see whether the device files (the volumes dir) are present inside the container?

POD=storageos-fn9pf # Set the storageos pod name 
kubectl -n storageos exec $POD -it -- ls -l --color /var/lib/storageos/volumes

Regards

zhukovsd commented 5 years ago

Hi @Arau

It seems that kubectl -n storageos exec $POD -it -- ls -l --color /var/lib/storageos/volumes is not exactly correct; I changed it to kubectl exec -it $POD -- ls -l --color /var/lib/storageos/volumes in order to exec into the StorageOS pod $POD and run ls /var/lib/storageos/volumes inside it.

Here is my output:

root@master ~ # kubectl exec -it $POD -- ls -l --color /var/lib/storageos/volumes
total 11010048
brw-rw---- 1 0 6       8, 16 Aug 30 10:51 060c2361-89f2-ea07-fc46-ae11e46d2620
-rw-rw---- 1 0 6 16106127360 Aug 30 10:51 06d43006-6a3f-9538-755a-250e1cdb5610
brw-rw---- 1 0 6       8, 32 Aug 30 10:51 0a8f6232-c0f5-9ed1-a213-5669ed9533ae
-rw-rw---- 1 0 6  1073741824 Aug 30 10:51 356cbec3-0734-226a-96e6-94c568431c8c
-rw-rw---- 1 0 6  5368709120 Aug 30 10:51 535fe1f5-e058-8523-38bd-8533953a58fc
-rw-rw---- 1 0 6 16106127360 Aug 30 10:51 5c097377-887c-fea7-7798-0f76cc7323c2
-rw-rw---- 1 0 6  5368709120 Aug 30 10:51 b2dd48e1-21c8-77d0-bf10-aa3bce36fc6c
-rw-rw---- 1 0 6  1073741824 Aug 30 10:51 be26300d-56dc-f275-c7b9-599df96dbb53
-rw-rw---- 1 0 6  5368709120 Aug 30 10:51 bst-130492
-rw-rw---- 1 0 6  5368709120 Aug 30 10:51 bst-147282
-rw-rw---- 1 0 6  1073741824 Aug 30 10:51 bst-163119
-rw-rw---- 1 0 6  1073741824 Aug 30 10:51 bst-204320
-rw-rw---- 1 0 6 16106127360 Aug 30 10:51 bst-7674
-rw-rw---- 1 0 6 16106127360 Aug 30 10:51 bst-86840
brw-rw---- 1 0 6       8, 48 Aug 30 10:51 c413d248-dfc0-5c81-ca40-646ea658f7a1
brw-rw---- 1 0 6       8, 64 Aug 30 10:51 d6cbc394-99ab-18be-ec77-2c3e811aaba4
brw-rw---- 1 0 6       8, 80 Aug 30 10:51 f0d7d8d3-052e-6575-ecf1-255c533bac7b
brw-rw---- 1 0 6       8, 96 Aug 30 10:51 f726e3dc-03ee-c4ef-4084-1dfe5fc5fb86

In addition, here is my output for storageos volume ls, in case it is useful:

root@master ~ # storageos volume ls
NAMESPACE/NAME       SIZE  MOUNT  SELECTOR  STATUS  REPLICAS  LOCATION
default/db-storage1  1GB                    active  0/0       node1 (healthy)
default/redis-vol01  1GB                    active  0/0       node1 (healthy)
default/test1        15GB                   active  0/0       node1 (healthy)
default/test2        15GB                   active  0/0       node1 (healthy)
default/test3        5GB                    failed  0/1       -
default/test4        5GB                    active  0/0       node1 (healthy)
default/test5        5GB                    active  0/0       node1 (healthy)
default/test6        5GB                    failed  0/0       -
Arau commented 5 years ago

Hi @zhukovsd,

Based on the output in your last message, we can say with confidence that the issue you see is because MountPropagation is not enabled in your cluster. The fact that the device files are present inside the StorageOS container but not outside on the host (in /var/lib/storageos/volumes) is clear proof.

We have been investigating why this happens even though you mentioned that MountPropagation is enabled. Since Kubernetes 1.10 that feature gate has been enabled by default; however, I've seen that the default Kubespray installation sets it to false. Running ps auxwwwf and checking the kubelet args, I can see --feature-gates=PersistentLocalVolumes=False,VolumeScheduling=False,MountPropagation=False as a parameter, hence MountPropagation is set to False.
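A quick way to pull just the feature gates out of the running kubelet command line (one possible one-liner):

ps auxwwwf | grep -o 'feature-gates=[^ ]*'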

To enable it, adjust the Kubespray Ansible config file inventory/mycluster/group_vars/k8s-cluster.yml accordingly: the option local_volume_provisioner_enabled has to be set to true, and the playbook redeployed.

local_volume_provisioner_enabled: true
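Then redeploy the playbook, for example (the inventory paths are the Kubespray defaults and may differ):

ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml -b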

After that, the ps output is as follows: --feature-gates=PersistentLocalVolumes=True,VolumeScheduling=True,MountPropagation=True

zhukovsd commented 5 years ago

Hi @Arau

I enabled the local_volume_provisioner_enabled: true option in Kubespray and redeployed the cluster. I also re-installed StorageOS from scratch:

# Install StorageOS as a daemonset with CSI and RBAC support
git clone https://github.com/storageos/deploy.git storageos
cd storageos/k8s/deploy-storageos/CSI
./deploy-storageos.sh

It worked nicely.
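For anyone hitting the same issue, a quick sanity check using the commands from earlier in this thread:

storageos volume create test-volume
ls -la /var/lib/storageos/volumes/   # the device file should now appear on the host
storageos volume mount test-volume /test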

In hindsight, I got stuck on this problem because I did not expect Kubespray to disable mount propagation when Kubernetes itself enables it by default.

It would be great if StorageOS pointed this out more explicitly. For example:

Thanks for your help.