oVirt / ovirt-openshift-extensions

Implementation of flexvolume driver and provisioner for oVirt
Apache License 2.0

FailedAttachVolume event during pod creation even when using VM uuid #110

Closed: TFK1410 closed this issue 5 years ago

TFK1410 commented 5 years ago

Description

A newly created pod with a PV attached doesn't spin up and is stuck in the ContainerCreating state with the FailedAttachVolume event repeating. This issue is similar to #100, whose fix was supposed to cover this scenario.

Steps To Reproduce

  1. Install ovirt-openshift-extensions.
  2. Deploy the storageclass below:
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: ovirt
    provisioner: ovirt-volume-provisioner
    parameters:
      ovirtStorageDomain: "default"
      ovirtDiskThinProvisioning: "true"
      fsType: ext4
  3. Create an example PVC with the contents below:
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: pv-test
      annotations:
        volume.beta.kubernetes.io/storage-class: ovirt
    spec:
      storageClassName: ovirt
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
  4. Create any example pod from the service catalog in the same project and mount the PVC to it.
  5. Check the events of the pod in creation:
    oc describe po ovirt-pv-test-5-jfqr6
    Events:
    Type     Reason              Age   From                     Message
    ----     ------              ----  ----                     -------
    Normal   Scheduled           13s   default-scheduler        Successfully assigned sre/ovirt-pv-test-5-jfqr6 to ...
    Warning  FailedAttachVolume  13s   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-cffb6e8c-3a9a-11e9-a47a-001a4a160206" : invalid character 'B' after top-level value
  6. Verify that the VM UUIDs are used in the flexvolume-driver config:
    url=https://.../ovirt-engine/api
    username=theStorageAdmin@internal
    password=...
    insecure=True
    cafile=
    ovirtVmId=9AB8C3E1-9C3C-448F-9FCB-8567AA33A85D

Expected behavior

The pod is successfully deployed.

Versions:

rgolangh commented 5 years ago

I need the logs of the kube-controller-manager to see the underlying error:

kubectl logs -n kube-system -l component=kube-controller-manager --since=1h

tux-o-matic commented 5 years ago

Working with @TFK1410 on this, I think these could be the useful lines from the controller logs:

I0227 12:57:34.079610       1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-72440cf7-3a8e-11e9-a47a-001a4a160206" (UniqueName: "flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206") from node "<NODE_NAME>" 
E0227 12:57:34.563498       1 driver-call.go:253] Failed to unmarshal output for command: attach, output: "400 Bad Request", error: invalid character 'B' after top-level value
W0227 12:57:34.563550       1 driver-call.go:144] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ovirt~ovirt-flexvolume-driver/ovirt-flexvolume-driver, args: [attach {"kubernetes.io/fsType":"ext4","kubernetes.io/pvOrVolumeName":"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206","kubernetes.io/readwrite":"rw"} <NODE_NAME>], error: exit status 1, output: "400 Bad Request"
E0227 12:57:34.563713       1 nestedpendingoperations.go:267] Operation for "\"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\"" failed. No retries permitted until 2019-02-27 12:59:36.563671214 +0000 UTC m=+1306764.803048330 (durationBeforeRetry 2m2s). Error: "AttachVolume.Attach failed for volume \"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\" (UniqueName: \"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\") from node \"<NODE_NAME>\" : invalid character 'B' after top-level value"
I0227 12:57:34.563767       1 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"sre", Name:"ovirt-pv-test-1-6xcv4", UID:"eef60fc9-3a8e-11e9-894d-001a4a1601d8", APIVersion:"v1", ResourceVersion:"17303484", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-72440cf7-3a8e-11e9-a47a-001a4a160206" : invalid character 'B' after top-level value
rgolangh commented 5 years ago

Thanks. I see the error from the API call to oVirt says 'Bad Request'; it might be that the node name or ID isn't being sent in the flexvolume driver request. From the logs I see:

W0227 12:57:34.563550 1 driver-call.go:144] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ovirt~ovirt-flexvolume-driver/ovirt-flexvolume-driver, args: [attach {"kubernetes.io/fsType":"ext4","kubernetes.io/pvOrVolumeName":"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206","kubernetes.io/readwrite":"rw"} ]

Where NODE_NAME should be the real node name.
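
If it helps, the same attach call can be reproduced by hand on the host running the kube-controller-manager, using the exact arguments from the failing driver call above (a debugging sketch only; substitute the real node name):

# run on the controller-manager host; args copied from the driver-call log line above
/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ovirt~ovirt-flexvolume-driver/ovirt-flexvolume-driver \
  attach \
  '{"kubernetes.io/fsType":"ext4","kubernetes.io/pvOrVolumeName":"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206","kubernetes.io/readwrite":"rw"}' \
  <NODE_NAME>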

What is the output of oc get nodes?

tux-o-matic commented 5 years ago

The output of oc get nodes matches the node name from the earlier log.

Then the journal from the masters only shows logs from when we deployed the driver, not from the actual attempt to attach a volume:

Feb 27 13:40:17 master-<NAME> origin-node[52498]: I0227 13:40:17.684563   52498 kuberuntime_manager.go:385] No sandbox for pod "ovirt-flexvolume-driver-7ppv8_default(d5e9f9b6-3a8c-11e9-894d-001a4a1601d8)" can be found. Need to start a new one
Feb 27 13:40:18 master-<NAME> origin-node[52498]: I0227 13:40:18.566317   52498 kubelet.go:1865] SyncLoop (PLEG): "ovirt-flexvolume-driver-7ppv8_default(d5e9f9b6-3a8c-11e9-894d-001a4a1601d8)", event: &pleg.PodLifecycleEvent{ID:"d5e9f9b6-3a8c-11e9-894d-001a4a1601d8", Type:"ContainerStarted", Data:"9a378e4c63abd66ab445dd85c7cbe61243e2d6bdf63974445c0d94be5fb25e12"}
rgolangh commented 5 years ago

Let's see the journal on the node then.
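
For example, something like this (assuming the node service unit is origin-node, as the log prefix above suggests; adjust the time window to cover the failed attach):

journalctl -u origin-node --since "1 hour ago" | grep -i volume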

tux-o-matic commented 5 years ago

On that node the journal had something in that time range.

Feb 27 13:57:21 <NODE_NAME> origin-node[75113]: E0227 13:57:21.188530   75113 kubelet.go:1599] Unable to mount volumes for pod "ovirt-pv-test-1-6xcv4_sre(eef60fc9-3a8e-11e9-894d-001a4a1601d8)": timeout expired waiting for volumes to attach or mount for pod "sre"/"ovirt-pv-test-1-6xcv4". list of unmounted volumes=[volume-pqxdn]. list of unattached volumes=[volume-pqxdn default-token-qqg4w]; skipping pod
Feb 27 13:57:21 <NODE_NAME> origin-node[75113]: E0227 13:57:21.188639   75113 pod_workers.go:186] Error syncing pod eef60fc9-3a8e-11e9-894d-001a4a1601d8 ("ovirt-pv-test-1-6xcv4_sre(eef60fc9-3a8e-11e9-894d-001a4a1601d8)"), skipping: timeout expired waiting for volumes to attach or mount for pod "sre"/"ovirt-pv-test-1-6xcv4". list of unmounted volumes=[volume-pqxdn]. list of unattached volumes=[volume-pqxdn default-token-qqg4w]
Feb 27 13:57:26 <NODE_NAME> origin-node[75113]: I0227 13:57:26.171808   75113 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-72440cf7-3a8e-11e9-a47a-001a4a160206" (UniqueName: "flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206") pod "ovirt-pv-test-1-6xcv4" (UID: "eef60fc9-3a8e-11e9-894d-001a4a1601d8")
Feb 27 13:57:26 <NODE_NAME> origin-node[75113]: E0227 13:57:26.179196   75113 nestedpendingoperations.go:267] Operation for "\"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\"" failed. No retries permitted until 2019-02-27 13:59:28.179152063 +0100 CET m=+1307744.338005120 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\" (UniqueName: \"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\") pod \"ovirt-pv-test-1-6xcv4\" (UID: \"eef60fc9-3a8e-11e9-894d-001a4a1601d8\") "
Feb 27 13:57:32 <NODE_NAME> origin-node[75113]: W0227 13:57:32.924486   75113 plugin-defaults.go:32] flexVolume driver ovirt/ovirt-flexvolume-driver: using default GetVolumeName for volume pvc-72440cf7-3a8e-11e9-a47a-001a4a160206
Feb 27 13:59:28 <NODE_NAME> origin-node[75113]: I0227 13:59:28.216938   75113 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-72440cf7-3a8e-11e9-a47a-001a4a160206" (UniqueName: "flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206") pod "ovirt-pv-test-1-6xcv4" (UID: "eef60fc9-3a8e-11e9-894d-001a4a1601d8")
Feb 27 13:59:28 <NODE_NAME> origin-node[75113]: E0227 13:59:28.223512   75113 nestedpendingoperations.go:267] Operation for "\"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\"" failed. No retries permitted until 2019-02-27 14:01:30.223468819 +0100 CET m=+1307866.382321865 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\" (UniqueName: \"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\") pod \"ovirt-pv-test-1-6xcv4\" (UID: \"eef60fc9-3a8e-11e9-894d-001a4a1601d8\") "
Feb 27 13:59:35 <NODE_NAME> origin-node[75113]: E0227 13:59:35.852597   75113 kubelet.go:1599] Unable to mount volumes for pod "ovirt-pv-test-1-6xcv4_sre(eef60fc9-3a8e-11e9-894d-001a4a1601d8)": timeout expired waiting for volumes to attach or mount for pod "sre"/"ovirt-pv-test-1-6xcv4". list of unmounted volumes=[volume-pqxdn]. list of unattached volumes=[volume-pqxdn default-token-qqg4w]; skipping pod
Feb 27 13:59:35 <NODE_NAME> origin-node[75113]: E0227 13:59:35.852724   75113 pod_workers.go:186] Error syncing pod eef60fc9-3a8e-11e9-894d-001a4a1601d8 ("ovirt-pv-test-1-6xcv4_sre(eef60fc9-3a8e-11e9-894d-001a4a1601d8)"), skipping: timeout expired waiting for volumes to attach or mount for pod "sre"/"ovirt-pv-test-1-6xcv4". list of unmounted volumes=[volume-pqxdn]. list of unattached volumes=[volume-pqxdn default-token-qqg4w]
Feb 27 13:59:50 <NODE_NAME> origin-node[75113]: W0227 13:59:50.981326   75113 plugin-defaults.go:32] flexVolume driver ovirt/ovirt-flexvolume-driver: using default GetVolumeName for volume pvc-72440cf7-3a8e-11e9-a47a-001a4a160206
rgolangh commented 5 years ago

I suspect your user, theStorageAdmin, doesn't have enough permission on the ovirt-engine side to perform the attach action. Examine /var/log/ovirt-engine/engine.log for that. Guessing by the name of the user, you did follow the document on how to create the oVirt user, but the permission side may still be missing. For a test, can you try the admin@internal user?
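
A quick way to check that on the engine host (the search strings are just guesses at how a permission denial is logged):

grep -iE 'no permission|not_authorized' /var/log/ovirt-engine/engine.log | tail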

tux-o-matic commented 5 years ago

Indeed the oVirt Engine log shows a permission issue:

2019-02-27 13:58:25,895+01 INFO  [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-2250) [f2aab36c-407e-4d29-a3e3-8c3ebada4dd1] No permission found for user 'b3665e4b-7e6b-41e6-850b-bcc4147392b5' or one of the groups he is member of, when running action 'AttachDiskToVm', Required permissions are: Action type: 'USER' Action group: 'CONFIGURE_VM_STORAGE' Object type: 'VM'  Object ID: 'c7d0b43e-e52a-41e6-a046-109b80eaab67'.
2019-02-27 13:58:25,895+01 WARN  [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-2250) [f2aab36c-407e-4d29-a3e3-8c3ebada4dd1] Validation of action 'AttachDiskToVm' failed for user theStorageAdmin@internal-authz. Reasons: VAR__ACTION__ATTACH_ACTION_TO,VAR__TYPE__DISK,USER_NOT_AUTHORIZED_TO_PERFORM_ACTION
2019-02-27 13:58:25,896+01 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-2250) [] Operation Failed: [User is not authorized to perform this action.]

If I look at the allowed actions for the StorageAdmin role, the closest I can find are:

rgolangh commented 5 years ago

The role is fine; we just need it to be applied on the object that includes the VMs and storage domains. In your case, log in to ovirt-engine as admin@internal (for example), go to the relevant cluster, and grant the 'theStorageAdmin' user the StorageAdmin role on that cluster (the cluster the VMs are running on).

tux-o-matic commented 5 years ago

The user already has that role. But is that specific to the oVirt cluster? If it were missing that role, how else would it be able to create virtual disks in the storage domain?

rgolangh commented 5 years ago

Three things form a permission: user + role + object. Make sure the role is applied for that user on the data center or cluster. You already had the disk created; now it needs to be attached, and that is a different permission requirement.

tux-o-matic commented 5 years ago

The user is already listed under "permissions" for the oVirt datacenter. Are we talking about the permissions as defined in the Ansible playbook:

    permissions:
      - state: present
        user_name: theAdmin
        authz_name: internal-authz
        role: StorageAdmin
        object_type: data_center
        object_name: Default

In that context, what is Default? We have one user, theStorageAdmin; it has the StorageAdmin role and is assigned to the oVirt datacenter where our VMs are running. Shouldn't that cover it? At least creating virtual disks in the storage domain works.

rgolangh commented 5 years ago

OK, I see what is missing: the user needs one more role in order to successfully attach disks to VMs. All you need is to add another permission with the 'UserVmManager' role, so the whole permission snippet should look like this:

    permissions:

      - state: present
        user_name: theAdmin
        authz_name: internal-authz
        role: StorageAdmin
        object_type: data_center
        object_name: Default

      - state: present
        user_name: theAdmin 
        authz_name: internal-authz
        role: UserVmManager 
        object_type: data_center
        object_name: Default

And yes, 'Default' is the name of the Data Center you will apply the permissions to.

TFK1410 commented 5 years ago

I can confirm that adding the UserVmManager permission to the oVirt user has fixed the issue. The volumes are now properly mounting to the pods.
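
(For anyone verifying the same fix: once the permission is in place, the pod events should show SuccessfulAttachVolume instead of FailedAttachVolume; the pod name below is just the one from the original report.)

oc describe po ovirt-pv-test-5-jfqr6 | grep -i attachvolume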

rgolangh commented 5 years ago

@TFK1410 thank you