Closed: TFK1410 closed this issue 5 years ago.
I need the logs of the kube-controller-manager to see the underlying error:
kubectl logs -n kube-system -l component=kube-controller-manager --since=1h
Working with @TFK1410 on this, I think these could be the useful lines from the controller logs:
I0227 12:57:34.079610 1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-72440cf7-3a8e-11e9-a47a-001a4a160206" (UniqueName: "flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206") from node "<NODE_NAME>"
E0227 12:57:34.563498 1 driver-call.go:253] Failed to unmarshal output for command: attach, output: "400 Bad Request", error: invalid character 'B' after top-level value
W0227 12:57:34.563550 1 driver-call.go:144] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ovirt~ovirt-flexvolume-driver/ovirt-flexvolume-driver, args: [attach {"kubernetes.io/fsType":"ext4","kubernetes.io/pvOrVolumeName":"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206","kubernetes.io/readwrite":"rw"} <NODE_NAME>], error: exit status 1, output: "400 Bad Request"
E0227 12:57:34.563713 1 nestedpendingoperations.go:267] Operation for "\"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\"" failed. No retries permitted until 2019-02-27 12:59:36.563671214 +0000 UTC m=+1306764.803048330 (durationBeforeRetry 2m2s). Error: "AttachVolume.Attach failed for volume \"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\" (UniqueName: \"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\") from node \"<NODE_NAME>\" : invalid character 'B' after top-level value"
I0227 12:57:34.563767 1 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"sre", Name:"ovirt-pv-test-1-6xcv4", UID:"eef60fc9-3a8e-11e9-894d-001a4a1601d8", APIVersion:"v1", ResourceVersion:"17303484", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-72440cf7-3a8e-11e9-a47a-001a4a160206" : invalid character 'B' after top-level value
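A note on that unmarshal error (this is my reading of the message, not taken from the driver's code): a flexvolume driver is expected to print a JSON status object on stdout, but here the driver passed through the raw HTTP error body "400 Bad Request". A JSON parser reads "400" as a complete top-level number and then fails on the 'B' that follows. The failure mode can be reproduced without a cluster:

```shell
# The controller tries to JSON-decode whatever the driver printed.
# "400 Bad Request" is not valid JSON, hence the unmarshal error.
out='400 Bad Request'
if printf '%s' "$out" | python3 -m json.tool >/dev/null 2>&1; then
  echo "valid JSON"
else
  echo "not valid JSON: $out"
fi
# prints: not valid JSON: 400 Bad Request
```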
Thanks. I see the error from the api call to ovirt says 'Bad Request' , might be that the node name or id isn't sent to the flexvolume driver request. From the logs I see:
W0227 12:57:34.563550 1 driver-call.go:144] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ovirt~ovirt-flexvolume-driver/ovirt-flexvolume-driver, args: [attach {"kubernetes.io/fsType":"ext4","kubernetes.io/pvOrVolumeName":"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206","kubernetes.io/readwrite":"rw"}
where <NODE_NAME> should be the real node name.
What is the output of:
oc get nodes
journalctl --since="-2h"
The output of oc get nodes matches the node name from the earlier log.
Then the journal from the masters only shows logs from when we deployed the driver, not from the actual attempt to attach a volume:
Feb 27 13:40:17 master-<NAME> origin-node[52498]: I0227 13:40:17.684563 52498 kuberuntime_manager.go:385] No sandbox for pod "ovirt-flexvolume-driver-7ppv8_default(d5e9f9b6-3a8c-11e9-894d-001a4a1601d8)" can be found. Need to start a new one
Feb 27 13:40:18 master-<NAME> origin-node[52498]: I0227 13:40:18.566317 52498 kubelet.go:1865] SyncLoop (PLEG): "ovirt-flexvolume-driver-7ppv8_default(d5e9f9b6-3a8c-11e9-894d-001a4a1601d8)", event: &pleg.PodLifecycleEvent{ID:"d5e9f9b6-3a8c-11e9-894d-001a4a1601d8", Type:"ContainerStarted", Data:"9a378e4c63abd66ab445dd85c7cbe61243e2d6bdf63974445c0d94be5fb25e12"}
Let's see the journal on the node then.
On that node the journal had something in that time range:
Feb 27 13:57:21 <NODE_NAME> origin-node[75113]: E0227 13:57:21.188530 75113 kubelet.go:1599] Unable to mount volumes for pod "ovirt-pv-test-1-6xcv4_sre(eef60fc9-3a8e-11e9-894d-001a4a1601d8)": timeout expired waiting for volumes to attach or mount for pod "sre"/"ovirt-pv-test-1-6xcv4". list of unmounted volumes=[volume-pqxdn]. list of unattached volumes=[volume-pqxdn default-token-qqg4w]; skipping pod
Feb 27 13:57:21 <NODE_NAME> origin-node[75113]: E0227 13:57:21.188639 75113 pod_workers.go:186] Error syncing pod eef60fc9-3a8e-11e9-894d-001a4a1601d8 ("ovirt-pv-test-1-6xcv4_sre(eef60fc9-3a8e-11e9-894d-001a4a1601d8)"), skipping: timeout expired waiting for volumes to attach or mount for pod "sre"/"ovirt-pv-test-1-6xcv4". list of unmounted volumes=[volume-pqxdn]. list of unattached volumes=[volume-pqxdn default-token-qqg4w]
Feb 27 13:57:26 <NODE_NAME> origin-node[75113]: I0227 13:57:26.171808 75113 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-72440cf7-3a8e-11e9-a47a-001a4a160206" (UniqueName: "flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206") pod "ovirt-pv-test-1-6xcv4" (UID: "eef60fc9-3a8e-11e9-894d-001a4a1601d8")
Feb 27 13:57:26 <NODE_NAME> origin-node[75113]: E0227 13:57:26.179196 75113 nestedpendingoperations.go:267] Operation for "\"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\"" failed. No retries permitted until 2019-02-27 13:59:28.179152063 +0100 CET m=+1307744.338005120 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\" (UniqueName: \"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\") pod \"ovirt-pv-test-1-6xcv4\" (UID: \"eef60fc9-3a8e-11e9-894d-001a4a1601d8\") "
Feb 27 13:57:32 <NODE_NAME> origin-node[75113]: W0227 13:57:32.924486 75113 plugin-defaults.go:32] flexVolume driver ovirt/ovirt-flexvolume-driver: using default GetVolumeName for volume pvc-72440cf7-3a8e-11e9-a47a-001a4a160206
Feb 27 13:59:28 <NODE_NAME> origin-node[75113]: I0227 13:59:28.216938 75113 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-72440cf7-3a8e-11e9-a47a-001a4a160206" (UniqueName: "flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206") pod "ovirt-pv-test-1-6xcv4" (UID: "eef60fc9-3a8e-11e9-894d-001a4a1601d8")
Feb 27 13:59:28 <NODE_NAME> origin-node[75113]: E0227 13:59:28.223512 75113 nestedpendingoperations.go:267] Operation for "\"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\"" failed. No retries permitted until 2019-02-27 14:01:30.223468819 +0100 CET m=+1307866.382321865 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\" (UniqueName: \"flexvolume-ovirt/ovirt-flexvolume-driver/pvc-72440cf7-3a8e-11e9-a47a-001a4a160206\") pod \"ovirt-pv-test-1-6xcv4\" (UID: \"eef60fc9-3a8e-11e9-894d-001a4a1601d8\") "
Feb 27 13:59:35 <NODE_NAME> origin-node[75113]: E0227 13:59:35.852597 75113 kubelet.go:1599] Unable to mount volumes for pod "ovirt-pv-test-1-6xcv4_sre(eef60fc9-3a8e-11e9-894d-001a4a1601d8)": timeout expired waiting for volumes to attach or mount for pod "sre"/"ovirt-pv-test-1-6xcv4". list of unmounted volumes=[volume-pqxdn]. list of unattached volumes=[volume-pqxdn default-token-qqg4w]; skipping pod
Feb 27 13:59:35 <NODE_NAME> origin-node[75113]: E0227 13:59:35.852724 75113 pod_workers.go:186] Error syncing pod eef60fc9-3a8e-11e9-894d-001a4a1601d8 ("ovirt-pv-test-1-6xcv4_sre(eef60fc9-3a8e-11e9-894d-001a4a1601d8)"), skipping: timeout expired waiting for volumes to attach or mount for pod "sre"/"ovirt-pv-test-1-6xcv4". list of unmounted volumes=[volume-pqxdn]. list of unattached volumes=[volume-pqxdn default-token-qqg4w]
Feb 27 13:59:50 <NODE_NAME> origin-node[75113]: W0227 13:59:50.981326 75113 plugin-defaults.go:32] flexVolume driver ovirt/ovirt-flexvolume-driver: using default GetVolumeName for volume pvc-72440cf7-3a8e-11e9-a47a-001a4a160206
I suspect your user, theStorageAdmin, doesn't have enough permission on the ovirt-engine side to perform the attach action. Examine /var/log/ovirt-engine/engine.log for that. Judging by the name of the user, you followed the document on how to create the oVirt user, but the permission side may still be missing. As a test, can you try the admin@internal user?
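For reference, a grep along these lines surfaces the relevant entries. The log path is the stock oVirt location, and the sample line below is an abbreviated stand-in so the pattern can be checked offline:

```shell
# Pattern matching oVirt's authorization failures in engine.log.
pattern='No permission found|USER_NOT_AUTHORIZED_TO_PERFORM_ACTION'

# Abbreviated stand-in log line, so the pattern is verifiable offline:
sample="2019-02-27 13:58:25 WARN  Validation of action 'AttachDiskToVm' failed: USER_NOT_AUTHORIZED_TO_PERFORM_ACTION"
printf '%s\n' "$sample" | grep -E "$pattern"

# On the engine host, run it against the real log (default location):
# grep -E "$pattern" /var/log/ovirt-engine/engine.log | tail -n 20
```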
Indeed the oVirt Engine log shows a permission issue:
2019-02-27 13:58:25,895+01 INFO [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-2250) [f2aab36c-407e-4d29-a3e3-8c3ebada4dd1] No permission found for user 'b3665e4b-7e6b-41e6-850b-bcc4147392b5' or one of the groups he is member of, when running action 'AttachDiskToVm', Required permissions are: Action type: 'USER' Action group: 'CONFIGURE_VM_STORAGE' Object type: 'VM' Object ID: 'c7d0b43e-e52a-41e6-a046-109b80eaab67'.
2019-02-27 13:58:25,895+01 WARN [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-2250) [f2aab36c-407e-4d29-a3e3-8c3ebada4dd1] Validation of action 'AttachDiskToVm' failed for user theStorageAdmin@internal-authz. Reasons: VAR__ACTION__ATTACH_ACTION_TO,VAR__TYPE__DISK,USER_NOT_AUTHORIZED_TO_PERFORM_ACTION
2019-02-27 13:58:25,896+01 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-2250) [] Operation Failed: [User is not authorized to perform this action.]
If I look at the allowed actions for the StorageAdmin role, the closest I can find are:
The role is fine, we just need it to be applied on the object which includes the vms and storage domains. In your case you need to login as admin@internal for example to ovirt-engine, go to the relevant cluster and grant 'theStorageAdmin' user the role of storageAdmin on that cluster (the cluster that the vms are running on)
The user already has that role. But is that oVirt cluster specific? If it was missing that role, how else could it be able to create virtual disks in the storage domain?
Three things form a permission: user + role + object. Make sure the role is applied on the data center or cluster for that user. You already had the disk created; now it needs to be attached, and that is a different permission requirement.
The user is already listed under "permissions" for the oVirt data center. Are we talking about permissions as defined in the Ansible playbook:
permissions:
  - state: present
    user_name: theAdmin
    authz_name: internal-authz
    role: StorageAdmin
    object_type: data_center
    object_name: Default
In that context, what is Default? We have one user, theStorageAdmin; it has the StorageAdmin role, and it is assigned to the oVirt data center where our VMs are running. Shouldn't that cover it? At least creating virtual disks in the storage domain works.
OK, I see what is missing: the user needs one more role in order to successfully attach disks to VMs. All you need is to add another permission with the role 'UserVmManager', so the whole permissions snippet should look like this:
permissions:
  - state: present
    user_name: theAdmin
    authz_name: internal-authz
    role: StorageAdmin
    object_type: data_center
    object_name: Default
  - state: present
    user_name: theAdmin
    authz_name: internal-authz
    role: UserVmManager
    object_type: data_center
    object_name: Default
And yes, 'Default' is the name of the data center you will apply the permissions to.
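For completeness, a permissions list like the one above is typically fed to the ovirt_permission Ansible module. A minimal standalone playbook might look like this sketch; the engine URL, credentials, and variable names are placeholders, not taken from this thread:

```yaml
# Sketch only: wires the permissions list above into ovirt_permission.
- hosts: localhost
  connection: local
  gather_facts: false
  vars:
    permissions:
      - state: present
        user_name: theAdmin
        authz_name: internal-authz
        role: StorageAdmin
        object_type: data_center
        object_name: Default
      - state: present
        user_name: theAdmin
        authz_name: internal-authz
        role: UserVmManager
        object_type: data_center
        object_name: Default
  tasks:
    - name: Log in to the engine (placeholder URL and credentials)
      ovirt_auth:
        url: https://engine.example.com/ovirt-engine/api
        username: admin@internal
        password: "{{ engine_password }}"
    - name: Apply each permission
      ovirt_permission:
        auth: "{{ ovirt_auth }}"
        state: "{{ item.state }}"
        user_name: "{{ item.user_name }}"
        authz_name: "{{ item.authz_name }}"
        role: "{{ item.role }}"
        object_type: "{{ item.object_type }}"
        object_name: "{{ item.object_name }}"
      loop: "{{ permissions }}"
```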
I can confirm that adding the UserVmManager permission to the oVirt user has fixed the issue. The volumes are now properly mounting to the pods.
@TFK1410 thank you
Description: A newly created pod with a PV attached doesn't spin up and is stuck in the ContainerCreating state with the FailedAttachVolume event repeating. This issue is similar to issue #100, which was supposed to fix this scenario. Steps To Reproduce
Expected behavior The pod is successfully deployed. Versions: