oVirt / ovirt-openshift-extensions

Implementation of flexvolume driver and provisioner for oVirt
Apache License 2.0

Disk mounted in wrong VM (Atomic Host/Openshift 3.11) #106

Closed danragnar closed 5 years ago

danragnar commented 5 years ago

Description

The disk is created fine, but when the pod that should use the disk/volume is created, the disk gets attached to the wrong VM (the master instead of the node). FlexVolume is installed on all nodes and seems to propagate to the containerized kubelet, and the VM ID in the flexvolume config is consistent with the VM ID in oVirt.

Steps To Reproduce

  1. Install OpenShift/OKD 3.11 on RHEL Atomic Host.
  2. Modify the paths to the flexvolume driver on Atomic Host (/etc/origin/kubelet-plugins/...) in the repo and build the containers locally. Push the flexvolume driver to the OpenShift registry and deploy with a locally modified APB Docker image that fixes the Docker image location and the paths for the flexvolume driver.
  3. Create a Storage Class according to the example.
  4. Create a PVC according to the example (the disk is created fine).
  5. Create a Pod according to the example. The disk is mounted on the elected master (I think) instead of the node where the pod is running.

Expected behavior

The disk should be mounted on the correct VM.

Versions:

Logs: Master

2019-02-07 15:47:32,183+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (default task-51) [25276cec-04fc-4c94-b00a-77a8fdb52140] FINISH, HotPlugDiskVDSCommand, log id: 2743ef58
2019-02-07 15:47:32,203+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-51) [25276cec-04fc-4c94-b00a-77a8fdb52140] EVENT_ID: USER_ATTACH_DISK_TO_VM(2,016), Disk pvc-143c07fb-2ae7-11e9-93d9-001a4a160194 was successfully attached to VM ocp-master-01.domain.name by admin@internal-authz.
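When checking which VM a PVC disk actually landed on, it can help to pull the disk and VM names out of the USER_ATTACH_DISK_TO_VM audit lines. The following is a minimal sketch, assuming the message wording matches the log line above; the function name `ParseAttachEvent` is made up for illustration and is not part of the driver.

```go
package main

import (
	"fmt"
	"regexp"
)

// attachEvent matches the wording of the USER_ATTACH_DISK_TO_VM audit
// message as it appears in the log excerpt above (assumed to be stable).
var attachEvent = regexp.MustCompile(`Disk (\S+) was successfully attached to VM (\S+) by`)

// ParseAttachEvent returns the disk name and VM name from one log line.
// ok is false when the line is not an attach event.
func ParseAttachEvent(line string) (disk, vm string, ok bool) {
	m := attachEvent.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	line := "EVENT_ID: USER_ATTACH_DISK_TO_VM(2,016), Disk pvc-143c07fb-2ae7-11e9-93d9-001a4a160194 " +
		"was successfully attached to VM ocp-master-01.domain.name by admin@internal-authz."
	if disk, vm, ok := ParseAttachEvent(line); ok {
		fmt.Printf("disk %s -> VM %s\n", disk, vm)
	}
}
```

Running this over the engine log and comparing the VM name with the node the pod was scheduled on makes a mismatch like the one reported here easy to spot.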

rgolangh commented 5 years ago

First, I see that I didn't check whether the disk result came back empty (ovirt-flexdriver.go:258), and that explains why the isAttach call panics.
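The kind of guard described here can be sketched as follows. This is not the code at ovirt-flexdriver.go:258; the `Disk` type, `ErrDiskNotFound`, and `firstDisk` are hypothetical names used only to show the pattern of checking for an empty lookup result before indexing into it.

```go
package main

import (
	"errors"
	"fmt"
)

// Disk is a hypothetical stand-in for whatever the oVirt disk lookup returns.
type Disk struct {
	ID   string
	Name string
}

// ErrDiskNotFound is returned when the lookup yields no disks.
var ErrDiskNotFound = errors.New("disk not found")

// firstDisk returns the first match, or an error when the result is empty.
// Indexing disks[0] without this check panics on an empty slice.
func firstDisk(disks []Disk) (Disk, error) {
	if len(disks) == 0 {
		return Disk{}, ErrDiskNotFound
	}
	return disks[0], nil
}

func main() {
	if _, err := firstDisk(nil); err != nil {
		fmt.Println("lookup failed:", err)
	}
}
```

Returning an error instead of panicking lets the caller report a clean failure back through the flexvolume call-out.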

danragnar commented 5 years ago

Hmm, yeah, sorry about the PVC inconsistency. I have debugged a lot, so the logs may contain several different deployment attempts. The behavior is consistent, however.

I have tried multiple versions of the driver, from older versions where the name of the VM is used instead of the ID to the latest build. Same problem. I have now reinstalled my cluster on regular RHEL 7.6 VMs, and now it works. I specifically installed OpenShift to give this project a try and thought I would give Atomic a shot since the OpenShift install seemed a lot easier.

If you want to continue troubleshooting, I can bring up a new cluster based on Atomic. Otherwise, feel free to close this issue. It works as expected on regular RHEL.

rgolangh commented 5 years ago

Can you verify that OpenShift made the call to flexvolume on the node and not on the master? It's OpenShift's responsibility to call out, i.e. to execute the 'attach' command. Also, you did make sure that whatever ovirtVmId you had in ovirt-flexvolume-driver.conf matched the VM it was deployed on, right? I assume the fact that you used Atomic was the reason you had to deploy the driver under /etc?

P.S. Thanks a lot for reporting that.

danragnar commented 5 years ago

Well, both the master and the node receive the call, but when the node tries to mount the device on the system it isn't there, as it is mounted on the master. The IDs and virtual machine names were consistent with oVirt. Yes, exactly: I deployed it with a modified APB that mounts /etc/origin/kubelet-plugins/volume/exec/ at the "regular" path inside the driver container, so the driver ends up in the correct directory for the kubelet.

As I said, I took down the environment and got it working on regular RHEL, which I'll be happy to continue with. I saw that an earlier issue reported very similar problems (though not on OKD/Origin 3.11). Would you be able to try to reproduce the fault on Atomic on your end?

rgolangh commented 5 years ago

Well, both the master and the node receive the call, but when the node tries to mount the device on the system it isn't there, as it is mounted on the master.

If the master got the attach call-out, then this is not a good thing and is probably an OpenShift bug. I'd check the pod logs of master-controllers-XYZ in the kube-system namespace. As far as I remember, the attach should be called on the node that runs the pod. If that's not the case, then that's my bug, because I extract the VM ID from the underlying system at the time of the call.
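The FlexVolume attach call-out receives the target node name as an argument, so one way to avoid depending on which host executes the call is to resolve the VM from that node name rather than from the local system. The sketch below illustrates that idea only; it is not confirmed to be how the driver was fixed, and `resolveTargetVM` and the `lookup` helper are hypothetical.

```go
package main

import (
	"fmt"
)

// resolveTargetVM decides which VM a disk should be attached to.
// nodeName is the node argument passed to the attach call-out.
// localVMID is the ovirtVmId configured on the host running the driver.
// lookup is a hypothetical helper that maps a node name to an oVirt VM ID.
func resolveTargetVM(nodeName, localVMID string, lookup func(string) (string, bool)) (string, error) {
	if nodeName == "" {
		// No node name given: fall back to the VM the driver runs on.
		return localVMID, nil
	}
	if id, ok := lookup(nodeName); ok {
		return id, nil
	}
	return "", fmt.Errorf("no oVirt VM found for node %q", nodeName)
}

func main() {
	// Illustrative node-to-VM mapping; real data would come from oVirt.
	vms := map[string]string{"ocp-node-01": "vm-node-01", "ocp-master-01": "vm-master-01"}
	lookup := func(n string) (string, bool) { id, ok := vms[n]; return id, ok }

	// Even if the call runs on the master (localVMID), the node's VM is chosen.
	id, err := resolveTargetVM("ocp-node-01", "vm-master-01", lookup)
	fmt.Println(id, err)
}
```

With this approach the attach lands on the pod's node even when the call-out itself is executed on a master.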

danragnar commented 5 years ago

I'm not 100% sure that both get the attach call. I can't check anymore since I don't have the environment anymore. I'll close the issue. If I give atomic another shot and experience the same issues, I'll open a new one and reference this one. Thanks for the help!

levindecaro commented 5 years ago

Hi, I have exactly the same issue, but on CentOS Linux with OKD 3.11. Everything is freshly installed and the vdisks were created by the provisioner. However, the PVC disk gets attached to the Origin master node when the pod is launching. As a result, the node the pod is actually assigned to stays pending on the volume and the mount ultimately fails. I noticed that the oVirt logs show a few attach/detach vdisk operations for that PVC against master node 1.

I checked that ovirtVmId is correct on every node in /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ovirt~ovirt-flexvolume-driver/ovirt-flexvolume-driver.conf

rgolangh commented 5 years ago

@levindecaro can you get the logs of the master-controllers-xyz pod in the kube-system namespace?

something like:

oc logs -n kube-system pods/master-controllers-$(hostname)

levindecaro commented 5 years ago

Here you are, thanks

oc1.log oc2.log oc3.log

rgolangh commented 5 years ago

@danragnar @levindecaro a fix is pushed but not merged yet. The CI will build a test container that you can use to test the fix in your environment (I'll paste the link as soon as it's ready).

levindecaro commented 5 years ago

@rgolangh brilliant, will test it asap.

rgolangh commented 5 years ago

I made another iteration so that it works with the default Kubernetes and default OpenShift configurations. Check out quay.io[1] for the latest tag. You should have one in ~30 minutes from now.

You can also follow the pull request conversation for updates.

[1] https://quay.io/repository/rgolangh/ovirt-flexvolume-driver?tab=tags

levindecaro commented 5 years ago

Problem resolved. Thank you.