openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

libvirt: can't create workers with non-default storage path #308

Closed mrogers950 closed 5 years ago

mrogers950 commented 6 years ago

If your libvirt default storage pool's path is not /var/lib/libvirt/images, then the libvirt-machine-controller fails to create the workers:

$ oc logs pod/clusterapi-controllers-85f6bfd9d5-6rbb8 -n openshift-cluster-api -c libvirt-machine-controller
...
I0921 18:24:47.612590       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:47.612648       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-b9f4v for pool default from the base volume /var/lib/libvirt/images/coreos_base
E0921 18:24:47.614725       1 actuator.go:50] Coud not create libvirt machine: error creating volume: Can't retrieve volume /var/lib/libvirt/images/coreos_base
I0921 18:24:48.016159       1 controller.go:79] Running reconcile Machine for worker-2fp6s
I0921 18:24:48.023462       1 actuator.go:70] Checking if machine worker-2fp6s for cluster dev exists.
I0921 18:24:48.023638       1 logs.go:41] [DEBUG] Check if a domain exists
I0921 18:24:48.029976       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:48.030852       1 controller.go:123] reconciling machine object worker-2fp6s triggers idempotent create.
I0921 18:24:48.033465       1 actuator.go:46] Creating machine "worker-2fp6s" for cluster "dev".
I0921 18:24:48.036047       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:48.036107       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-2fp6s for pool default from the base volume /var/lib/libvirt/images/coreos_base
E0921 18:24:48.038373       1 actuator.go:50] Coud not create libvirt machine: error creating volume: Can't retrieve volume /var/lib/libvirt/images/coreos_base

I worked around it with a bind mount. If the path isn't configurable, it would be handy if it were. (Also, there's a typo, "Coud", in the error message.)

wking commented 6 years ago

I hit this too, and am trying to track down where the /var/lib/libvirt/images path is coming from. On my host:

$ sudo ls /var/lib/libvirt/images/
$ virsh -c qemu+tcp://192.168.122.1/system pool-list
 Name                 State      Autostart 
-------------------------------------------
 default              active     yes       

$ virsh -c qemu+tcp://192.168.122.1/system pool-dumpxml default
<pool type='dir'>
  <name>default</name>
  <uuid>c20a2154-aa60-44cf-bf37-cd8b7818a4e4</uuid>
  <capacity unit='bytes'>105554829312</capacity>
  <allocation unit='bytes'>44038131712</allocation>
  <available unit='bytes'>61516697600</available>
  <source>
  </source>
  <target>
    <path>/home/trking/VirtualMachines</path>
    <permissions>
      <mode>0777</mode>
      <owner>114032</owner>
      <group>114032</group>
      <label>system_u:object_r:virt_image_t:s0</label>
    </permissions>
  </target>
</pool>

$ ls /home/trking/VirtualMachines/
bootstrap  bootstrap.ign  coreos_base  master0  master-0.ign  worker.ign

In the installer, we have settings for the pool and volume, which we currently hard-code to default and coreos_base. We set the volume name, so hard-coding that shouldn't be a problem. We don't set a pool when we create the volume, so we get the Terraform provider's default, which is the pool named default. So far, so good. Since #205 we push those values into the cluster for the machine-config-operator to pick up (openshift/machine-config-operator#47). Our ImagePool and ImageVolume settings are recent (#271), but the MCO doesn't seem to be looking at either the old QCOWImagePath or the new Image* properties (at least as of openshift/machine-config-operator@d948fb8baa63a).

Then the chain of custody gets fuzzy for me.

On the other end, the /var/lib/... path is the actuator default for baseVolumePath, but I'm not clear on whether baseVolumePath plays into our chain.

From your logs:

I0921 18:24:47.612648       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-b9f4v for pool default from the base volume /var/lib/libvirt/images/coreos_base

we see that by the time we got here, we had default as the poolName (correct) and /var/lib/libvirt/images/coreos_base as the baseVolumeID (questionable). Then we look up the pool, and then we die trying to find the volume in that pool. We should be looking up the volume by its name, coreos_base; e.g. with virsh:

$ virsh -c qemu+tcp://192.168.122.1/system vol-info --vol coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB

The issue is probably that the lookup also works when you happen to pass the correct full path:

$ virsh -c qemu+tcp://192.168.122.1/system vol-info --vol /home/trking/VirtualMachines/coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB

So can we connect the dots between our config and the busted baseVolumeID? The actuator is getting the value from machine-provider config. Who writes that config? Maybe the machine-API operator using this template? A short-term patch is probably updating that template to use just coreos_base. A long-term fix is probably updating something (that same template?) to use a value pulled (possibly indirectly) from the cluster config the installer is pushing.

wking commented 6 years ago

Possible fix in openshift/machine-api-operator#70.

nhosoi commented 5 years ago

I ran into a similar issue with v0.9.1.

As an experiment, I replaced the hardcoded path "/var/lib/libvirt/images" here with my storage path (/home/VMpool): https://github.com/openshift/installer/blob/master/pkg/asset/machines/libvirt/machines.go#L74

With that change, my worker node image and its ignition file were placed in the storage path:

# ls /home/VMpool
ntest0-base  ntest0-master-0  ntest0-master.ign  ntest0-worker-0-4vlw2  ntest0-worker-0-4vlw2.ignition

But the worker node failed to start with this error:

W0113 17:45:53.363756 1 controller.go:183] unable to create machine ntest0-worker-0-4vlw2: ntest0/ntest0-worker-0-4vlw2: error creating libvirt machine: error creating domain Failed to setDisks: Can't retrieve volume /var/lib/libvirt/images/ntest0-worker-0-4vlw2

Obviously, it expects to find the worker node image ntest0-worker-0-4vlw2 in /var/lib/libvirt/images instead of my storage path /home/VMpool... But I cannot find where /var/lib/libvirt/images is hardcoded or expected as a default path in the installer. Do you have any idea how I can work around this issue?

Thanks!

wking commented 5 years ago

But I cannot find where /var/lib/libvirt/images is hardcoded...

openshift/cluster-api-provider-libvirt#45 (the successor to openshift/machine-api-operator#70 linked above).

nhosoi commented 5 years ago

Thank you, @wking.

I hope openshift/cluster-api-provider-libvirt#45 is going to be merged and cluster-api-provider-libvirt will be rebuilt soon...

steven-ellis commented 5 years ago

This is still an outstanding issue. I'm using #1371 to bootstrap on libvirt and just hit this storage problem.

markmc commented 5 years ago

Looks like this is the issue: https://github.com/openshift/installer/blob/master/pkg/asset/machines/libvirt/machines.go#L71

Volume: &libvirtprovider.Volume{
    PoolName:     "default",
    BaseVolumeID: fmt.Sprintf("/var/lib/libvirt/images/%s-base", clusterID),
},

i.e. when the installer generates the provider spec for machines, it guesses what the volume ID generated by libvirt for the base image will be.

The base image is created here: https://github.com/openshift/installer/blob/master/data/data/libvirt/main.tf#L5

module "volume" {
  source = "./volume"

  cluster_id = "${var.cluster_id}"
  image      = "${var.os_image}"
}

and the volume ID is referenced as ${module.volume.coreos_base_volume_id}.
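
To see the mismatch concretely, here is a small diagnostic sketch (not installer code; it assumes the libvirt-go bindings, the qemu+tcp endpoint used earlier in this thread, and the default pool / coreos_base names) that prints the key libvirt actually assigned to the base volume. On a host whose default pool targets a non-standard directory, that key is a path under the pool's target, not the guessed /var/lib/libvirt/images one:

package main

import (
    "fmt"
    "log"

    libvirt "github.com/libvirt/libvirt-go"
)

func main() {
    // Same libvirt endpoint used with virsh earlier in this thread.
    conn, err := libvirt.NewConnect("qemu+tcp://192.168.122.1/system")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Pool and base-volume names the installer currently hard-codes.
    pool, err := conn.LookupStoragePoolByName("default")
    if err != nil {
        log.Fatal(err)
    }
    defer pool.Free()

    vol, err := pool.LookupStorageVolByName("coreos_base")
    if err != nil {
        log.Fatal(err)
    }
    defer vol.Free()

    // For dir-backed pools the key is the absolute path under the pool's
    // target directory (e.g. /home/trking/VirtualMachines/coreos_base above),
    // not /var/lib/libvirt/images/coreos_base.
    key, err := vol.GetKey()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("actual base volume key:", key)
}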

Probably the easiest solution is to allow configuring the volume in the provider spec with a name rather than a volume ID.

i.e. right now we require:

      volume:
        poolName: default
        baseVolumeID: /var/lib/libvirt/images/coreos_base

but there's no reason to require the volume key if we just have the pool and the volume name. This should be sufficient:

      volume:
        poolName: default
        baseVolumeName: coreos_base

Of course, the actuator needs a patch to do virStorageVolLookupByName() in this case rather than the virStorageVolLookupByKey() it does now.
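
For illustration, a minimal sketch of what that actuator-side change could look like with the libvirt-go bindings (the function and parameter names here are invented for the example, not the provider's actual API):

package provider // illustrative package name only

import (
    "fmt"

    libvirt "github.com/libvirt/libvirt-go"
)

// lookupBaseVolume resolves the base volume by pool-relative name when one is
// given (virStorageVolLookupByName) and otherwise falls back to the current
// behaviour of looking it up by key, i.e. the absolute path for dir pools
// (virStorageVolLookupByKey).
func lookupBaseVolume(conn *libvirt.Connect, poolName, baseVolumeName, baseVolumeID string) (*libvirt.StorageVol, error) {
    if baseVolumeName != "" {
        pool, err := conn.LookupStoragePoolByName(poolName)
        if err != nil {
            return nil, fmt.Errorf("can't find storage pool %q: %v", poolName, err)
        }
        defer pool.Free()
        // The name is resolved relative to the pool, so the pool's target
        // directory no longer matters.
        return pool.LookupStorageVolByName(baseVolumeName)
    }
    // Backwards-compatible path: treat baseVolumeID as the full volume key.
    return conn.LookupStorageVolByKey(baseVolumeID)
}

With something like that in place, the installer could hand the provider a plain name such as coreos_base and stop guessing the pool's target path.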

zeenix commented 5 years ago

Of course, the actuator needs a patch to do virStorageVolLookupByName() in this case rather than the virStorageVolLookupByKey() it does now.

Correct. This was done in https://github.com/openshift/cluster-api-provider-libvirt/pull/45, which I've finally rebased and reworked a bit. I'm going to test it today and create a new PR.

steven-ellis commented 5 years ago

Looks like the fix is now in openshift/cluster-api-provider-libvirt#144. I'm re-testing on my local environment off master.

zeenix commented 5 years ago

@steven-ellis the libvirt actuator bit, yes, but the installer part is still not merged because CI is flaky: https://github.com/openshift/installer/pull/1628