mrogers950 closed this issue 5 years ago.
I hit this too, and am trying to track down where the `/var/lib/libvirt/images` is coming from. On my host:
```console
$ sudo ls /var/lib/libvirt/images/
$ virsh -c qemu+tcp://192.168.122.1/system pool-list
 Name                 State      Autostart
-------------------------------------------
 default              active     yes

$ virsh -c qemu+tcp://192.168.122.1/system pool-dumpxml default
<pool type='dir'>
  <name>default</name>
  <uuid>c20a2154-aa60-44cf-bf37-cd8b7818a4e4</uuid>
  <capacity unit='bytes'>105554829312</capacity>
  <allocation unit='bytes'>44038131712</allocation>
  <available unit='bytes'>61516697600</available>
  <source>
  </source>
  <target>
    <path>/home/trking/VirtualMachines</path>
    <permissions>
      <mode>0777</mode>
      <owner>114032</owner>
      <group>114032</group>
      <label>system_u:object_r:virt_image_t:s0</label>
    </permissions>
  </target>
</pool>
$ ls /home/trking/VirtualMachines/
bootstrap  bootstrap.ign  coreos_base  master0  master-0.ign  worker.ign
```
In the installer, we have settings for the pool and volume, which we currently hard-code to `default` and `coreos_base`. We set the volume, so hard-coding that shouldn't be a problem. We don't set `pool` when we create the volume, so we get the Terraform provider's default, `default`. So far, so good. We push those values into the cluster since #205 for the machine-config-operator to pick up (openshift/machine-config-operator#47). Our `ImagePool` and `ImageVolume` settings are recent (#271), but the MCO doesn't seem to be looking at either the old `QCOWImagePath` or the new `Image*` properties (at least as of openshift/machine-config-operator@d948fb8baa63a). Then the chain of custody gets fuzzy for me.
On the other end, the `/var/lib/...` path is the actuator default for `baseVolumePath`, but I'm not clear on whether `baseVolumePath` plays into our chain. From the log you posted:

```
I0921 18:24:47.612648 1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-b9f4v for pool default from the base volume /var/lib/libvirt/images/coreos_base
```
we see that by the time we got here, we had `default` as the `poolName` (correct) and `/var/lib/libvirt/images/coreos_base` as the `baseVolumeID` (questionable). Then we look up the pool, and then we die trying to find the volume in that pool. We should be looking up the volume by `coreos_base`; e.g. with `virsh`:

```console
$ virsh -c qemu+tcp://192.168.122.1/system vol-info --vol coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB
```
Part of the confusion is probably that the lookup also works when you happen to use the correct full path:

```console
$ virsh -c qemu+tcp://192.168.122.1/system vol-info --vol /home/trking/VirtualMachines/coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB
```
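As an aside, the bare name can always be recovered from a full volume path, which is part of why a name-based lookup would be the more robust convention. A minimal sketch (this helper is illustrative, not code from the installer or the actuator):

```go
package main

import (
	"fmt"
	"path"
)

// volumeName recovers the bare libvirt volume name from either a bare
// name ("coreos_base") or a full volume path like
// "/home/trking/VirtualMachines/coreos_base".
// Illustrative only: neither the installer nor the actuator has this helper.
func volumeName(baseVolumeID string) string {
	return path.Base(baseVolumeID)
}

func main() {
	fmt.Println(volumeName("coreos_base"))                              // coreos_base
	fmt.Println(volumeName("/home/trking/VirtualMachines/coreos_base")) // coreos_base
}
```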
So can we connect the dots between our config and the busted `baseVolumeID`? The actuator is getting the value from the machine-provider config. Who writes that config? Maybe the machine-API operator, using this template? A short-term patch is probably updating that template to use just `coreos_base`. A long-term fix is probably updating something (that same template?) to use a value pulled (possibly indirectly) from the cluster config the installer is pushing.
Possible fix in openshift/machine-api-operator#70.
I ran into a similar issue with v0.9.1.

As an experiment, I replaced the hardcoded path `/var/lib/libvirt/images` with my storage path (`/home/VMpool`) in https://github.com/openshift/installer/blob/master/pkg/asset/machines/libvirt/machines.go#L74. After that, my worker node image and its ignition file were placed in the storage path:

```console
# ls /home/VMpool
ntest0-base  ntest0-master-0  ntest0-master.ign  ntest0-worker-0-4vlw2  ntest0-worker-0-4vlw2.ignition
```

But the worker node failed to start with this error:

```
W0113 17:45:53.363756 1 controller.go:183] unable to create machine ntest0-worker-0-4vlw2: ntest0/ntest0-worker-0-4vlw2: error creating libvirt machine: error creating domain Failed to setDisks: Can't retrieve volume /var/lib/libvirt/images/ntest0-worker-0-4vlw2
```

Obviously, it expects to see the worker node image `ntest0-worker-0-4vlw2` in `/var/lib/libvirt/images` instead of my storage path `/home/VMpool`... But I cannot find the place where `/var/lib/libvirt/images` is hardcoded or expected as a default path in the installer. Do you have any idea how I can work around this issue?
Thanks!
> But I cannot find the place /var/lib/libvirt/images is hardcoded...

openshift/cluster-api-provider-libvirt#45 (the successor to openshift/machine-api-operator#70 linked above).
Thank you, @wking.
I hope openshift/cluster-api-provider-libvirt#45 is going to be merged and cluster-api-provider-libvirt will be rebuilt soon...
This is still an outstanding issue. I'm using #1371 to bootstrap on libvirt and just hit this storage issue.
Looks like this is the issue: https://github.com/openshift/installer/blob/master/pkg/asset/machines/libvirt/machines.go#L71

```go
Volume: &libvirtprovider.Volume{
	PoolName:     "default",
	BaseVolumeID: fmt.Sprintf("/var/lib/libvirt/images/%s-base", clusterID),
},
```
I.e., when the installer generates the provider spec for machines, it guesses what volume ID libvirt will generate for the base image.
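That guess only matches what libvirt actually generates when the `default` pool's target really is `/var/lib/libvirt/images`. A sketch of the mismatch (the `/home/VMpool` target is taken from the earlier comment; the helper names are mine, not the installer's):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// guessedBaseVolumeID reproduces the installer's hard-coded guess,
// which assumes the default pool lives under /var/lib/libvirt/images.
func guessedBaseVolumeID(clusterID string) string {
	return fmt.Sprintf("/var/lib/libvirt/images/%s-base", clusterID)
}

// actualBaseVolumeID is what libvirt generates: the volume key lives
// under the pool's target path, whatever that happens to be.
func actualBaseVolumeID(poolTarget, clusterID string) string {
	return filepath.Join(poolTarget, clusterID+"-base")
}

func main() {
	guess := guessedBaseVolumeID("ntest0")
	actual := actualBaseVolumeID("/home/VMpool", "ntest0")
	fmt.Println(guess)           // /var/lib/libvirt/images/ntest0-base
	fmt.Println(actual)          // /home/VMpool/ntest0-base
	fmt.Println(guess == actual) // false, so the key-based lookup fails
}
```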
The base image is created here: https://github.com/openshift/installer/blob/master/data/data/libvirt/main.tf#L5

```hcl
module "volume" {
  source = "./volume"

  cluster_id = "${var.cluster_id}"
  image      = "${var.os_image}"
}
```
and the volume ID is referenced as `${module.volume.coreos_base_volume_id}`.

Probably the easiest solution is to allow configuring the volume in the provider spec with a name rather than a volume ID. I.e., right now we require:

```yaml
volume:
  poolName: default
  baseVolumeID: /var/lib/libvirt/images/coreos_base
```

but there's no reason to require the volume key if we just have the pool and the volume name. This should be sufficient:
```yaml
volume:
  poolName: default
  baseVolumeName: coreos_base
```
Of course, the actuator needs a patch to do `virStorageVolLookupByName()` in this case rather than the `virStorageVolLookupByKey()` it does now.
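To sketch what that actuator patch might look like: dispatch on which field is set, preferring the name-based lookup. The struct follows the proposed spec above, but the field and function names are mine; the real actuator would call the libvirt bindings instead of returning a string:

```go
package main

import "fmt"

// volume mirrors the proposed provider-spec shape; BaseVolumeName is
// the suggested new field.
type volume struct {
	PoolName       string
	BaseVolumeID   string // full volume key, e.g. /var/lib/libvirt/images/coreos_base
	BaseVolumeName string // bare name, e.g. coreos_base
}

// lookupMode picks the libvirt call an actuator would use: a name-based
// lookup (virStorageVolLookupByName within PoolName) when BaseVolumeName
// is set, otherwise the current key-based virStorageVolLookupByKey.
func lookupMode(v volume) string {
	if v.BaseVolumeName != "" {
		return "by-name"
	}
	return "by-key"
}

func main() {
	fmt.Println(lookupMode(volume{PoolName: "default", BaseVolumeName: "coreos_base"}))                            // by-name
	fmt.Println(lookupMode(volume{PoolName: "default", BaseVolumeID: "/var/lib/libvirt/images/coreos_base"}))       // by-key
}
```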
> Of course, the actuator needs a patch to do `virStorageVolLookupByName()` in this case rather than the `virStorageVolLookupByKey()` it does now.
Correct. This was done in https://github.com/openshift/cluster-api-provider-libvirt/pull/45 which I've finally rebased and reworked a bit. I'm going to test it today and create a new PR.
Looks like the fix is now in openshift/cluster-api-provider-libvirt#144. I'm re-testing on my local environment off master.
@steven-ellis the libvirt actuator bit, yes, but the installer part is still not merged because CI is flaky: https://github.com/openshift/installer/pull/1628
If your libvirt default storage pool is not `/var/lib/libvirt/images`, then the libvirt-machine-controller fails to create the workers.

I worked around it with a bind mount. If it's not configurable, it would be handy if it were. (Also, there's a typo, "Coud", in the error message.)