the installer tries to use cloud-init
We don't use cloud-init
I've manually set the VMs to use Red Hat CoreOS
Why? It won't boot since it won't accept Ignition spec 3
Wow, so quick!
The terraform files that the installer creates do not include
os {
type = "rhcos_x64"
}
as shown here: https://github.com/openshift/installer/blob/master/data/data/ovirt/template/main.tf#L52, and so the oVirt VMs created default to
<os>
<type>other</type>
</os>
which results in cloud-init being the only option for adding an initialization.custom_script.
If I leave it as other, then the boot fails with this message:
ignition[802]: error at line 1 col 2: invalid character '#' looking for beginning of value
I just figured that was an issue with 4.4.0-0.okd-2020-05-23-055148-beta5 being out of sync with the https://github.com/openshift/installer repo. But I realize now that may be a wrong assumption. I took a look at the terraform files generated in /tmp/ and saw those lines missing...
When I do change the oVirt VMs to RHEL CoreOS, Ignition becomes available for the init script and the machines boot successfully.
Seems to be an oVirt bug, not sure why it won't let Fedora CoreOS use user-data as Ignition.
@rgolangh any ideas what's happening here?
Am I incorrect in thinking the VMs should launch as RHCOS? I'm a little lost on how the sources for the OCP and OKD installers differ.
I remembered where I read about selecting RHCOS:
In 4.4, the only way to enable Ignition for a VM is by selecting the Red Hat CoreOS operating system type. As a result, the Initial Boot section will change to Ignition and show only the options available for Ignition. It is not required to insert a custom script if it is not needed. An Ignition version will be automatically added by the engine.
@jam01 you are spot on, the template should be created with os type rhcos_x64, which will make the VMs created from it RHCOS type, and that in turn instructs oVirt to pass the user data as Ignition.
The 'fcos' branches on openshift/installer and openshift/cluster-api-provider-ovirt need cherry-picks. @vrutkovs is there a periodic rebase/cherry-pick, or do we need to do it manually?
Manual rebases/cherrypicks are required. We're going to switch to release-4.5 very soon
OKD 4.5 nightlies should have a fix for that, could you give it a try?
So I can report that RHCOS is correctly selected, and VMs are booting from Ignition correctly. However, it seems ovirt.cpu.cores is not being set correctly, and it's preventing VMs from being provisioned in my current setup. E.g. I set up 4 cores for masters, but oVirt shows the created VMs as wanting 8 and therefore they're not starting...
Will dig in a little more and see if the terraform files are indeed missing the cores fields; if not, there may be other oVirt cherry-picks necessary.
I can see that Terraform is set up to use the cores given in the config, which is 4, yet in oVirt the VMs are created with a value of 8... I've manually set them to 4 and ran wait-for install-complete to see if it does complete.
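For reference, a minimal sketch of where that cores value lives in install-config.yaml; the exact ovirt machine-pool layout shown here is an assumption based on the fields discussed in this thread, and everything other than cores: 4 is illustrative:

controlPlane:
  name: master
  platform:
    ovirt:
      cpu:
        cores: 4      # value Terraform picks up; oVirt nevertheless created the VMs with 8
        sockets: 1    # illustrative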
@rgolangh is there any way to see in oVirt what the API request was? I found the location for logs but it only has the HTTP method and URI.
The issue with the cores was on my end. The install is not completing however. I'll round up the logs and report back.
It seems local dns resolver is incorrectly prepended - https://github.com/openshift/installer/pull/3782 would fix that
Some updates creating a 3 masters and 2 workers cluster with 4.5.0-0.okd-2020-06-26-184819:
With the default networkType, the masters get stuck trying to fetch config from the bootstrap. The error is Network unreachable; however, if I try to fetch the config myself I get a 500 back. With OVNKubernetes the setup moves on normally (install-config fragment below).
Workers sometimes get stuck as Provisioned and never move to Provisioned as Node; removing that machine and letting the machine set provision another usually helps, but not always.
At this point I think the original issue, the one I opened this ticket about, is rooted either in the default network type and/or in the fact that I tried to create a single-master cluster (I did see some errors from etcd somewhere about needing 3 replicas). I'm happy to try to gather the logs for the networkType issue, the provisioned workers, or the single master using this version at some point (unless single master is explicitly not supported and I missed that). But I'll take advantage of the working setup for now to run some experiments I'm working on. I'm actually preparing to apply for a position at RH :o
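For reference, the network type override mentioned above lives under networking in install-config.yaml; a minimal sketch (the CIDR values are common defaults shown only for context, not taken from this cluster):

networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  networkType: OVNKubernetes   # overriding the default network type, which hit the bootstrap fetch issue above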
The oVirt - CoreOS issues do seem resolved :)
Let's close this and file more specific bugs.
Fair enough, thanks @vrutkovs
Per https://github.com/openshift/okd/issues/174#issuecomment-630832043 opening a new issue...
Describe the bug
It seems the worker nodes are not able to retrieve their ignition configs. This appears to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1812409 and/or https://github.com/openshift/okd/issues/174, although unlike https://github.com/openshift/okd/issues/174#issuecomment-630416683, the notice that the bootstrap can now be removed is not present.
The message on the worker is:
Version
okd: 4.4.0-0.okd-2020-05-23-055148-beta5
provider: oVirt 4.4.0
How reproducible
100% reproducible on a clean install of oVirt 4.4.0, launching the installer for 1 master and 2 workers (replica counts sketched below).
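A minimal sketch of the replica counts used for this reproduction (only these keys are shown; the rest of install-config.yaml is omitted):

controlPlane:
  name: master
  replicas: 1
compute:
- name: worker
  replicas: 2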
Though I think they're unrelated, there are a couple of things I should point out:
The installer launches Fedora CoreOS 31, which uses Ignition, but the installer tries to use cloud-init. I've manually set the VMs to use Red Hat CoreOS so that Ignition works correctly. (That issue only took me about a whole day to figure out.) See: https://github.com/openshift/okd/issues/127 and https://github.com/openshift/installer/commit/3fbdaf41c76f65ab549e44473500e25acaff9c51
The installer also does not yet support platform.ovirt.osDisk.sizeGB and other properties from the install-config.yaml, so I had to set that up manually as well (this one also took me a while; the disk was getting full). The property is sketched below. See: https://github.com/openshift/installer/commit/5d0b06afb924d7c69c44006c3d299cde8e93a6ad
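For reference, the unsupported property as it would appear in install-config.yaml (layout per the ovirt machine-pool fields referenced above; the size value is illustrative):

compute:
- name: worker
  platform:
    ovirt:
      osDisk:
        sizeGB: 120   # not yet honored by this installer version, so the disk had to be resized manually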
Log bundle
https://drive.google.com/uc?id=1vyAbVfz1EnZ4RWVg-45y3tAboyHo589Y&export=download
The zipped file contains the log bundle, the logs from the installer itself, the result from must-gather (404 Not Found), and the install-config.yaml.
If there's more data I can provide let me know. I tried to go through docs and existing issues to provide all that I could.