okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Bootstrap not tearing down, workers can't retrieve config #209

Closed jam01 closed 4 years ago

jam01 commented 4 years ago

Per https://github.com/openshift/okd/issues/174#issuecomment-630832043 opening a new issue...

Describe the bug

It seems the worker nodes are not able to retrieve their ignition configs. It seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1812409 and/or https://github.com/openshift/okd/issues/174 although unlike https://github.com/openshift/okd/issues/174#issuecomment-630416683, the notice that bootstrap can now be removed is not present.

The message on the worker is:

ignition[587] : GET result: Internal Server Error
ignition[587] : GET https://192.168.1.51:22623/config/worker: attempt #1014
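To reproduce that request from another machine on the same network, something like the following may help. It is only a diagnostic sketch: the host and port come from the log above, while `mcs_config_url` and `probe` are hypothetical helper names, and certificate verification is disabled on the assumption that the endpoint's certificate is not trusted by the machine running the check.

```python
import ssl
import urllib.error
import urllib.request

MCS_PORT = 22623  # port seen in the Ignition log above


def mcs_config_url(host: str, role: str = "worker") -> str:
    """Build the URL that Ignition is polling, e.g. .../config/worker."""
    return f"https://{host}:{MCS_PORT}/config/{role}"


def probe(url: str, timeout: float = 5.0) -> str:
    """Issue one GET and return a short description of the outcome."""
    # Assumption: the endpoint's certificate is not trusted by this machine,
    # so verification is disabled for this diagnostic request only.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # A 500 here would match the "Internal Server Error" in the log.
        return f"HTTP {err.code} {err.reason}"
    except (urllib.error.URLError, OSError) as err:
        return f"no response: {err}"
```

Calling `probe(mcs_config_url("192.168.1.51"))` would show whether the 500 is also returned outside of the worker's early boot environment.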

Version

okd: 4.4.0-0.okd-2020-05-23-055148-beta5
provider: oVirt 4.4.0

How reproducible

100% reproducible on a clean install of oVirt 4.4.0 when launching the installer for 1 master and 2 workers.

Though I think it's unrelated there's a couple things I should point out:

Log bundle

https://drive.google.com/uc?id=1vyAbVfz1EnZ4RWVg-45y3tAboyHo589Y&export=download

The zip file contains the log bundle, the logs from the installer itself, the result from must-gather (404 Not Found), and the install-config.yaml.


If there's more data I can provide let me know. I tried to go through docs and existing issues to provide all that I could.

vrutkovs commented 4 years ago

> the installer tries to use cloud-init

We don't use cloud-init

> I've manually set the VMs to use Red Hat CoreOS

Why? It won't boot since it won't accept Ignition spec 3

jam01 commented 4 years ago

Wow, so quick!

The terraform files that the installer creates do not include

  os {
    type = "rhcos_x64"
  }

as shown here: https://github.com/openshift/installer/blob/master/data/data/ovirt/template/main.tf#L52, and so the oVirt VMs created default to

        <os>
            <type>other</type>
        </os>

which results in cloud-init being the only option for adding an initialization.custom_script. If I leave it as other, the boot fails with this message:

ignition[802]: error at line 1 col 2: invalid character '#' looking for beginning of value
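One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: Ignition expects a JSON document, and a cloud-init style payload usually begins with a `#cloud-config` line, which a JSON parser rejects at the very first character. The sketch below uses a hypothetical helper, `classify_user_data`, to tell the two kinds of payload apart.

```python
import json


def classify_user_data(user_data: str) -> str:
    """Return 'ignition', 'cloud-init', or 'unknown' for a user-data payload."""
    try:
        doc = json.loads(user_data)
    except json.JSONDecodeError:
        # Not JSON at all; cloud-config documents start with a '#' header line.
        if user_data.lstrip().startswith("#cloud-config"):
            return "cloud-init"
        return "unknown"
    # Ignition configs are JSON objects with a top-level "ignition" key.
    if isinstance(doc, dict) and "ignition" in doc:
        return "ignition"
    return "unknown"
```

Running this against the user data actually attached to the VM would show which kind of payload was delivered.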

I just figured that was an issue with 4.4.0-0.okd-2020-05-23-055148-beta5 being out of sync with the https://github.com/openshift/installer repo. But I realize now that may be a wrong assumption. I took a look at the terraform files generated in /tmp/ and saw those lines missing...
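A quick way to check generated Terraform text for that block could look like the sketch below. It is a naive regular-expression match, not an HCL parser, and `find_os_type` is a hypothetical helper name; it only handles an `os` block that contains a single `type` attribute, as in the snippet above.

```python
import re

# Matches an `os { type = "..." }` block and captures the type value.
OS_BLOCK = re.compile(r'\bos\s*\{\s*type\s*=\s*"([^"]+)"\s*\}')


def find_os_type(terraform_text: str):
    """Return the os type declared in the text, or None if no os block exists."""
    match = OS_BLOCK.search(terraform_text)
    return match.group(1) if match else None
```

A result of `None` for the generated file would confirm that the block is missing, in which case oVirt falls back to its default OS type.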

When I do change the oVirt VMs to RHCOS, Ignition becomes available for the init script and the machines boot successfully.

vrutkovs commented 4 years ago

Seems to be an oVirt bug, not sure why it won't let Fedora CoreOS use user-data as Ignition.

@rgolangh any ideas what's happening here?

jam01 commented 4 years ago

Am I incorrect in thinking the VMs should launch as RHCOS? I'm a little lost on how the source for OCP and OKD installers differ.

jam01 commented 4 years ago

I remembered where I read about selecting RHCOS

https://www.ovirt.org/develop/release-management/features/virt/coreos-ignition-support.html#user-work-flows

In 4.4, the only way to enable ignition to a VM is by selecting the RedHat CoreOS Operation System Type. As a result, the section of Initial Boot will changed to ignition and show only the available options for ignition. It is not required to insert custom script if it is not needed. A ignition version will be automatically added from the engine.

rgolangh commented 4 years ago

@jam01 you are spot on, the template should be created with os type rhcos_x64, which makes the VMs RHCOS type, and that in turn causes the user data to be passed as Ignition.

The 'fcos' branches on openshift/installer and openshift/cluster-api-provider-ovirt need cherry-picks. @vrutkovs is there a periodic rebase/cherry-pick, or do we need to do it manually?

vrutkovs commented 4 years ago

Manual rebases/cherrypicks are required. We're going to switch to release-4.5 very soon

vrutkovs commented 4 years ago

OKD 4.5 nightlies should have a fix for that, could you give it a try?

jam01 commented 4 years ago

So I can report that RHCOS is correctly selected, and VMs are booting from Ignition correctly. However, it seems ovirt.cpu.cores is not being set correctly, and that is preventing VMs from being provisioned in my current setup. E.g., I set up 4 cores for masters, but oVirt shows the created VMs as wanting 8, so they're not starting...

I'll dig in a little more and see whether the Terraform files are indeed missing the cores fields; if not, other oVirt cherry-picks may be necessary.

jam01 commented 4 years ago

I can see that Terraform is set up to use the cores given in the config, which is 4, yet in oVirt the VMs are created with a value of 8... I've manually set them to 4 and run a wait-for install-complete to see whether it completes.

@rgolangh is there any way to see in oVirt what the API request was? I found the location for logs but it only has the HTTP method and URI.

The issue with the cores was on my end. The install is not completing however. I'll round up the logs and report back.

vrutkovs commented 4 years ago

It seems the local DNS resolver is incorrectly prepended - https://github.com/openshift/installer/pull/3782 would fix that

jam01 commented 4 years ago

Some updates after creating a cluster with 3 masters and 2 workers using 4.5.0-0.okd-2020-06-26-184819:

At this point I think the original issue, the one I opened this ticket about, is rooted in the default network type and/or in my attempt to create a single-master cluster (I did see some errors from etcd somewhere about needing 3 replicas). I'm happy to try to gather the logs for the networkType issue, the provisioned workers, or the single master using this version at some point (unless single master is explicitly not supported and I missed that). But I'll take advantage of the working setup for now to run some experiments I'm working on. I'm actually preparing to apply for a position at RH :o

The oVirt - CoreOS issues do seem resolved :)

vrutkovs commented 4 years ago

Let's close this and file more specific bugs.

jam01 commented 4 years ago

Fair enough, thanks @vrutkovs