oVirt IPI - Githubissues

bsperduto commented 4 years ago

Just a quick update I did attempt to test the oVirt IPI of OKD using the directions at https://github.com/openshift/installer/blob/master/docs/user/ovirt/install_ipi.md. Unfortunately the OKD installer does not currently appear to have oVirt as an available platform. Up to that point those directions do work in my testing. I have not attempted a UPI type install yet.

vrutkovs commented 4 years ago

Unfortunately the OKD installer does not currently appear to have oVirt as an available platform

Its not present in survey, but it does accept platform: ovirt in install-config.yaml

There are a few fixes required though:

https://github.com/openshift/installer/pull/3008 to properly uncompress FCOS images (can be worked around by uploading the image manually)
terraform-provider-ovirt's fork needs this commit - I'll make sure installer is updated too

bsperduto commented 4 years ago

Went ahead and took a stab at creating a install-config.yaml. I'm getting a terraform error like below. I'm not convinced my config file is correct though, do you have a working example you can share?

Thanks

INFO Creating infrastructure resources...
ERROR
ERROR Error: Tag not matched: expect but got ERROR
ERROR on ../../tmp/openshift-install-760331462/template/main.tf line 11, in data "ovirt_templates" "osImage": ERROR 11: data "ovirt_templates" "osImage" {
ERROR
ERROR
ERROR
ERROR Error: Tag not matched: expect but got ERROR
ERROR on ../../tmp/openshift-install-760331462/template/main.tf line 18, in data "ovirt_clusters" "clusters": ERROR 18: data "ovirt_clusters" "clusters" {
ERROR
ERROR
ERROR Failed to read tfstate: open /tmp/openshift-install-760331462/terraform.tfstate: no such file or directory FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform

vrutkovs commented 4 years ago

Check your ovirt-config.yaml. ovirt_url has to be an API endpoint, in my case for WebUI at https://foo.example.com:8443/ovirt-engine ovirt_url was https://foo.example.com:8443/ovirt-engine/api

LorbusChris commented 4 years ago

* terraform-provider-ovirt's fork needs [this commit](https://github.com/vrutkovs/terraform-provider-ovirt/commit/029b02f9d39775a68511dc0f6ac5d226a1d6f826) - I'll make sure installer is updated too

@vrutkovs let's not use a fork for the ovirt provider, mind opening a PR for that commit upstream?

bsperduto commented 4 years ago

Check your ovirt-config.yaml. ovirt_url has to be an API endpoint, in my case for WebUI at https://foo.example.com:8443/ovirt-engine ovirt_url was https://foo.example.com:8443/ovirt-engine/api

Great, I was able to progress quite a bit. It successfully created the bootstrap and masters and was able to boot them. It appeared that keepalived never came up on the bootstrap node though or it never began broadcasting on the ip as expected. The master nodes kept searching for the MCO at that ip but couldnt reach it. Do you have a command to get a log from keepalived?

vrutkovs commented 4 years ago

Use oc adm must-gather to collect the necessary info

bsperduto commented 4 years ago

When I attempt to do oc adm must-gather I'm getting a no route to host error as it's trying to connect to the keepalived instance that isn't running.

Below is the status output of the kubelet running on the bootstrap node, this seems to be the most relevant information

Feb 04 01:24:34 localhost hyperkube[4210]: I0204 01:24:34.478718 4210 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach Feb 04 01:24:34 localhost hyperkube[4210]: I0204 01:24:34.481150 4210 kubelet_node_status.go:486] Recording NodeHasSufficientMemory event message for node localhost Feb 04 01:24:34 localhost hyperkube[4210]: I0204 01:24:34.481189 4210 kubelet_node_status.go:486] Recording NodeHasNoDiskPressure event message for node localhost Feb 04 01:24:34 localhost hyperkube[4210]: I0204 01:24:34.481199 4210 kubelet_node_status.go:486] Recording NodeHasSufficientPID event message for node localhost Feb 04 01:24:34 localhost hyperkube[4210]: E0204 01:24:34.621141 4210 remote_runtime.go:200] CreateContainer in sandbox "b77d96611136e1c23d57d31d459f50d845eb54cfd3c8bd0da0b32e263fc193b7" from runtime service failed: rpc error: code = Unknown desc = container create failed: time="2020-02-04T01:24:34Z" level=error msg="container_linux.go:346: starting container process caused \"exec: \\"runtimecfg\\": executable file not found in $PATH\"" Feb 04 01:24:34 localhost hyperkube[4210]: container_linux.go:346: starting container process caused "exec: \"runtimecfg\": executable file not found in $PATH" Feb 04 01:24:34 localhost hyperkube[4210]: E0204 01:24:34.621261 4210 kuberuntime_manager.go:803] init container start failed: CreateContainerError: container create failed: time="2020-02-04T01:24:34Z" level=error msg="container_linux.go:346: starting container process caused \"exec: \\"runtimecfg\\": executable file not found in $PATH\"" Feb 04 01:24:34 localhost hyperkube[4210]: container_linux.go:346: starting container process caused "exec: \"runtimecfg\": executable file not found in $PATH" Feb 04 01:24:34 localhost hyperkube[4210]: E0204 01:24:34.621309 4210 pod_workers.go:191] Error syncing pod b9aca84bcb23f61afa3a448a5f4225f0 ("coredns-localhost_openshift-ovirt-infra(b9aca84bcb23f61afa3a448a5f4225f0)"), skipping: failed to "StartContainer" for "render-config" with CreateContainerError: "container create failed: time=\"2020-02-04T01:24:34Z\" level=error msg=\"container_linux.go:346: starting container process caused \\"exec: \\\\"runtimecfg\\\\": executable file not found in $PATH\\"\"\ncontainer_linux.go:346: starting container process caused \"exec: \\"runtimecfg\\": executable file not found in $PATH\"\n"

vrutkovs commented 4 years ago

Interesting, at some point we replaced baremetal-runtimcfg with a dummy image - but now it should be mirrored from OCP. Which OKD release is that? Could you give it a try on the latest 4.4 from https://origin-release.svc.ci.openshift.org/?

bsperduto commented 4 years ago

This was using the alpha2 release on github. I'll try a CI build tonight.

bsperduto commented 4 years ago

Was able to get through the initial pivot on the bootstrap and the master nodes tonight but it's getting stuck on etcd. This appears to be the same as #52. I do see etcd-signer running on the bootstrap node but it's apparently not signing the certs.

2020-02-05T02:30:42.726944855+00:00 stderr F + kube-client-agent request --kubeconfig=/etc/kubernetes/kubeconfig --orgname=system:etcd-servers --assetsdir=/etc/ssl/etcd --dnsnames=localhost,etcd.kube-system.svc,etcd.kube-system.svc.cluster.local,etcd.openshift-etcd.svc,etcd.openshift-etcd.svc.cluster.local --commonname=system:etcd-server:10.10.10.151 --ipaddrs=10.10.10.151,127.0.0.1 2020-02-05T02:30:53.219874108+00:00 stderr F ERROR: logging before flag.Parse: E0205 02:30:53.219549 7 agent.go:145] unable to retrieve approved CSR: the server could not find the requested resource (get certificatesigningrequests.certificates.k8s.io system:etcd-server:10.10.10.151). Retrying. 2020-02-05T02:30:56.221787614+00:00 stderr F ERROR: logging before flag.Parse: E0205 02:30:56.221719 7 agent.go:145] unable to retrieve approved CSR: the server could not find the requested resource (get certificatesigningrequests.certificates.k8s.io system:etcd-server:10.10.10.151). Retrying. 2020-02-05T02:30:59.222030319+00:00 stderr F ERROR: logging before flag.Parse: E0205 02:30:59.221949 7 agent.go:145] unable to retrieve approved CSR: the server could not find the requested resource (get certificatesigningrequests.certificates.k8s.io system:etcd-server:10.10.10.151). Retrying. 2020-02-05T02:31:02.221230756+00:00 stderr F ERROR: logging before flag.Parse: E0205 02:31:02.221141 7 agent.go:145] unable to retrieve approved CSR: the server could not find the requested resource (get certificatesigningrequests.certificates.k8s.io system:etcd-server:10.10.10.151). Retrying. 2020-02-05T02:31:03.221419613+00:00 stderr F ERROR: logging before flag.Parse: E0205 02:31:03.221336 7 agent.go:145] unable to retrieve approved CSR: the server could not find the requested resource (get certificatesigningrequests.certificates.k8s.io system:etcd-server:10.10.10.151). Retrying. 2020-02-05T02:31:03.221419613+00:00 stderr F Error: error requesting certificate: error obtaining signed certificate from signer: timed out waiting for the condition

abaxo commented 4 years ago

@bsperduto FWIW I had a successful ovirt IPI build from the CI build '4.4.0-0.okd-2020-02-05-224417' (https://origin-release.svc.ci.openshift.org/releasestream/4.4.0-0.okd/release/4.4.0-0.okd-2020-02-05-224417). I had some issues with being able to set the disk/memory/cpu, and in the end I deployed from a modified template (from a previous failed install) that had more disk, cpu and ram allocated by default otherwise the masters ran out of disk space for the various components to run. It wasn't immediately obvious to me how to do the override, but I eventually found a note in the dev readme's saying to do: export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE="ocp-rhcos"

So far as DNS I only configured 3 records as DNS is handled magically inside the cluster:

*.apps.lab.example.com pointed at the ingress_vip
api.lab.example.com pointed at the api_vip
api-int.lab.example.com pointed at the api_vip

Network wise, remember to turn the filter off on the ovirt network otherwise the VIP spoofing from keepalived won't work.

I also was finding that because I have a fairly slow internet connection, my installer was timing out, but I was able to just let things tick along and run openshift-install wait-for install-complete and that did the trick. If I had a bit more patience I would have mirrored the repo's instead.

The install-config.yaml I used was:

apiVersion: v1
baseDomain: example.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 2
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: test1
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  ovirt:
    api_vip: 10.50.106.20
    dns_vip: 10.50.106.21
    ingress_vip: 10.50.106.22
    ovirt_cluster_id: 33fe2fa8-eb60-11e9-bb60-00163e186f47
    ovirt_network_name: lab.example.com
    ovirt_storage_domain_id: e4323f39-f50b-4462-8df1-fff0dd587a9f
publish: External
pullSecret: '{"auths":{"fake":{"auth": "bar"}}}'
sshKey: ssh-rsa blahblahblahsomefakekeyhere my@machine

I also ended up with an ~/.ovirt/ovirt-config.yaml containing:

---
## The hostname or IP address of the ovirt engine
ovirt_url: https://ovirt.example.com/ovirt-engine/api
### The name of the user for accessing the ovirt engine
ovirt_username: admin@internal
## The password associated with the user
ovirt_password: some-awesome-password
ovirt_insecure: true

Again, sharing because I found it hard to find any good examples for ovirt and had to piece it together from the source.

Hope this helps!

Edit; tidied up and added ovirt network setting.

vrutkovs commented 4 years ago

Since https://origin-release.svc.ci.openshift.org/releasestream/4.4.0-0.okd/release/4.4.0-0.okd-2020-03-13-191636 oVirt IPI should work (previously workers didn't join the cluster).

Anyone has a cluster to verify that?

abaxo commented 4 years ago

ah awesome, I've been running builds over the last few days with whatever the current build was at the time and seen mixed results. I'll kick a build off in a few mins and let you know.

abaxo commented 4 years ago

Just an observation (not sure if this is intentional for oVirt)

the fedora coreos build is still attempting to pull, though it looks like the installer expects Fedora CoreOS 31.20200310.20: INFO Obtaining RHCOS image file from 'https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/31.20200113.3.1/x86_64/fedora-coreos-31.20200113.3.1-openstack.x86_64.qcow2.xz?sha256=8fd0f6da46285427565749754e74e4648c8516090e03185f526464ca07bf7f63'
Is there currently a way to specify CPU, Ram and disk for masters/compute with the ovirt provider? It seemed that the settings I thought were there weren't being passed through. Currently I am manually changing the ram available to avoid an OOM with the default memory setting of 8GB on the masters, and have had to forge a template from the one that is created in order to have additional disk space (8GB default) on each machine.

vrutkovs commented 4 years ago

it looks like the installer expects Fedora CoreOS 31.20200310.20

Installer starts with a specific stable FCOS build (31.20200113.3.1) and then updates all machines to latest ostree commit in machine-os-content of the release. This is why release page has Fedora CoreOS 31.20200310.20, but installer starts from an older version

Is there currently a way to specify CPU, Ram and disk for masters/compute with the ovirt provider?

@rgolangh is this already supported?

abaxo commented 4 years ago

Ah thanks okay that makes sense.

So the install hasn't reached 'install finished' yet, it looks like there is an issue with spinning up the worker nodes, they are defined as machines from the machineset, but they aren't being spawned on the compute side (oVirt 4.3.8).

The install itself looks to be 'ok' the issue is that the ovirt machine credentials are being created in the kube-system namespace, not in the openshift-machine-api namespace: I0314 00:06:18.953364 1 actuator.go:333] failed getting credentials for namespace openshift-machine-api, error getting credentials secret "ovirt-credentials" in namespace "openshift-machine-api": Secret "ovirt-credentials" not found E0314 00:06:18.953508 1 controller.go:279] Failed to check if machine "uk1-96v4d-worker-0-46rvd" exists: failed to create connection to oVirt API {"level":"error","ts":1584144378.9535766,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"machine_controller","request":"openshift-machine-api/uk1-96v4d-worker-0-46rvd","error":"failed to create connection to oVirt API","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/cluster-api-provider-ovirt/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/cluster-api-provider-ovirt/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/cluster-api-provider-ovirt/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

../oc --kubeconfig auth/kubeconfig get secret --all-namespaces | grep ovirt kube-system ovirt-credentials Opaque 5 172m openshift-ovirt-infra builder-dockercfg-p6zwm kubernetes.io/dockercfg 1 120m openshift-ovirt-infra builder-token-dhclm kubernetes.io/service-account-token 4 121m openshift-ovirt-infra builder-token-ljtxs kubernetes.io/service-account-token 4 120m openshift-ovirt-infra default-dockercfg-mpbzp kubernetes.io/dockercfg 1 120m openshift-ovirt-infra default-token-2xvp4 kubernetes.io/service-account-token 4 121m openshift-ovirt-infra default-token-2zqq9 kubernetes.io/service-account-token 4 171m openshift-ovirt-infra deployer-dockercfg-4mjg7 kubernetes.io/dockercfg 1 120m openshift-ovirt-infra deployer-token-qxjwf kubernetes.io/service-account-token 4 121m openshift-ovirt-infra deployer-token-zx9jg kubernetes.io/service-account-token 4 120m

As soon as I recreated the secret (in the correct namespace - openshift-machine-api - this time) the cluster is progressing again. Going to leave this running over night and see where it gets to, but I expect it'll be successful.

Install version: ../openshift-install 4.4.0-0.okd-2020-03-13-191636 built from commit f0d3afed3c4655a6514fdfc54bc40348f0aac80b release image registry.svc.ci.openshift.org/origin/release@sha256:8d33b9e48493042f6867bde243b9f0475ff7ba7ca14aca70670df14d62c13819

The config file I am using is essentially the one I posted earlier but with real hostnames/ips.

Hope this helps

vrutkovs commented 4 years ago

The install itself looks to be 'ok' the issue is that the ovirt machine credentials are being created in the kube-system namespace, not in the openshift-machine-api namespace:

Oh, interesting. We're using machine-config-operator and installer forks for OKD, so perhaps these are out of sync now

abaxo commented 4 years ago

Just to confirm that after updating the namespace the secret existed in, the cluster reached install-complete

abaxo commented 4 years ago

I had a bit of time today to mess around and these are the only things I found to be of interest:

the cloud-credential-operator appears to be unable to access the ovirt credential, my tactical fix probably worked around it by putting the credential in the 1 place it needs to be for machinesets to grow/shrink, but the proper way seems to be to provide the credential into the namespace for that operator to then dish out into the right places
There is no native storage for ovirt. This is quite painful for me, looks like . EmberCSI is available from the operator hub, which gives some nice options. I run FreeNAS as my backend and the chap who wrote the older flexvolume implementation for it has also written it for CSI.. https://github.com/democratic-csi this is amazing, deployed it earlier and it just works.
cluster-samples still only seems to have the enterprise samples available, but I am sure I saw a ticket to fix that already
Its a bit annoying that we can't yet configure cpu/memory/disk for machinesets within the IPI. I note that I had to bump the resources available to the masters from the default to 8core/16GB ram/100GB disk and built all machines from a template with the same resources.
I had an issue with my ovirt hosted-engine (logs device filled up) after a new build and everything was good, but just before I scaled up the worker pool. After I restored the hosted-engine to life (cleared the log dir and restarted) I had some workers in a state where the CSR's were not being approved (I expected that they would just get approved and everything would be fine), so I ended up running ../oc --kubeconfig auth/kubeconfig get csr -o name | xargs ../oc --kubeconfig auth/kubeconfig adm certificate approve to solve that, but otherwise scaling has been fine, too.

Overall this is seeming to be a very workable build, the credential issue is really the only blocker for a 'good' install, the other items are things that you can work around if needed.

bsperduto commented 4 years ago

I had a bit of time today to mess around and these are the only things I found to be of interest:

the cloud-credential-operator appears to be unable to access the ovirt credential, my tactical fix probably worked around it by putting the credential in the 1 place it needs to be for machinesets to grow/shrink, but the proper way seems to be to provide the credential into the namespace for that operator to then dish out into the right places

There is no native storage for ovirt. This is quite painful for me, looks like . EmberCSI is available from the operator hub, which gives some nice options. I run FreeNAS as my backend and the chap who wrote the older flexvolume implementation for it has also written it for CSI.. https://github.com/democratic-csi this is amazing, deployed it earlier and it just works.

cluster-samples still only seems to have the enterprise samples available, but I am sure I saw a ticket to fix that already

Its a bit annoying that we can't yet configure cpu/memory/disk for machinesets within the IPI. I note that I had to bump the resources available to the masters from the default to 8core/16GB ram/100GB disk and built all machines from a template with the same resources.

I had an issue with my ovirt hosted-engine (logs device filled up) after a new build and everything was good, but just before I scaled up the worker pool. After I restored the hosted-engine to life (cleared the log dir and restarted) I had some workers in a state where the CSR's were not being approved (I expected that they would just get approved and everything would be fine), so I ended up running ../oc --kubeconfig auth/kubeconfig get csr -o name | xargs ../oc --kubeconfig auth/kubeconfig adm certificate approve to solve that, but otherwise scaling has been fine, too.

Overall this is seeming to be a very workable build, the credential issue is really the only blocker for a 'good' install, the other items are things that you can work around if needed.

What stage of the install did you copy the secret over during? I attempted earlier today but was unsuccessful and may need to do it earlier in the process.

abaxo commented 4 years ago

I was probably 1.5hours in to the install at this point (my connection is slow), it was after the point that the cluster created worker machines, but didn't spawn them. If you take a look at the logs for openshift-machine-api operator, it'll give you a clue if it is locating the credentials or not (https://github.com/openshift/okd/issues/61#issuecomment-598980286).

Just make sure your secret is named ovirt-credentials - I literally just ran: ./oc --kubeconfig auth/kubeconfig get secret ovirt-credentials -n kube-system -o yaml > ovirt-secret.yaml then modified the yaml to update the namespace ./oc --kubeconfig auth/kubeconfig create -f ovirt-secret.yaml

If you do an oc get pods --all-namespaces the only pods that are pending should be the ones that are trying to start on worker nodes such as the router pods.

rgolangh commented 4 years ago

I had a bit of time today to mess around and these are the only things I found to be of interest:

* the cloud-credential-operator appears to be unable to access the ovirt credential, my tactical fix probably worked around it by putting the credential in the 1 place it needs to be for machinesets to grow/shrink, but the proper way seems to be to provide the credential into the namespace for that operator to then dish out into the right places

cloud-credentials-operator works with CredentialsRequest object. ovirt machine controller creates one and the credentials controller create the secret under its namespace. If you have logs from both components that would be good.

* There is no native storage for ovirt. This is quite painful for me, looks like . EmberCSI is available from the operator hub, which gives some nice options. I run FreeNAS as my backend and the chap who wrote the older flexvolume implementation for it has also written it for CSI.. https://github.com/democratic-csi this is amazing, deployed it earlier and it just works.

CSI driver, with an opertaor to deploy it are almost done https://github.com/openshift/ovirt-csi-driver

You can pick up the CSI driver and deploy it manually manually - look under deploy folder. I didn't had the chance to straiten the README's yet.

* cluster-samples still only seems to have the enterprise samples available, but I am sure I saw a ticket to fix that already

* Its a bit annoying that we can't yet configure cpu/memory/disk for machinesets within the IPI. I note that I had to bump the resources available to the masters from the default to 8core/16GB ram/100GB disk and built all machines from a template with the same resources.

* I had an issue with my ovirt hosted-engine (logs device filled up) after a new build and everything was good, but just before I scaled up the worker pool. After I restored the hosted-engine to life (cleared the log dir and restarted) I had some workers in a state where the CSR's were not being approved (I expected that they would just get approved and everything would be fine), so I ended up running `../oc --kubeconfig auth/kubeconfig get csr -o name | xargs ../oc --kubeconfig auth/kubeconfig adm certificate approve` to solve that, but otherwise scaling has been fine, too.

Probably a matter of timing. I think there is a window of 10 minutes to approve new hosts. Worst case, delete the machine and it will be recreated. (next time :))

Overall this is seeming to be a very workable build, the credential issue is really the only blocker for a 'good' install, the other items are things that you can work around if needed.

Happy this is working. An issue worth noting for hosted engine installs (and perhaps you have something to add) - https://bugzilla.redhat.com/show_bug.cgi?id=1813725

vrutkovs commented 4 years ago

I think its worth filing separate issues for each encountered problem - especially the ovirt-credentials one.

I was probably 1.5hours in to the install at this point (my connection is slow)

At this point we don't push images to Quay, but this can be worked around:

Mirror the images to a local registry - https://docs.okd.io/latest/installing/install_config/installing-restricted-networks-preparations.html, that's about 4GB now
Block registry.svc.openshift.org in the DNS so that the mirror would be used first

cluster-samples still only seems to have the enterprise samples available, but I am sure I saw a ticket to fix that already

https://github.com/openshift/okd/issues/34

Its a bit annoying that we can't yet configure cpu/memory/disk for machinesets within the IPI

Make a new customized template and set OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE env var to use it

rgolangh commented 4 years ago

Opened https://bugzilla.redhat.com/show_bug.cgi?id=1813741 to fix the machine configuration

abaxo commented 4 years ago

Opened https://bugzilla.redhat.com/show_bug.cgi?id=1813741 to fix the machine configuration

Is it worth also including network configuration here too? I know from experience I had customers who wanted to place say router nodes in the DMZ zone, and nodes inside their app network for the actual workload, while the control plane lived in another zone.

abaxo commented 4 years ago

Okay so, I've had a bit of time to progress a bit: 1) havent been able to work out why the cloud credentials operator isnt dishing out credentials. I am not too sure why, I can't see any reason why it wouldn't be able to work but it was only an short look. Perhaps if the credential was initially created in the kube-system namespace there is a permission missing to allow the controller to access the openshift-machine-api namespace where it expects that secret? I don't know, just guessing.

---
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  creationTimestamp: "2020-03-13T21:18:56Z"
  finalizers:
  - cloudcredential.openshift.io/deprovision
  generation: 1
  labels:
    controller-tools.k8s.io: "1.0"
  name: openshift-machine-api-ovirt
  namespace: openshift-cloud-credential-operator
  resourceVersion: "1927"
  selfLink: /apis/cloudcredential.openshift.io/v1/namespaces/openshift-cloud-credential-operator/credentialsrequests/openshift-machine-api-ovirt
  uid: 6c3fe339-d6d0-49fe-bfc6-d883712a7476
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: OvirtProviderSpec
  secretRef:
    name: ovirt-credentials
    namespace: openshift-machine-api
status:
  conditions:
  - lastProbeTime: "2020-03-13T21:20:42Z"
    lastTransitionTime: "2020-03-13T21:20:42Z"
    message: cloud creds are insufficient to satisfy CredentialsRequest
    reason: CloudCredsInsufficient
    status: "True"
    type: InsufficientCloudCreds
  lastSyncGeneration: 0
  provisioned: false

2) I overrode the image stream locations for PHP + Mysql based on the https://github.com/openshift/library/blob/master/community/mysql/imagestreams/mysql-centos7.json image streams. 3) I also deployed the ovirt-csi-driver (this is an amazing discovery, thank you!) I had an issue with the ovirt-credentials secret not being provisioned by the credentials operator (the request was correct), but once I put the secret in place it worked a treat. 4) Once storage was in place I was able to deploy the registry. Worth noting here that because there is no storage out of the box, you have to update the registry operator management state from removed to unmanaged. I followed the steps here: https://docs.openshift.com/container-platform/4.3/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html 5) Once the registry was deployed I could finally run a successful build, complete with persistent storage and have my test workload successfully running.

@rgolangh wrt nodes coming up and missing the csr signing, I agree it is probably just a timing issue. The issue certainly hasn't appeared since and I have done a couple of grow/shrinks of the worker nodes.

@vrutkovs Do you want me to raise the credentials one somewhere? TBH I was planning on taking the hit with the image sizes for a management cluster, and then running a registry (like go harbor since Quay core doesnt seem to be available)

Totally unrelated to OKD and off to the side - I only mention because I said it worked fine earlier; the FreeNAS CSI driver, everything worked up until I tried to consume the PVC with a pod. At that point the event stream was complaining the csidriver wasn't registered (it was, but the socket didnt exist in the right place)

Cheers all

vrutkovs commented 4 years ago

@vrutkovs Do you want me to raise the credentials one somewhere? TBH I was planning on taking the hit with the image sizes for a management cluster, and then running a registry (like go harbor since Quay core doesnt seem to be available)

Lets file a new OKD issue on that. It seems to be the only blocking bug, but I'd like to get another confirmation (could it be an oVirt misconfiguration)?

abaxo commented 4 years ago

It's possible that it could be, but I am just using the default internal admin account which has no restrictions. I'm happy to jump on a bluejeans or something if you want to validate.

abaxo commented 4 years ago

So I just ran through a new install against 4.4.0-0.okd-2020-03-16-194045.

This time I built from a registry mirror, that worked nicely
I also built with a proxy - this is also working. I am catching some traffic heading off to registry.redhat.io, presumably this is the controller trying to pull imagestreams. I can also see traffic going to the oauth-openshift.apps.example.com route, I didn't expect to see that over the proxy. Worth noting traffic to api.openshift.com (presumably telemetry and insights), and traffic going to some random cloudfront address (I bet this is the ovirt-csi-driver from quay.io)
cloud credential operator pulled the same trick as earlier, not a big deal now I know to fix it
I did need to invoke ./openshift-install wait-for install-complete otherwise it appeared to fail, but it was just barely by the skin of its teeth this time too slow. 40mins down from 2hours.
Providing you updated the yaml files for the ovirt-csi-driver with your storage settings, then it 'just works' by doing oc --kubeconfig auth/kubeconfig create -f ovirt-csi-driver/deploy/csi-driver/. Did also need to put the ovirt-credentials secret in this namespace while the cloud credential operator is being naughty

kube-apiserver-operator reports that ovirt is an unrecognised platform:

kube-apiserver-operator
No recognized cloud provider platform found in infrastructures.config.openshift.io/cluster.status.platform

It would be a nicer behavior if the registry started with emptyDir: {} storage, like it can on okd 3. I found not having the registry the first time to be a bit confusing until I read the doc and figured out why.

vrutkovs commented 4 years ago

I also built with a proxy - this is also working

:tada:

I am catching some traffic heading off to registry.redhat.io, presumably this is the controller trying to pull imagestreams

Yes, this is (misconfigured) samples operator - see #34

Worth noting traffic to api.openshift.com (presumably telemetry and insights)

This can be disabled by setting "fake" pullsecret. Filed https://github.com/openshift/okd/issues/107 to have this mentioned in official docs.

40mins down from 2hours.

A bit odd, maybe @rgolangh has thoughs on why post-bootstrap barely makes it in 40 mins? In any case lets file a new bug to track this

kube-apiserver-operator reports that ovirt is an unrecognised platform

That's pretty much expected since kubelet doesn't have an ovirt cloud controller. These are legacy anyway

It would be a nicer behavior if the registry started with emptyDir: {} storage

We can't make that decision for you, as emptyDir registry storage cannot be migrated to a different storage provider afterwards.

Its odd that ovirt-csi-driver could not create a PVC for registry - I guess it doesn't support RWX volumes?

vrutkovs commented 4 years ago

So I'm going to close this as oVirt (more or less) works, the blocker bug has a workaround and its worth having separate tickets for each problem.

Thanks for testing this!

okd-project / okd

oVirt IPI #61