Re-install CNV cluster using ACM

tumido commented 3 years ago

As discussed here https://github.com/operate-first/SRE/issues/62 we're gonna reinstall via ACM and upgrade to 4.7 directly

ipolonsk commented 3 years ago

This will be the first BM cluster deployment using the ACM moc-infra, we can try the deployment on 3 other BM servers before tearing apart the CNV cluster. We have 2 other BM servers available in the old CNV2 ( 3 were taken to moc-infra).

larsks commented 3 years ago

@ipolonsk I would be more inclined to just tear down the cnv cluster and rebuild it. That's pretty much what it's there for.

ipolonsk commented 3 years ago

@larsks Agree, I would like to do it together, can we schedule an hour/hour and a half to do it together?

larsks commented 3 years ago

@ipolonsk how about @ 10:30 Eastern? That's about 40 minutes from now.

tumido commented 3 years ago

I'd say we wait with it till tomorrow and plan on it properly. Let's schedule a proper timeslot so me and @HumairAK can join as well. We need to learn this as well.

Also it's not possible for me to migrate current CNV setup properly to a new cluster today (it's already 5PM over here, if we start installing right now it'll be 7PM for me to start monitoring the migrations) and @HumairAK's on PTO.

And since the email didn't go out last week, many users may not be aware we're about to tear down the cluster. Let me send out an announcement and tear the cluster down tomorrow. Let's not rush it, please.

tumido commented 3 years ago

^ @larsks @ipolonsk

ipolonsk commented 3 years ago

I think this is a good idea to do it all together on the same day and have a proper timeslot for the deployment and the migrations (would love to know this process too), and have the time to prepare for that. I prefer Thursday afternoon(Israel time).

larsks commented 3 years ago

Schedule for Monday:

I have a 9AM meeting (US/Eastern), so I probably won't be working on the cluster re-install until after that. If you haven't been following https://github.com/open-infrastructure-labs/ops-issues/issues/36, we spent some time on Friday testing out ACM + baremetal on another cluster. We worked through a few problems that would otherwise have bitten us on Monday, so that was useful (we ultimately weren't successful but that was due to hardware issues).

I may shut down the CNV servers earlier on Monday and run a couple of tests (I want to verify that network device names look the same with the new version of RHCOS).

I'll drop a meet link into this issue when start the install if folks want to hang out.

tumido commented 3 years ago

@larsks we do want to do the re-install live and stream it for Operate First folks (https://github.com/operate-first/SRE/issues/104#issuecomment-789764485). Would you mind if I schedule the stream and sent the invite to you? (and all active participants as well?)

I'll schedule it for 10AM EST.

tumido commented 3 years ago

Update. I've created the event. @larsks @ipolonsk @HumairAK should be invited as active participants. Feel free to extend the [Presenters] invite to anybody else interested in active participation.

Stream link: https://stream.meet.google.com/stream/dfb0df94-d5a3-4230-a71e-67b2f46239f6 Calendar invite: https://calendar.google.com/event?action=TEMPLATE&tmeid=MHU4ZDQ4bWJqYmVvdWo4ZmIyMmc4a3VvaGogdGNvdWZhbEByZWRoYXQuY29t&tmsrc=tcoufal%40redhat.com

larsks commented 3 years ago

The manifests that drive ACM are available in https://github.com/open-infrastructure-labs/zero-openshift-install. That's currently a private repository because it's essentially all secrets; I'm open to suggestions for opening it up somehow. Note that the master branch in that repository is currently generated from templates (in the feature/templates branch).

larsks commented 3 years ago

The cluster nodes are offline and I've updated dns so that the *.zero.massopen.cloud hostnames are active.

HumairAK commented 3 years ago

Ran into some issues with installing zero cluster, deployment still ongoing.

Issue 1 - libvirt uri for the cluster was missing in the baremetal cluster deployment config, this resulted in the provisioner trying to connect to a local socket instead of the appropriate uri

Issue 2 - Duplication of knownhosts configurations led to us not updating the known hosts in the right places (I think this was the cluster config itself, someone else can confirm), which in turn led to incorrect known hosts file being populated in the provisioner pod, resulting in host key verification failures.

HumairAK commented 3 years ago

Encountered hardware inspections issues with controller 2. The discovered disk does not match the root hint. suspected invalid root device hint.

HumairAK commented 3 years ago

Master nodes creation complete

time="2021-03-08T18:03:03Z" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[2]: Creation complete after 19m12s [id=dc27aa38-0b53-47d8-9712-55a9f433643e]"

time="2021-03-08T18:03:48Z" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[1]: Creation complete after 19m58s [id=03329e02-757a-45d9-8ed3-55ba3bbff879]"

time="2021-03-08T18:15:03Z" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[0]: Creation complete after 31m13s [id=b2ba293b-df86-4295-93ef-3c7f68d16095]"

HumairAK commented 3 years ago

Encountered:

time="2021-03-08T18:15:26Z" level=error
time="2021-03-08T18:15:26Z" level=error msg="Error: could not fetch data from user_data_url: GET https://192.12.185.104:22623/config/master giving up after 5 attempts"
time="2021-03-08T18:15:26Z" level=error
time="2021-03-08T18:15:26Z" level=error msg="  on ../tmp/openshift-install-080354864/masters/main.tf line 38, in resource \"ironic_deployment\" \"openshift-master-deployment\":"
time="2021-03-08T18:15:26Z" level=error msg="  38: resource \"ironic_deployment\" \"openshift-master-deployment\" {"
time="2021-03-08T18:15:26Z" level=error

larsks commented 3 years ago

Re: the error in the earlier comment, it's not clear if it's originating from:

Something on the ACM cluster, in which case there are some suspicious firewall rules in place that could be preventing the communication, or
Something on the bootstrap vm, or
Something running on one of the target baremetal hosts

The address 192.12.185.104 is the API VIP address for the cluster we're installing.

We were able to access that URL from the provisioning host, from the bootstrap vm, and from arbitrary locations on the internet.

larsks commented 3 years ago

I've opened https://access.redhat.com/support/cases/#/case/02887426 on the install failure.

larsks commented 3 years ago

Chris Doan points to https://access.redhat.com/solutions/5709711 which is relevant, but not a solution.

larsks commented 3 years ago

Dan Winship, author of https://github.com/openshift/origin/pull/22821, says:

Yes. You can’t install OCP inside OCP. This is a known problem. There is no supported workaround.

The port blocking was supposed to be temporary until MCS was rearchitected to make it unnecessary but that never happened.

I think there is an epic about fixing this in the MCS but I don’t have the link

So the suggestion here is that our install problem is a result of that PR, and that there's no way to make it functional right now. I'm going to try to reach out to ACM product management for an authoritative take on this issue.

larsks commented 3 years ago

https://github.com/openshift/enhancements/pull/626 appears to be an attempt to resolve the issue, but it's very new (2/3) and is only a proposal, not code.

larsks commented 3 years ago

There has been a response on the support case, which includes a pointer to https://bugzilla.redhat.com/show_bug.cgi?id=1936443 ("Hive based OCP IPI baremetal installation fails to connect to API VIP port 22623"). That's a high priority bug that was filed today.

larsks commented 3 years ago

Okay, so after thinking about this, we should be able to work around the problem.

Start the install
Wait for the hive container to start
Find out on which controller the hive container is running
Log into that controller, find the pid of the hive container
Use nsenter to enter the container namespace and remove the problematic firewall rules

I'd like to try that today. Do folks want to set up a call again?

tumido commented 3 years ago

Let's do it! :+1:

larsks commented 3 years ago

Questions:

[ ] Can we add node labels to install metadata? We need to apply a label to the storage nodes before their local disks are considered by the local storage operator.
[ ] We had problems deleting hosts that failed to provision. I was able to get them to delete by removing the finalizers, but this left node artifacts in Ironic. Is this a problem that would prevent successfully re-adding the nodes?
[ ] We didn't wipe the ceph drives before the re-install. Is there a way to identify volumes on disk that are not in use by the current openshift deployment (so that we can free up space)?
[ ] The workflow for adding new nodes has changed. When adding a new BMH, the cluster seems to automatically discover a new Node. After an approval workflow, the new node is part of the cluster, but it is not part of machineset and there is no associated machine resources. What's going on there?

tumido commented 3 years ago

OpenShift is reporting the missing nodes as "Discovered", we will click-provision them in the console:

larsks commented 3 years ago

@tumido yes, I've been working with those nodes, and actually opening a support case on some problems getting them added back to the cluster. Please don't do anything with them.

tumido commented 3 years ago

Sorry, we did :sob: We're sorry man, we clicked "Approve"

larsks commented 3 years ago

I am going to continue poking at those two nodes that were added post-install. Expect them to bounce a few times.

larsks commented 3 years ago

A slightly more formal write-up of our workaround:

Start the install and wait for the hive container to start.

Figure out on which node the install pod is running:

node=$(
oc get pod zero-0-rt5sp-provision-kjh8c -o json |
jq -r .spec.nodeName
)

Log into that node and get the pid of the hive installer:
```
oc debug node/$node
chroot /host
```

Get the id of the pod:

pod=$(crictl pods --name zero-0-rt5sp-provision-kjh8c -q)

Get the pid of a container running in that pod:

pid=$(
crictl ps -q --pod=$pod | head -1 |
xargs crictl inspect -o go-template --template '{{.info.pid}}'
)

Use nsenter to enter the network namespace of the pod and run iptables commands:

nsenter -t $pid -n -D FORWARD -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
nsenter -t $pid -n -D FORWARD -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
nsenter -t $pid -n -D OUTPUT -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
nsenter -t $pid -n -D OUTPUT -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable

We also removed the corresponding iptables rules in the global network namespace.

ipolonsk commented 3 years ago

@larsks can you point me to how I can add myself login permission to the argocd console? and an admin for the zero cluster?

open-infrastructure-labs / ops-issues

Re-install CNV cluster using ACM #21