This will be the first BM cluster deployment using the ACM moc-infra; we can try the deployment on 3 other BM servers before tearing apart the CNV cluster. We have 2 other BM servers available in the old CNV2 (3 were taken to moc-infra).
@ipolonsk I would be more inclined to just tear down the cnv cluster and rebuild it. That's pretty much what it's there for.
@larsks Agreed, I'd like to do it together. Can we schedule an hour or an hour and a half for it?
@ipolonsk how about @ 10:30 Eastern? That's about 40 minutes from now.
I'd say we wait until tomorrow and plan it properly. Let's schedule a proper timeslot so @HumairAK and I can join as well; we need to learn this too.
Also, it's not possible for me to migrate the current CNV setup properly to a new cluster today (it's already 5 PM over here; if we start installing right now it'll be 7 PM before I can start monitoring the migrations), and @HumairAK is on PTO.
And since the email didn't go out last week, many users may not be aware we're about to tear down the cluster. Let me send out an announcement and tear the cluster down tomorrow. Let's not rush it, please.
^ @larsks @ipolonsk
I think it's a good idea to do it all together on the same day, with a proper timeslot for the deployment and the migrations (I'd love to learn this process too), so we have time to prepare. I prefer Thursday afternoon (Israel time).
Schedule for Monday:
I have a 9AM meeting (US/Eastern), so I probably won't be working on the cluster re-install until after that. If you haven't been following https://github.com/open-infrastructure-labs/ops-issues/issues/36, we spent some time on Friday testing out ACM + baremetal on another cluster. We worked through a few problems that would otherwise have bitten us on Monday, so that was useful (we ultimately weren't successful but that was due to hardware issues).
I may shut down the CNV servers earlier on Monday and run a couple of tests (I want to verify that network device names look the same with the new version of RHCOS).
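If it helps, one way to spot-check interface naming across RHCOS versions is to compare the udev naming properties on each boot. A minimal sketch; `eno1` is just a placeholder interface name:

```bash
# List all interfaces with their current names and state.
ip -br link show

# Show the predictable-name properties udev computed for one NIC ("eno1" is
# a placeholder); if the ID_NET_NAME_* values differ between RHCOS versions,
# the device names will differ too.
udevadm info -q property /sys/class/net/eno1 | grep '^ID_NET_NAME'
```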
I'll drop a meet link into this issue when I start the install, if folks want to hang out.
@larsks we do want to do the re-install live and stream it for Operate First folks (https://github.com/operate-first/SRE/issues/104#issuecomment-789764485). Would you mind if I scheduled the stream and sent the invite to you (and all active participants as well)?
I'll schedule it for 10AM EST.
Update: I've created the event. @larsks @ipolonsk @HumairAK should be invited as active participants. Feel free to extend the presenter invite to anybody else interested in active participation.
Stream link: https://stream.meet.google.com/stream/dfb0df94-d5a3-4230-a71e-67b2f46239f6
Calendar invite: https://calendar.google.com/event?action=TEMPLATE&tmeid=MHU4ZDQ4bWJqYmVvdWo4ZmIyMmc4a3VvaGogdGNvdWZhbEByZWRoYXQuY29t&tmsrc=tcoufal%40redhat.com
The manifests that drive ACM are available in https://github.com/open-infrastructure-labs/zero-openshift-install. That's currently a private repository because it's essentially all secrets; I'm open to suggestions for opening it up somehow. Note that the `master` branch in that repository is currently generated from templates (in the `feature/templates` branch).
The cluster nodes are offline and I've updated DNS so that the `*.zero.massopen.cloud` hostnames are active.
Ran into some issues installing the zero cluster; deployment is still ongoing.
Issue 1 - the libvirt URI for the cluster was missing from the baremetal cluster deployment config. This resulted in the provisioner trying to connect to a local socket instead of the appropriate URI.
Issue 2 - duplicated known-hosts configuration meant we didn't update the known hosts in the right place (I think this was the cluster config itself; someone else can confirm), which in turn led to an incorrect known_hosts file being populated in the provisioner pod, resulting in host key verification failures.
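For the record, the shape of the fix is roughly the following. This is only a sketch: the hostname, secret name, and namespace are placeholders, not the actual resources we used:

```bash
# Regenerate a known-hosts entry for the provisioning host (hostname assumed).
ssh-keyscan -H provisioner.zero.massopen.cloud > known_hosts

# Replace the secret that, in this sketch, supplies the provisioner pod's
# known_hosts file; the secret name and namespace are placeholders.
oc -n open-cluster-management create secret generic zero-known-hosts \
    --from-file=known_hosts --dry-run=client -o yaml | oc apply -f -
```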
Encountered hardware inspection issues with controller 2: the disk discovered during inspection does not match the root device hint. We suspect an invalid root device hint.
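One way to compare the configured hint with what inspection actually found (a sketch; the namespace and BareMetalHost name are assumptions):

```bash
# Show the configured root device hint on the BareMetalHost.
oc -n openshift-machine-api get bmh zero-controller-2 \
    -o jsonpath='{.spec.rootDeviceHints}{"\n"}'

# Show the disks that hardware inspection discovered.
oc -n openshift-machine-api get bmh zero-controller-2 \
    -o jsonpath='{.status.hardware.storage[*].name}{"\n"}'
```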
Master node creation is complete:
time="2021-03-08T18:03:03Z" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[2]: Creation complete after 19m12s [id=dc27aa38-0b53-47d8-9712-55a9f433643e]"
time="2021-03-08T18:03:48Z" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[1]: Creation complete after 19m58s [id=03329e02-757a-45d9-8ed3-55ba3bbff879]"
time="2021-03-08T18:15:03Z" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[0]: Creation complete after 31m13s [id=b2ba293b-df86-4295-93ef-3c7f68d16095]"
Encountered:
time="2021-03-08T18:15:26Z" level=error
time="2021-03-08T18:15:26Z" level=error msg="Error: could not fetch data from user_data_url: GET https://192.12.185.104:22623/config/master giving up after 5 attempts"
time="2021-03-08T18:15:26Z" level=error
time="2021-03-08T18:15:26Z" level=error msg=" on ../tmp/openshift-install-080354864/masters/main.tf line 38, in resource \"ironic_deployment\" \"openshift-master-deployment\":"
time="2021-03-08T18:15:26Z" level=error msg=" 38: resource \"ironic_deployment\" \"openshift-master-deployment\" {"
time="2021-03-08T18:15:26Z" level=error
Re: the error in the earlier comment, it's not clear where it's originating from. The address 192.12.185.104 is the API VIP address for the cluster we're installing. We were able to access that URL from the provisioning host, from the bootstrap VM, and from arbitrary locations on the internet.
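The check itself is just an HTTP fetch of the rendered ignition config from the Machine Config Server. A sketch of the kind of request involved (on recent releases the MCS requires an ignition Accept header; the exact version string depends on the release):

```bash
# Try to fetch the master ignition config from the API VIP on the MCS port.
# -k skips TLS verification, since the MCS uses a cluster-internal CA.
curl -k -H 'Accept: application/vnd.coreos.ignition+json;version=3.1.0' \
    https://192.12.185.104:22623/config/master
```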
I've opened https://access.redhat.com/support/cases/#/case/02887426 on the install failure.
Chris Doan points to https://access.redhat.com/solutions/5709711 which is relevant, but not a solution.
Dan Winship, author of https://github.com/openshift/origin/pull/22821, says:
> Yes. You can’t install OCP inside OCP. This is a known problem. There is no supported workaround.
> The port blocking was supposed to be temporary until MCS was rearchitected to make it unnecessary, but that never happened.
> I think there is an epic about fixing this in the MCS but I don’t have the link.
So the suggestion here is that our install problem is a result of that PR, and that there's no way to make it functional right now. I'm going to try to reach out to ACM product management for an authoritative take on this issue.
https://github.com/openshift/enhancements/pull/626 appears to be an attempt to resolve the issue, but it's very new (opened 2/3) and is only a proposal, not code.
There has been a response on the support case, which includes a pointer to https://bugzilla.redhat.com/show_bug.cgi?id=1936443 ("Hive based OCP IPI baremetal installation fails to connect to API VIP port 22623"). That's a high priority bug that was filed today.
Okay, so after thinking about this, we should be able to work around the problem: use `nsenter` to enter the container namespace and remove the problematic firewall rules. I'd like to try that today. Do folks want to set up a call again?
Let's do it! :+1:
Questions:

- We removed the `finalizers`, but this left node artifacts in Ironic. Is this a problem that would prevent successfully re-adding the nodes?

OpenShift is reporting the missing nodes as "Discovered"; we will click-provision them in the console.
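For context, the finalizer removal was presumably something along these lines. This is a hypothetical reconstruction; the namespace and BareMetalHost name are placeholders:

```bash
# Clear the finalizers on a BareMetalHost so its deletion can complete;
# "zero-worker-0" is a placeholder name.
oc -n openshift-machine-api patch bmh zero-worker-0 --type merge \
    -p '{"metadata":{"finalizers":null}}'
```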
@tumido yes, I've been working with those nodes, and am actually opening a support case about some problems getting them added back to the cluster. Please don't do anything with them.
Sorry, we did :sob: We're sorry man, we clicked "Approve"
I am going to continue poking at those two nodes that were added post-install. Expect them to bounce a few times.
A slightly more formal write-up of our workaround:
1. Start the install and wait for the hive container to start.
2. Figure out which node the install pod is running on:
   ```bash
   node=$(
     oc get pod zero-0-rt5sp-provision-kjh8c -o json |
       jq -r .spec.nodeName
   )
   ```
3. Log into that node and get the pid of the hive installer:

   ```bash
   oc debug node/$node
   chroot /host
   ```
4. Get the id of the pod:

   ```bash
   pod=$(crictl pods --name zero-0-rt5sp-provision-kjh8c -q)
   ```
5. Get the pid of a container running in that pod:

   ```bash
   pid=$(
     crictl ps -q --pod=$pod | head -1 |
       xargs crictl inspect -o go-template --template '{{.info.pid}}'
   )
   ```
6. Use `nsenter` to enter the network namespace of the pod and run `iptables` commands to delete the problematic rules:

   ```bash
   nsenter -t $pid -n iptables -D FORWARD -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
   nsenter -t $pid -n iptables -D FORWARD -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
   nsenter -t $pid -n iptables -D OUTPUT -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
   nsenter -t $pid -n iptables -D OUTPUT -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
   ```
We also removed the corresponding iptables rules in the global network namespace.
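The host-namespace side of that cleanup looks similar; a sketch, run from the `chroot /host` shell on the node:

```bash
# List any rules touching the MCS ports in the host network namespace...
iptables -S | grep -E 'dport 2262[34]'

# ...and delete each match by replaying it with -D in place of -A, e.g.:
iptables -D FORWARD -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
```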
@larsks can you point me at how to grant myself login permission on the Argo CD console, and admin access to the zero cluster?
As discussed in https://github.com/operate-first/SRE/issues/62, we're gonna reinstall via ACM and upgrade to 4.7 directly.