Closed: larsks closed this issue 2 years ago.
@HumairAK @tumido please chime in with naming suggestions for the new cluster. Also, would you like this in the operate-first.cloud domain?
Operate first domain would be nice.
I propose we start naming all our clusters after characters from a creative universe like LOTR, and start by naming this one Gandalf. I would also love to be able to say things like "Sorry users, there's something wrong with gandalf right now, we're working on a fix".
Feel free to ignore this suggestion entirely, I'm terrible with names :)
Paging @durandom @4n4nd if you guys have other ideas.
We're sorry, Smaug isn't flying today. 🐉
> and start by naming this one Gandalf
I love Gandalf but I think this cluster should be dedicated to a Baggins for their sacrifice and also because the ring couldn't corrupt Frodo.
💯 for the operate-first.cloud DNS. @tumido do you have access to the required DNS stuff?
As for the naming: we have Rick from Rick'n'Morty in EMEA. But I also ❤️ 🧙 and 🐉. Let's be an inclusive community and call the beast Smaug 🔥
:dragon_face: Smaug :fire:
@tumido approves
I wonder which of the workloads on the cluster will be our Arkenstone.
I can help with the DNS, tell me what you need. :slightly_smiling_face:
Okay, I've learned more about the network on which we are deploying these hosts so we can safely pre-allocate some addresses.
@tumido here's what we need to start with:
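The actual record list was provided out of band; for context, a bare-metal OpenShift cluster generally needs an API VIP, an internal API record, and a wildcard for the ingress apps domain. A hypothetical zone-file sketch under those assumptions (the addresses below are documentation placeholders, not the real allocations):

```zone
; Illustrative only: real addresses were pre-allocated on the deployment network.
api.smaug.na.operate-first.cloud.     IN A 192.0.2.10 ; API VIP (placeholder)
api-int.smaug.na.operate-first.cloud. IN A 192.0.2.10 ; internal API (placeholder)
*.apps.smaug.na.operate-first.cloud.  IN A 192.0.2.11 ; ingress VIP (placeholder)
```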
@durandom was asking about hardware. Here's what we have (also attached as a CSV file):
controller
hostname role bmc_address cpu_model cpu_cores disk_gb memory_gb manufacturer model
----------------- ---------- ------------- ----------------------------------------- ----------- -------- ---------- -------------- --------------
oct-10-05-control controller 10.3.10.5 Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz 48 447 384 Dell Inc. PowerEdge R640
oct-10-04-control controller 10.3.10.4 Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz 48 447 384 Dell Inc. PowerEdge R640
oct-10-06-control controller 10.3.10.6 Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz 48 447 384 Dell Inc. PowerEdge R640
compute
hostname role bmc_address cpu_model cpu_cores disk_gb memory_gb manufacturer model
----------------- ------- ------------- ----------------------------------------- ----------- -------- ---------- -------------- ----------------
oct-03-31-compute compute 10.3.3.31 Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz 32 372 256 Dell Inc. PowerEdge R620
oct-03-13-compute compute 10.3.3.13 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-07-compute compute 10.3.3.7 Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz 32 372 112 Dell Inc. PowerEdge R620
oct-03-10-compute compute 10.3.3.10 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-03-compute compute 10.3.3.3 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-14-compute compute 10.3.3.14 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-15-compute compute 10.3.3.15 Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-04-compute compute 10.3.3.4 Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-20-compute compute 10.3.3.20 Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz 32 372 256 Dell Inc. PowerEdge R620
oct-03-22-compute compute 10.3.3.22 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-11-compute compute 10.3.3.11 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-26-compute compute 10.3.3.26 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-00-compute compute 10.3.3.0 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 384 Dell Inc. PowerEdge R720xd
oct-03-23-compute compute 10.3.3.23 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-25-compute compute 10.3.3.25 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-01-compute compute 10.3.3.1 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 384 Dell Inc. PowerEdge R720xd
oct-03-24-compute compute 10.3.3.24 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-09-compute compute 10.3.3.9 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-19-compute compute 10.3.3.19 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-28-compute compute 10.3.3.28 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-29-compute compute 10.3.3.29 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-06-compute compute 10.3.3.6 Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-08-compute compute 10.3.3.8 Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-12-compute compute 10.3.3.12 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 128 Dell Inc. PowerEdge R620
oct-03-27-compute compute 10.3.3.27 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-21-compute compute 10.3.3.21 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 256 Dell Inc. PowerEdge R620
oct-03-17-compute compute 10.3.3.17 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 256 Dell Inc. PowerEdge R620
oct-03-02-compute compute 10.3.3.2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 40 372 384 Dell Inc. PowerEdge R720xd
oct-03-05-compute compute 10.3.3.5 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz 32 372 128 Dell Inc. PowerEdge R620
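For a quick sanity check of total capacity, the inventory can be tallied with a short script. A minimal sketch, with the rows abbreviated to the three controllers plus two sample compute nodes (in practice you'd read the attached CSV instead of hard-coding rows):

```python
# Tally CPU cores and memory (GB) per role from the hardware inventory.
# Rows abbreviated for illustration; the real input is the attached CSV.
inventory = [
    # (hostname, role, cpu_cores, memory_gb)
    ("oct-10-04-control", "controller", 48, 384),
    ("oct-10-05-control", "controller", 48, 384),
    ("oct-10-06-control", "controller", 48, 384),
    ("oct-03-00-compute", "compute", 40, 384),
    ("oct-03-31-compute", "compute", 32, 256),
]

def totals(rows, role):
    """Sum cores and memory (GB) for all hosts with the given role."""
    cores = sum(r[2] for r in rows if r[1] == role)
    memory = sum(r[3] for r in rows if r[1] == role)
    return cores, memory

print(totals(inventory, "controller"))  # (144, 1152)
print(totals(inventory, "compute"))     # (72, 640)
```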
Smaug is a fitting name for this beast
this looks awesome! :star_struck:
After the power outage it looks as if a switch that controls access to the compute BMC (system management) ports didn't come back up correctly, so I'm blocked on further progress until that is resolved. The MOC will have someone out at the data center tomorrow to remediate the switches. If that restores access, we're all set to move forward. If not, I may end up driving out myself in order to fiddle with iDRAC configuration on the system consoles. I'll update this issue tomorrow to let you know where things stand.
Switch problem was corrected on Friday, but we're still hitting what looks like an installer bug. Currently engaging with Nir on the assisted-installer team around https://bugzilla.redhat.com/show_bug.cgi?id=1991738.
We've turned that into a new bug based on a bunch of time spent in diagnostics this morning: https://bugzilla.redhat.com/show_bug.cgi?id=1994657
The cluster is up! Attached to this comment are a kubeconfig file and the kubeadmin password, both GPG-encrypted to me, @tcoufal, and @4n4nd (using our Red Hat addresses).
@HumairAK I couldn't find a key for either your redhat or hotmail addresses (searching https://keys.openpgp.org/).
There were a couple of workers that failed to boot during the install; here's what we have at the moment:
$ oc get node
NAME STATUS ROLES AGE VERSION
oct-03-00-compute Ready worker 87m v1.21.1+051ac4f
oct-03-01-compute Ready worker 85m v1.21.1+051ac4f
oct-03-03-compute Ready worker 86m v1.21.1+051ac4f
oct-03-04-compute Ready worker 87m v1.21.1+051ac4f
oct-03-07-compute Ready worker 87m v1.21.1+051ac4f
oct-03-08-compute Ready worker 87m v1.21.1+051ac4f
oct-03-09-compute Ready worker 87m v1.21.1+051ac4f
oct-03-10-compute Ready worker 86m v1.21.1+051ac4f
oct-03-11-compute Ready worker 85m v1.21.1+051ac4f
oct-03-12-compute Ready worker 88m v1.21.1+051ac4f
oct-03-13-compute Ready worker 87m v1.21.1+051ac4f
oct-03-14-compute Ready worker 88m v1.21.1+051ac4f
oct-03-15-compute Ready worker 87m v1.21.1+051ac4f
oct-03-17-compute Ready worker 87m v1.21.1+051ac4f
oct-03-19-compute Ready worker 87m v1.21.1+051ac4f
oct-03-20-compute Ready worker 87m v1.21.1+051ac4f
oct-03-21-compute Ready worker 89m v1.21.1+051ac4f
oct-03-22-compute Ready worker 89m v1.21.1+051ac4f
oct-03-23-compute Ready worker 87m v1.21.1+051ac4f
oct-03-24-compute Ready worker 88m v1.21.1+051ac4f
oct-03-25-compute Ready worker 88m v1.21.1+051ac4f
oct-03-26-compute Ready worker 89m v1.21.1+051ac4f
oct-03-27-compute Ready worker 88m v1.21.1+051ac4f
oct-03-28-compute Ready worker 89m v1.21.1+051ac4f
oct-03-29-compute Ready worker 88m v1.21.1+051ac4f
oct-03-31-compute Ready worker 87m v1.21.1+051ac4f
oct-10-04-control Ready master 107m v1.21.1+051ac4f
oct-10-05-control Ready master 88m v1.21.1+051ac4f
oct-10-06-control Ready master 107m v1.21.1+051ac4f
The console url is https://console-openshift-console.apps.smaug.na.operate-first.cloud/.
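The hosts that failed to boot can be identified by diffing the hardware inventory against the node list above. A quick sketch, with the hostname suffixes transcribed from the two listings (by this diff, three compute hosts are absent):

```python
# Compute hosts from the hardware inventory (by numeric suffix, for brevity).
inventory = {f"oct-03-{n:02d}-compute" for n in
             [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
              17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31]}

# Workers currently registered with the cluster (from `oc get node`).
registered = {f"oct-03-{n:02d}-compute" for n in
              [0, 1, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15,
               17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31]}

missing = sorted(inventory - registered)
print(missing)  # ['oct-03-02-compute', 'oct-03-05-compute', 'oct-03-06-compute']
```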
Woot! Congrats! Great work as usual!
Cheers, Bill
I'm running a cluster update right now to bring it up to the latest 4.8.x release (4.8.4).
This is great news! 🎉
Some systems seem to get stuck rebooting; see https://github.com/CCI-MOC/ops-issues/issues/341. I'll see what's involved in disabling PXE on these NICs tomorrow.
@HumairAK I re-encrypted the files in the earlier comment.
Smaug is up and running and we're no longer blocked on storage (see https://github.com/CCI-MOC/ops-issues/issues/390 for recent details), so I declare this cluster viable.
This is the tracking issue for the work involved in getting OpenShift installed on the new hardware. I'll update this shortly with information about the available hardware, etc.