operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.

Install new operate first cluster - 🐲 Smaug 🔥 #741

Closed: larsks closed this issue 2 years ago

larsks commented 3 years ago

This is a tracking issue for the work involved in getting OpenShift installed on the new hardware. I'll update this shortly with information about the available hardware, etc.

larsks commented 3 years ago

@HumairAK @tumido please chime in with naming suggestions for the new cluster. Also, would you like this in the operate-first.cloud domain?

HumairAK commented 3 years ago

Operate first domain would be nice.

I propose we start naming all our clusters after characters from a creative universe like LOTR, and start by naming this one Gandalf. I would also love to be able to say things like "Sorry users, there's something wrong with gandalf right now, we're working on a fix".

Feel free to ignore this suggestion entirely, I'm terrible with names :)

Paging @durandom @4n4nd if you guys have other ideas.

larsks commented 3 years ago

We're sorry, Smaug isn't flying today.🐉

4n4nd commented 3 years ago

and start by naming this one Gandalf

I love Gandalf but I think this cluster should be dedicated to a Baggins for their sacrifice and also because the ring couldn't corrupt Frodo.

durandom commented 3 years ago

💯 for the operate-first.cloud DNS. @tumido do you have access to the required DNS stuff?

as for the naming, we have Rick from Rick'n'Morty in EMEA. But I also ❤️ 🧙 and 🐉 . Let's be an inclusive community and call the beast Smaug 🔥

tumido commented 3 years ago

🐲 Smaug 🔥

@tumido approves

I wonder which of the workloads on the cluster will be our Arkenstone.

I can help with the DNS; just tell me what you need. 🙂

larsks commented 3 years ago

Okay, I've learned more about the network on which we are deploying these hosts so we can safely pre-allocate some addresses.

@tumido here's what we need to start with:
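
As a rough sketch of the kind of records involved (not the actual allocation): a bare-metal OpenShift install like this typically needs an A record for the API VIP and a wildcard record for the ingress VIP, e.g. api.smaug.na.operate-first.cloud and *.apps.smaug.na.operate-first.cloud. Once the records exist they can be sanity-checked with dig:

$ dig +short api.smaug.na.operate-first.cloud
$ dig +short console-openshift-console.apps.smaug.na.operate-first.cloud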

larsks commented 3 years ago

@durandom was asking about hardware. Here's what we have (also attached as a CSV file):

controller
hostname           role        bmc_address    cpu_model                                    cpu_cores    disk    memory  manufacturer    model
-----------------  ----------  -------------  -----------------------------------------  -----------  ------  --------  --------------  --------------
oct-10-05-control  controller  10.3.10.5      Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz           48     447       384  Dell Inc.       PowerEdge R640
oct-10-04-control  controller  10.3.10.4      Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz           48     447       384  Dell Inc.       PowerEdge R640
oct-10-06-control  controller  10.3.10.6      Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz           48     447       384  Dell Inc.       PowerEdge R640

compute
hostname           role     bmc_address    cpu_model                                    cpu_cores    disk    memory  manufacturer    model
-----------------  -------  -------------  -----------------------------------------  -----------  ------  --------  --------------  ----------------
oct-03-31-compute  compute  10.3.3.31      Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz           32     372       256  Dell Inc.       PowerEdge R620
oct-03-13-compute  compute  10.3.3.13      Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       128  Dell Inc.       PowerEdge R620
oct-03-07-compute  compute  10.3.3.7       Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz           32     372       112  Dell Inc.       PowerEdge R620
oct-03-10-compute  compute  10.3.3.10      Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       128  Dell Inc.       PowerEdge R620
oct-03-03-compute  compute  10.3.3.3       Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       128  Dell Inc.       PowerEdge R620
oct-03-14-compute  compute  10.3.3.14      Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       128  Dell Inc.       PowerEdge R620
oct-03-15-compute  compute  10.3.3.15      Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-04-compute  compute  10.3.3.4       Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz           32     372       128  Dell Inc.       PowerEdge R620
oct-03-20-compute  compute  10.3.3.20      Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz           32     372       256  Dell Inc.       PowerEdge R620
oct-03-22-compute  compute  10.3.3.22      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-11-compute  compute  10.3.3.11      Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       128  Dell Inc.       PowerEdge R620
oct-03-26-compute  compute  10.3.3.26      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-00-compute  compute  10.3.3.0       Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       384  Dell Inc.       PowerEdge R720xd
oct-03-23-compute  compute  10.3.3.23      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-25-compute  compute  10.3.3.25      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-01-compute  compute  10.3.3.1       Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       384  Dell Inc.       PowerEdge R720xd
oct-03-24-compute  compute  10.3.3.24      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-09-compute  compute  10.3.3.9       Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       128  Dell Inc.       PowerEdge R620
oct-03-19-compute  compute  10.3.3.19      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-28-compute  compute  10.3.3.28      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-29-compute  compute  10.3.3.29      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-06-compute  compute  10.3.3.6       Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz           32     372       128  Dell Inc.       PowerEdge R620
oct-03-08-compute  compute  10.3.3.8       Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz           32     372       128  Dell Inc.       PowerEdge R620
oct-03-12-compute  compute  10.3.3.12      Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       128  Dell Inc.       PowerEdge R620
oct-03-27-compute  compute  10.3.3.27      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-21-compute  compute  10.3.3.21      Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       256  Dell Inc.       PowerEdge R620
oct-03-17-compute  compute  10.3.3.17      Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       256  Dell Inc.       PowerEdge R620
oct-03-02-compute  compute  10.3.3.2       Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz           40     372       384  Dell Inc.       PowerEdge R720xd
oct-03-05-compute  compute  10.3.3.5       Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz            32     372       128  Dell Inc.       PowerEdge R620

HumairAK commented 3 years ago

Smaug is a fitting name for this beast

4n4nd commented 3 years ago

this looks awesome! 🤩

larsks commented 3 years ago

After the power outage it looks as if a switch that controls access to the compute BMC (system management) ports didn't come back up correctly, so I'm blocked on further progress until that is resolved. The MOC will have someone out at the data center tomorrow to remediate the switches. If that restores access, we're all set to move forward. If not, I may end up driving out myself in order to fiddle with iDRAC configuration on the system consoles. I'll update this issue tomorrow to let you know where things stand.
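
A quick way to tell whether BMC access is back is to poke one of the compute BMC addresses from the table above; something along these lines (credentials are placeholders) should answer once the switch is healthy:

$ ping -c 3 10.3.3.31
$ ipmitool -I lanplus -H 10.3.3.31 -U <bmc-user> -P <bmc-password> chassis power status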

larsks commented 3 years ago

Switch problem was corrected on Friday, but we're still hitting what looks like an installer bug. Currently engaging with Nir on the assisted installer team around https://bugzilla.redhat.com/show_bug.cgi?id=1991738.

larsks commented 3 years ago

We've turned that into a new bug based on a bunch of time spent in diagnostics this morning: https://bugzilla.redhat.com/show_bug.cgi?id=1994657

larsks commented 3 years ago

The cluster is up! Attached to this comment are a kubeconfig file and the kubeadmin password, both gpg encrypted to each of you (using your redhat addresses).

@HumairAK I couldn't find a key for either your redhat or hotmail addresses (searching https://keys.openpgp.org/).
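
For reference, encrypting to someone whose key is published on https://keys.openpgp.org looks roughly like this (the address below is a placeholder):

$ gpg --keyserver hkps://keys.openpgp.org --search-keys someone@example.com
$ gpg --encrypt --armor --recipient someone@example.com kubeconfig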

There were a couple of workers that failed to boot during the install; here's what we have at the moment:

$ oc get node
NAME                STATUS   ROLES    AGE    VERSION
oct-03-00-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-01-compute   Ready    worker   85m    v1.21.1+051ac4f
oct-03-03-compute   Ready    worker   86m    v1.21.1+051ac4f
oct-03-04-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-07-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-08-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-09-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-10-compute   Ready    worker   86m    v1.21.1+051ac4f
oct-03-11-compute   Ready    worker   85m    v1.21.1+051ac4f
oct-03-12-compute   Ready    worker   88m    v1.21.1+051ac4f
oct-03-13-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-14-compute   Ready    worker   88m    v1.21.1+051ac4f
oct-03-15-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-17-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-19-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-20-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-21-compute   Ready    worker   89m    v1.21.1+051ac4f
oct-03-22-compute   Ready    worker   89m    v1.21.1+051ac4f
oct-03-23-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-03-24-compute   Ready    worker   88m    v1.21.1+051ac4f
oct-03-25-compute   Ready    worker   88m    v1.21.1+051ac4f
oct-03-26-compute   Ready    worker   89m    v1.21.1+051ac4f
oct-03-27-compute   Ready    worker   88m    v1.21.1+051ac4f
oct-03-28-compute   Ready    worker   89m    v1.21.1+051ac4f
oct-03-29-compute   Ready    worker   88m    v1.21.1+051ac4f
oct-03-31-compute   Ready    worker   87m    v1.21.1+051ac4f
oct-10-04-control   Ready    master   107m   v1.21.1+051ac4f
oct-10-05-control   Ready    master   88m    v1.21.1+051ac4f
oct-10-06-control   Ready    master   107m   v1.21.1+051ac4f

The console URL is https://console-openshift-console.apps.smaug.na.operate-first.cloud/.
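
For anyone picking this up: once you've decrypted your copy of the kubeconfig, using it looks roughly like this (filenames illustrative):

$ gpg --decrypt kubeconfig.gpg > kubeconfig
$ export KUBECONFIG=$PWD/kubeconfig
$ oc get nodes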

billburnseh commented 3 years ago

Woot! Congrats! Great work as usual!

Cheers Bill

larsks commented 3 years ago

I'm running a cluster update right now to bring it up to the latest 4.8.x release (4.8.4).
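
For reference, an update like this can be triggered and watched from the CLI roughly as follows (it may equally be started from the console, and the exact invocation here is a sketch):

$ oc adm upgrade --to=4.8.4
$ oc get clusterversion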

4n4nd commented 3 years ago

This is great news! 🎉

larsks commented 3 years ago

Some systems seem to get stuck rebooting; see https://github.com/CCI-MOC/ops-issues/issues/341. I'll see what's involved in disabling PXE on these NICs tomorrow.
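
A possible interim workaround (not necessarily what we'll end up doing) is to force a persistent disk boot through the BMC instead of touching the NIC option ROMs, e.g.:

$ ipmitool -I lanplus -H <bmc-address> -U <bmc-user> -P <bmc-password> chassis bootdev disk options=persistent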

HumairAK commented 3 years ago

@larsks : https://keys.openpgp.org/search?q=humair88%40hotmail.com

larsks commented 3 years ago

@HumairAK I re-encrypted the files in the earlier comment.

larsks commented 2 years ago

Smaug is up and running and we're no longer blocked on storage (see https://github.com/CCI-MOC/ops-issues/issues/390 for recent details), so I declare this cluster viable.