rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

RKE2 cluster provisioning fails using the public golden RHEL 8.5 AWS AMI #36731

Open izaac opened 2 years ago

izaac commented 2 years ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug

The cluster stays in the Provisioning state and never becomes Active.

To Reproduce

  1. Activate tech preview to use RKE2 from Rancher
  2. Select AWS
  3. Create a single node cluster or a multiple node cluster with different pools
  4. Configure the pool with enough disk space for the nodes and use the RHEL 8.5 AMI described previously; the SSH user is ec2-user
  5. The security group was open-all
  6. For the root volume type, I tried gp2 (the default) and gp3
  7. Everything else in the RKE2 configuration uses the defaults, such as the k8s version, security, and Calico as the CNI

Result

The cluster is stuck in the Provisioning state. The nodes show this message (from the YAML): message: 'provisioning bootstrap node(s) izb4-e-bbb88bb84-8rcdd: waiting for probes:

Events from the local cluster show FailedMount events

And other FailedMount events

Expected Result

Be able to provision a cluster using the RHEL 8.5 Golden public AMI from Rancher

Screenshots

Screen Shot 2022-03-03 at 8 57 19 AM

Additional context

This is the original AMI. We have private AMIs with Docker installed and networking services configured, and cluster provisioning works with those. This same image works when provisioning RKE1 clusters.

izaac commented 2 years ago

I've made it work with manual intervention: installing the rke2-selinux RPM and disabling the NetworkManager services that prevent rke2-server.service from starting.

For the SELinux RPM install, see https://github.com/rancher/rke2-selinux

sudo systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
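The command above can be wrapped in a small script so it can be previewed before it touches a node. This is only a minimal sketch: the `DRY_RUN` variable and the `run` and `disable_nm_cloud_setup` function names are illustrative assumptions, not part of Rancher or RKE2. Only the `systemctl disable` call and the two unit names come from the comment above.

```shell
#!/bin/sh
# Sketch: disable the nm-cloud-setup units named in the comment above.
# DRY_RUN, run, and disable_nm_cloud_setup are hypothetical helper names.

NM_UNITS="nm-cloud-setup.service nm-cloud-setup.timer"

# In dry-run mode (the default) print the command; otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

disable_nm_cloud_setup() {
  # Word splitting of NM_UNITS is intended: one argument per unit.
  run sudo systemctl disable $NM_UNITS
}

disable_nm_cloud_setup
```

Running it with `DRY_RUN=0` would execute the command for real. As noted later in this thread, a reboot is required for the change to take effect, and the sketch deliberately leaves that step to the administrator.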

izaac commented 2 years ago

All this manual work shouldn't happen and Rancher/RKE2 should take care of it automatically.

brandond commented 2 years ago

I believe @Oats87 is working on selinux support for Rancher-provisioned clusters by allowing for local install of the RPMs instead of the tarball; it was not supported in the initial tech preview.

The nm-cloud-setup issue is interesting; we have taken the position that RKE2 shouldn't enable or disable other system services (and certainly shouldn't reboot the host, as required to disable nm-cloud-setup), and left it up to the administrator to read the documentation: https://docs.rke2.io/known_issues/#networkmanager

Perhaps rancher-system-agent can get away with being more hands-on with the system configuration.
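A less intrusive alternative to changing system services would be to detect the two conditions discussed in this thread and report them before provisioning. The following is a rough sketch under stated assumptions, not how rancher-system-agent or RKE2 actually behaves: the function names and the `SYSTEMCTL`, `GETENFORCE`, and `RPM` override variables are hypothetical and exist only so the check can be exercised without a real RHEL host. It relies on `systemctl is-enabled`, `getenforce`, and `rpm -q` behaving as they do on a stock RHEL system.

```shell
#!/bin/sh
# Sketch of a preflight check for the two issues described in this thread.
# All function names and the *override* variables are illustrative only.

# Succeeds if systemctl reports the unit as enabled.
unit_is_enabled() {
  [ "$(${SYSTEMCTL:-systemctl} is-enabled "$1" 2>/dev/null)" = "enabled" ]
}

# Succeeds if SELinux is enforcing and the rke2-selinux package is absent.
selinux_policy_missing() {
  [ "$(${GETENFORCE:-getenforce} 2>/dev/null)" = "Enforcing" ] &&
    ! ${RPM:-rpm} -q rke2-selinux >/dev/null 2>&1
}

# Prints one WARN line per problem found; returns non-zero if any were found.
preflight() {
  status=0
  for unit in nm-cloud-setup.service nm-cloud-setup.timer; do
    if unit_is_enabled "$unit"; then
      echo "WARN: $unit is enabled"
      status=1
    fi
  done
  if selinux_policy_missing; then
    echo "WARN: SELinux is enforcing but rke2-selinux is not installed"
    status=1
  fi
  return "$status"
}

preflight || echo "node needs manual preparation before provisioning"
```

The check only reports; it does not disable services or install packages, which keeps it in line with the position that those decisions belong to the administrator.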

snasovich commented 2 years ago

@izaac , would https://github.com/rancher/rancher/issues/36509#issuecomment-1055621824 apply here as well? I know it's far from ideal, but installing RPMs and preparing nodes in a more automated fashion sounds like a not-so-small feature, and it may not be feasible to support for all permutations (so we should probably focus on supporting standard images first).

izaac commented 2 years ago

@snasovich totally, documenting it is an option. I can review the docs once they are ready for review; it has to be really visible and clear IMO. We can then close this issue in that case.

Thanks for following up

snasovich commented 2 years ago

This will need to be release noted and probably even added to the support matrix.

@izaac , is there already an AMI that has the necessary changes applied?

izaac commented 2 years ago

@snasovich correct, I did the testing with a private AMI that has the requirements described here.

That made cluster provisioning work when creating the cluster from the Rancher UI.

snasovich commented 2 years ago

@izaac , I was wondering if you could create an AMI that is based on RHEL 8.5 Golden public AMI + minimal changes needed for provisioning to work without manual intervention? We could then reference this AMI in documentation / release notes / support matrix.

izaac commented 2 years ago

@snasovich let me see if we can do that from the QA group. I'm not sure if we have rights to publish public AMIs; I'll investigate.

snasovich commented 2 years ago

We may want to improve this for the 2.6.5 release, so moving to that milestone and removing release-note. If it's still working this way by 2.6.5, it will need to be release noted.

snasovich commented 2 years ago

We need to come up with an approach to address this and similar issues where additional packages are needed for these images.

snasovich commented 2 years ago

After discussion with @Oats87 @thedadams, this will need to be release noted / mentioned in the support matrix for 2.6.5, as the lift will be too big to start installing RPMs as part of provisioning. Moving to Blocked for now.