RKE version is not supported on the first run, gets fixed on the second run

iTaybb commented 3 years ago

I'm trying to deploy RKE v1.20.6-rancher1-1 with Rancher v2.5.8, which should be supported according to the release notes.

I'm getting the following error:

Error: RKE version is not supported [v1.20.5-rancher1-1 v1.19.10-rancher1-1 ................... ] got v1.20.6-rancher1-1

Weirdly enough, after re-running the terraform plan, it runs fine, so somehow the v1.20.6-rancher1-1 version becomes accepted after some time.

Might be a race condition of some kind? Maybe rancher is not fully available yet?
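
Since a second run succeeds, one stopgap is to retry the apply automatically. The following is only a sketch of a generic retry helper; the function name `retry`, the attempt count, and the delay are illustrative choices, not something the provider offers.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
# (Hypothetical helper for illustration; not part of Terraform or Rancher.)
retry() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if (( attempt >= max_attempts )); then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}

# Example (assumed values): retry the apply up to 3 times, 30 seconds apart.
# retry 3 30 terraform apply -auto-approve
```

This hides the race rather than fixing it, so it is probably best treated as a temporary measure until the underlying readiness check is reliable.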

iTaybb commented 3 years ago

It would seem that when rancher is bootstrapped, it takes some time for the rancher RKE images to become ready, so if you're using terraform to install the rancher instance, bootstrap it, and then attempt to create a cluster, the RKE images might not be ready yet.

By running curl -sSku $TOKEN https://$RANCHER_IP/v3/rkek8ssystemimages | jq -c '.pagination.total' right after bootstrapping I can see:

10:06:19  rancher2_bootstrap.admin (local-exec): 122
10:06:22  rancher2_bootstrap.admin (local-exec): 143
10:06:24  rancher2_bootstrap.admin (local-exec): 163
10:06:27  rancher2_bootstrap.admin (local-exec): 168
10:06:29  rancher2_bootstrap.admin (local-exec): 168
10:06:30  rancher2_bootstrap.admin (local-exec): 168

which shows that the images are still loading.

I suggest that rancher2_bootstrap should check that all the rkek8ssystemimages are loaded through the API.

As a workaround, you can probably run some hacky script like this:

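#!/bin/bash
# Wait until the number of RKE system images reported by Rancher is non-zero
# and unchanged across three consecutive polls.

LAST_LAST_COUNT=-1
LAST_COUNT=-1
while true; do
    COUNT=$(curl -sSku "$TOKEN" "https://$RANCHER_IP/v3/rkek8ssystemimages" | jq -c '.pagination.total')
    echo "$COUNT RKE images loaded."
    # Compare numerically; a string comparison with > would also accept "null".
    [[ "$COUNT" =~ ^[0-9]+$ && "$COUNT" -gt 0 && "$COUNT" == "$LAST_COUNT" && "$COUNT" == "$LAST_LAST_COUNT" ]] && exit 0
    LAST_LAST_COUNT=$LAST_COUNT
    LAST_COUNT=$COUNT
    sleep 1
done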
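
A variation on the script above is to wait for the one version you actually need instead of waiting for the count to stabilize, and to give up after a timeout. This is a sketch under assumptions: the jq filter `.data[].name` assumes each entry in the response carries the version string in a `name` field, which should be checked against your Rancher instance, and the function names are made up for illustration. If the response is paginated, names on later pages would be missed.

```shell
#!/bin/bash
# Succeeds if the exact version string $1 appears on stdin (one name per line).
version_listed() {
    grep -Fxq -- "$1"
}

# Prints one image name per line.
# ASSUMPTION: each entry under .data has a .name field holding the version string.
list_rke_versions() {
    curl -sSk -u "$TOKEN" "https://$RANCHER_IP/v3/rkek8ssystemimages" | jq -r '.data[].name'
}

# wait_for_version VERSION TIMEOUT_SECONDS
# Polls until VERSION is listed, or fails once TIMEOUT_SECONDS have elapsed.
wait_for_version() {
    local version=$1 deadline=$(( SECONDS + $2 ))
    until list_rke_versions | version_listed "$version"; do
        if (( SECONDS >= deadline )); then
            echo "Timed out waiting for $version" >&2
            return 1
        fi
        sleep 2
    done
    echo "$version is available."
}

# Example (assumed timeout of 300 seconds):
# wait_for_version v1.20.6-rancher1-1 300
```

Checking for the exact version avoids guessing when the list has stopped growing, and the timeout keeps a failed bootstrap from blocking the pipeline forever.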

rawmind0 commented 3 years ago

@iTaybb, yes, it seems to be a race condition between the bootstrap finishing and the local cluster becoming active. A fix was added in PR #679: the rancher2_bootstrap resource will wait until the local cluster is active.

rawmind0 commented 3 years ago

PR https://github.com/rancher/terraform-provider-rancher2/pull/679 has been merged. The fix will be available in the next tf provider release.

Please reopen the issue if needed.

bashofmann commented 2 years ago

@rawmind0 Unfortunately this is still/again happening, see https://github.com/rancher/quickstart/issues/196. I can also reproduce this every 10th time or so.

iTaybb commented 2 years ago

The issue is happening again in rancher 2.6.3 and terraform provider v1.22.2.

phillamb168 commented 2 years ago

This may or may not work for you, but my fix was to do the following:

# Initialize Rancher server
resource "rancher2_bootstrap" "admin" {
  depends_on = [
    helm_release.rancher_server
  ]

  provider = rancher2.bootstrap

  password  = var.admin_password
  telemetry = true
}

locals {
  rke_network_plugin  = "canal"
  rke_network_options = null
}

Then, add this:

resource "time_sleep" "wait_60_seconds" {
  depends_on      = [rancher2_bootstrap.admin]
  create_duration = "60s"
}

and make the resource declaration for the workload cluster depend on it:

# Create custom managed cluster for amf
resource "rancher2_cluster" "amf_workload" {
  depends_on = [time_sleep.wait_60_seconds]
  # ... rest of the cluster configuration ...
}

rancher / terraform-provider-rancher2

RKE version is not supported on the first run, gets fixed on the second run #670