kubestellar / kubeflex

A flexible and scalable platform for running Kubernetes control plane APIs.
Apache License 2.0

bug: open-cluster-management in kind-kubeflex often fails to exist #292

Closed clubanderson closed 3 weeks ago

clubanderson commented 1 month ago

Describe the bug

When using the demo env script shown below, I get the following namespaces while the script is waiting for the OCM clients to create a connection with the OCM hub:

kubectl --context kind-kubeflex get ns                                                       
NAME                 STATUS   AGE
default              Active   8m46s
ingress-nginx        Active   8m40s
its1-system          Active   5m15s
kube-node-lease      Active   8m46s
kube-public          Active   8m46s
kube-system          Active   8m46s
kubeflex-system      Active   7m12s
local-path-storage   Active   8m41s
wds1-system          Active   5m9s

The script has the following line:

wait-for-cmd '(($(wrap-cmd kubectl --context kind-kubeflex get deployments.apps -n open-cluster-management -o jsonpath='{.status.readyReplicas}' cluster-manager 2>/dev/null || echo 0) >= 1))'

Often there is no open-cluster-management namespace available.

Steps To Reproduce

  1. run demo env bash script:
#!/bin/bash                                                                                                                   

##########################################
function wait-for-cmd() (
    cmd="$@"
    wait_counter=0
    while ! (eval "$cmd") ; do
        if (($wait_counter > 100)); then
            echo "Failed to ${cmd}."
            exit 1
        fi
        ((wait_counter += 1))
        sleep 5
    done
)
##########################################

kind delete cluster --name kubeflex
kind delete cluster --name cluster1
kind delete cluster --name cluster2
kubectl config delete-context kind-kubeflex
kubectl config delete-context cluster1
kubectl config delete-context cluster2

export KUBESTELLAR_VERSION=0.24.0

bash <(curl -s https://raw.githubusercontent.com/kubestellar/kubestellar/v0.24.0/scripts/create-kind-cluster-with-SSL-passthrough.sh) --name kubeflex --port 9443

helm upgrade --install ks-core oci://ghcr.io/kubestellar/kubestellar/core-chart \
    --version $KUBESTELLAR_VERSION \
    --set-json='ITSes=[{"name":"its1"}]' \
    --set-json='WDSes=[{"name":"wds1"}]'

kubectl config delete-context its1 || true
kflex ctx its1
kubectl config delete-context wds1 || true
kflex ctx wds1

: wait for OCM cluster manager up
echo OCM time
wait-for-cmd '(($(wrap-cmd kubectl --context kind-kubeflex get deployments.apps -n open-cluster-management -o jsonpath='{.status.readyReplicas}' cluster-manager 2>/dev/null || echo 0) >= 1))'

: set flags to "" if you have installed KubeStellar on an OpenShift cluster
flags="--force-internal-endpoint-lookup"
clusters=(cluster1 cluster2);
for cluster in "${clusters[@]}"; do
   kind create cluster --name ${cluster}
   kubectl config rename-context kind-${cluster} ${cluster}
   clusteradm --context its1 get token | grep '^clusteradm join' | sed "s/<cluster_name>/${cluster}/" | awk '{print $0 " --context '${cluster}' --singleton '${flags}'"}' | sh
done

watch kubectl --context its1 get csr
  2. wait for the section on OCM to start
  3. check for the open-cluster-management ns with 'kubectl --context kind-kubeflex get ns'
  4. observe that the open-cluster-management ns is missing

Expected Behavior

If the open-cluster-management ns is mandatory, more checks should be put in place inside kubeflex to ensure its existence before marking kubeflex as successfully installed.
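
A readiness gate of the kind requested could be as simple as polling for the namespace before declaring success. A minimal sketch, assuming the context and namespace names from this report; `wait_for_ns` is a hypothetical helper, not an existing KubeFlex function:

```shell
#!/bin/bash
# Hedged sketch of an install-time gate: fail if a namespace that later steps
# depend on never appears. wait_for_ns is a hypothetical helper name.
wait_for_ns() {
    local ctx="$1" ns="$2" max_tries="${3:-60}" tries=0
    until kubectl --context "$ctx" get ns "$ns" >/dev/null 2>&1; do
        if (( ++tries > max_tries )); then
            echo "namespace $ns never appeared in context $ctx" >&2
            return 1
        fi
        sleep "${WAIT_INTERVAL:-5}"
    done
}

# Example invocation (commented out; needs a live cluster):
# wait_for_ns its1 open-cluster-management
```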

Additional Context

No response

pdettori commented 1 month ago

@clubanderson I see now that the wait-for-cmd command has the wrong context for checking the presence of the OCM namespace (my bad). Let me update the script so you can try again.
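
For reference, a corrected wait would target the its1 context (where the cluster-manager Deployment actually runs) instead of kind-kubeflex. A sketch using the same wait-for-cmd helper as the demo script; the kubectl line is illustrative and needs a live cluster:

```shell
# Sketch: the wait-for-cmd helper from the demo script, reused as-is.
function wait-for-cmd() (
    cmd="$@"
    wait_counter=0
    while ! (eval "$cmd") ; do
        if (($wait_counter > 100)); then
            echo "Failed to ${cmd}." >&2
            exit 1
        fi
        ((wait_counter += 1))
        sleep 5
    done
)

# Corrected check against the its1 context (commented out; needs a live cluster):
# wait-for-cmd "[[ \$(kubectl --context its1 get deployments.apps -n open-cluster-management -o jsonpath='{.status.readyReplicas}' cluster-manager 2>/dev/null) -ge 1 ]]"
```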

clubanderson commented 1 month ago

No need to; I just added a switch to its1. wait-for-cmd can stay generic.

I added the last line here to switch back to its1 before running the OCM items - it worked on the first try - I will continue to try a few more times:

kubectl config delete-context its1 || true
kflex ctx its1
kubectl config delete-context wds1 || true
kflex ctx wds1
kflex ctx its1

pdettori commented 1 month ago

@clubanderson I made some more improvements to the script; you are welcome to try the latest, attached here: test-ks-install.sh.zip

MikeSpreitzer commented 1 month ago

@clubanderson: Looking at my copy of the script, that additional line should make no difference because the "its1" context has already been created (so switching will be fast, require no waiting for anything or even creating anything) and the later commands are not sensitive to the current context at that point. So I suspect you are seeing a coincidence.

MikeSpreitzer commented 1 month ago

@pdettori: I still think that set -e would help too.

I am not clear on what that awk after wc -l is doing.

mspreitz@mjs13 test4 % echo 5 | awk '{$1=$1};1'
5
mspreitz@mjs13 test4 % echo | awk '{$1=$1};1' 

mspreitz@mjs13 test4 % 

That script is still vulnerable to junk in the kubeflex extension in the kubeconfig file. I would add the preventative from https://docs.kubestellar.io/unreleased-development/direct/get-started/#cleanup-from-previous-runs

I would use the latest test release of KubeStellar, it is less buggy than the latest regular release.

In the Helm command, I would include the verbosity setting as in https://docs.kubestellar.io/unreleased-development/direct/start-from-ocm/#use-core-helm-chart-to-initialize-kubeflex-recognize-its-and-create-wds , particularly since we think that we might be getting a problem report back.

pdettori commented 1 month ago

@MikeSpreitzer - thanks - regarding your question

I am not clear on what that awk after wc -l is doing.

It removes leading and trailing spaces, which were showing up when running that command:

$ echo "  5  " | awk '{$1=$1};1'
5
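
For comparison, a bare `xargs` trims surrounding whitespace as well; a quick sketch of the two equivalent trims:

```shell
# Two equivalent ways to strip the padding that `wc -l` emits on some
# platforms (e.g. macOS). awk rebuilds $0 from its fields with single-space
# separators; bare xargs trims via its word splitting.
n=$(echo "  5  " | awk '{$1=$1};1')
m=$(echo "  5  " | xargs)
echo "[$n] [$m]"   # -> [5] [5]
```
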
clubanderson commented 1 month ago

ok, I have it working now with

#!/bin/bash
# Copyright 2024 The KubeStellar Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Deploys Kubestellar environment for demo purposes.

##########################################
function wait-for-cmd() (
    cmd="$@"
    wait_counter=0
    while ! (eval "$cmd") ; do
        if (($wait_counter > 100)); then
            echo "Failed to ${cmd}."
            exit 1
        fi
        ((wait_counter += 1))
        sleep 5
    done
)

cluster_clean_up() {
    error_message=$(eval "$1" 2>&1)
    if [ $? -ne 0 ]; then
        echo "clean up failed. Error:"
        echo "$error_message"
    fi
}

context_clean_up() {
    output=$(kubectl config get-contexts -o name)

    while IFS= read -r line; do
        if [ "$line" == "kind-kubeflex" ]; then 
            echo "Deleting kind-kubeflex context..."
            kubectl config delete-context kind-kubeflex

        elif [ "$line" == "cluster1" ]; then
            echo "Deleting cluster1 context..."
            kubectl config delete-context cluster1

        elif [ "$line" == "cluster2" ]; then
            echo "Deleting cluster2 context..."
            kubectl config delete-context cluster2

        elif [ "$line" == "its1" ]; then
            echo "Deleting its1 context..."
            kubectl config delete-context its1

        elif [ "$line" == "wds1" ]; then
            echo "Deleting wds1 context..."
            kubectl config delete-context wds1

        fi

    done <<< "$output"
}

checking_cluster() {
    found=false

    while true; do

        output=$(kubectl --context its1 get csr)

        while IFS= read -r line; do

            if echo "$line" | grep -q $1; then
                echo "$1 has been found, approving CSR"
                clusteradm --context its1 accept --clusters "$1"
                found=true
                break
            fi

        done <<< "$output"

        if [ "$found" = true ]; then
            break

        else
            echo "$1 not found. Trying again..."
            sleep 5
        fi

    done
}
##########################################

set -e

export KUBESTELLAR_VERSION=0.24.0
echo -e "KubeStellar Version: ${KUBESTELLAR_VERSION}"

echo -e "Checking that pre-req software is installed..."

curl -s https://raw.githubusercontent.com/kubestellar/kubestellar/v${KUBESTELLAR_VERSION}/hack/check_pre_req.sh | bash -s -- -V kflex ocm helm kubectl docker kind

output=$(curl -s https://raw.githubusercontent.com/kubestellar/kubestellar/v${KUBESTELLAR_VERSION}/hack/check_pre_req.sh | bash -s -- -A -V kflex ocm helm kubectl docker kind)

echo -e "\nStarting environment clean up..."
echo -e "Starting cluster clean up..."

cluster_clean_up "kind delete cluster --name kubeflex"
cluster_clean_up "kind delete cluster --name cluster1"
cluster_clean_up "kind delete cluster --name cluster2"
echo -e "Cluster space clean up has been completed"

echo -e "\nStarting context clean up..."

context_clean_up
echo "Context space clean up completed"

echo -e "\nStarting the process to install KubeStellar core: kind-kubeflex..."

curl -s https://raw.githubusercontent.com/kubestellar/kubestellar/v${KUBESTELLAR_VERSION}/scripts/create-kind-cluster-with-SSL-passthrough.sh | bash -s -- --name kubeflex --port 9443

helm upgrade --install ks-core oci://ghcr.io/kubestellar/kubestellar/core-chart \
    --version $KUBESTELLAR_VERSION \
    --set-json='ITSes=[{"name":"its1"}]' \
    --set-json='WDSes=[{"name":"wds1"}]'

kubectl config delete-context its1 || true
kflex ctx its1
kubectl config delete-context wds1 || true
kflex ctx wds1

# switch back to the its1 context to create 2 remote clusters, add OCM agents to them, and register them back to the KubeStellar core
kflex ctx its1

echo -e "\nWaiting for OCM cluster manager to be ready..."

wait-for-cmd "[[ \$(kubectl --context its1 get deployments.apps -n open-cluster-management -o jsonpath='{.status.readyReplicas}' cluster-manager 2>/dev/null) -ge 1 ]]"

echo -e "\nCreating cluster and context for cluster 1 and 2..."

: set flags to "" if you have installed KubeStellar on an OpenShift cluster
flags="--force-internal-endpoint-lookup"
clusters=(cluster1 cluster2);
for cluster in "${clusters[@]}"; do
   kind create cluster --name ${cluster}
   kubectl config rename-context kind-${cluster} ${cluster}
   clusteradm --context its1 get token | grep '^clusteradm join' | sed "s/<cluster_name>/${cluster}/" | awk '{print $0 " --context '${cluster}' --singleton '${flags}'"}' | sh
done

echo -e "Checking that the CSRs for cluster 1 and 2 appear..."

echo ""
echo "Approving CSR for cluster1 and cluster2..."
checking_cluster cluster1
checking_cluster cluster2

echo ""
echo "Checking the new clusters are in the OCM inventory and label them"
kubectl --context its1 get managedclusters
kubectl --context its1 label managedcluster cluster1 location-group=edge name=cluster1
kubectl --context its1 label managedcluster cluster2 location-group=edge name=cluster2

Note the new line at the end:

wait-for-cmd "[[ \$(kubectl --context its1 get deployments.apps -n open-cluster-management -o jsonpath='{.status.readyReplicas}' cluster-manager 2>/dev/null) -ge 1 ]]"
MikeSpreitzer commented 4 weeks ago

@clubanderson, @pdettori: has this issue been resolved by KubeFlex release 0.7.1 and KubeStellar now requiring that version?

pdettori commented 3 weeks ago

@MikeSpreitzer: as far as I could test, yes.

@clubanderson: ok to close this issue?

clubanderson commented 3 weeks ago

yep - we are beyond this now - closing