oracle / terraform-provider-oci

Terraform Oracle Cloud Infrastructure provider
https://www.terraform.io/docs/providers/oci/
Mozilla Public License 2.0

OKE - Ability to Set Node Taints #1504

Open steve-gray opened 2 years ago

steve-gray commented 2 years ago

Description

Ability to set node taints on OKE node pools, allowing workloads to be split across nodes by role. Today this kind of split is only possible by manually tainting nodes (onerous) or by using labels plus anti-affinity scheduling rules on every workload to keep all other pods off those nodes.
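
For context, the label half of that workaround can already be expressed on the node pool via initial_node_labels; only the taint half still needs manual kubectl taint or per-workload affinity rules. A minimal sketch (required node pool arguments elided):

resource "oci_containerengine_node_pool" "pool1" {
  # ... cluster_id, compartment_id, node_shape, etc. ...

  # Labels can be set declaratively today, so selected workloads can target
  # these nodes...
  initial_node_labels {
    key   = "role"
    value = "special"
  }

  # ...but there is no taint equivalent, so nothing keeps other pods off the
  # pool without manual `kubectl taint nodes <node> special=true:NoSchedule`
  # or anti-affinity rules on every other workload.
}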

New or Affected Resource(s)

oci_containerengine_node_pool

Potential Terraform Configuration


  taint {
    key    = "special"
    value  = "true"
    effect = "PREFER_NO_SCHEDULE"
  }

References

This has been present in AWS and other cloud providers' Terraform modules for a while. Examples of prior art:

bbenlazreg commented 1 year ago

Any updates on this?

steve-gray commented 1 year ago

You can force this in by setting kubelet-extra-args in the cloud-init script as a workaround, @bbenlazreg - that's what we're doing. It works well enough, but it means we now have to template the cloud-init script when creating OKE clusters, which isn't great.
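
A minimal sketch of that templating, assuming a hypothetical cloud-init template oke-init.sh.tftpl and a node_taints variable (neither is provided by the provider; fuller inline examples follow below):

variable "node_taints" {
  type        = string
  description = "Value for kubelet --register-with-taints, e.g. \"special=true:NoSchedule\""
}

resource "oci_containerengine_node_pool" "pool1" {
  # ... required node pool arguments elided ...

  node_metadata = {
    # Render the custom cloud-init script with the taint baked in; the template
    # downloads the stock OKE init script and passes --kubelet-extra-args, much
    # like the inline heredoc examples later in this thread.
    user_data = base64encode(templatefile("${path.module}/oke-init.sh.tftpl", {
      node_taints = var.node_taints
    }))
  }
}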

jkrajniak commented 1 year ago

@steve-gray could you provide a sample tf with this approach?

manics commented 1 year ago

@jkrajniak try something like this

resource "oci_containerengine_node_pool" "pool1" {
  ...

  node_metadata = {
    # https://blogs.oracle.com/cloud-infrastructure/post/container-engine-for-kubernetes-custom-worker-node-startup-script-support
    # https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengusingcustomcloudinitscripts.htm
    user_data = base64encode(<<-EOT
      #!/bin/bash
      curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
      bash /var/run/oke-init.sh --kubelet-extra-args "--register-with-taints=${var.kubernetes-pool1-taint}"
      EOT
    )
  }

  ...
}
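
Here var.kubernetes-pool1-taint is just the value handed to kubelet's --register-with-taints flag, i.e. key=value:Effect (comma-separated for multiple taints), so something like:

variable "kubernetes-pool1-taint" {
  type        = string
  default     = "special=true:NoSchedule"
  description = "Taint(s) for kubelet --register-with-taints, in key=value:Effect form"
}
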
winston0410 commented 1 year ago

@manics I have checked the docs and tried to apply your snippet. The apply worked, but the taint does not appear on the node. Is your snippet still working for you?

manics commented 1 year ago

@winston0410 I haven't tried it recently. The last time I deployed this was with version 4.101.0 of registry.terraform.io/hashicorp/oci

tkellen commented 5 months ago

This does the trick currently:

node_metadata = {
  user_data = base64encode(<<-EOT
    #!/bin/bash
    export KUBELET_EXTRA_ARGS="--register-with-taints=node.wescaleout.cloud/routing=true:NoSchedule"
    curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode > /var/run/oke-init.sh
    bash /var/run/oke-init.sh
    EOT
  )
}
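
Note the difference from the earlier snippet: this exports KUBELET_EXTRA_ARGS into the environment instead of passing --kubelet-extra-args to the init script. Judging from the oke-install.sh excerpt quoted below, either approach should end up in the same place, since the script only overwrites KUBELET_EXTRA_ARGS when the flag is present and otherwise keeps whatever value it inherits (KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS:-}").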

The content of http://169.254.169.254/opc/v2/instance/metadata/oke_init_script:

#!/usr/bin/env bash
set -x
set -e
set -o pipefail

v1_sha_file="ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz.sha256"
v2_sha_file=$(echo "ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz" | sed 's/\.tgz/-v2\.tgz.sha256/')

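# Verify the downloaded bundle's signature ($1) against the OKE artifact
# signing public key ($2) for the given file ($3); FIPS mode is only forced
# when not running Oracle Linux 8.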
verifyChecksum() {
    if grep 'PRETTY_NAME="Oracle Linux Server 8\..*"' /etc/os-release >/dev/null; then
        openssl dgst -sha256 -signature $1 -verify $2 $3
    else
        OPENSSL_FIPS=1 openssl dgst -sha256 -signature $1 -verify $2 $3
    fi
}

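# Fetch the ansible bootstrap bundle and the artifact signing key from
# instance metadata, then check the bundle against the v1 signature file,
# falling back to the v2 signature file.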
downloadAnsible() {
    curl -v --fail -L0 https://objectstorage.us-chicago-1.oraclecloud.com/n/odx-oke/b/tkw-cloud-init-prd-0/o/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz -o /var/run/tkw/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz &&
    curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_artifact_signing_key > /var/run/tkw/oke-signing.pub &&
    if curl -v --fail -L0 https://objectstorage.us-chicago-1.oraclecloud.com/n/odx-oke/b/tkw-cloud-init-prd-0/o/$v1_sha_file -o /var/run/tkw/$v1_sha_file && \
    verifyChecksum /var/run/tkw/$v1_sha_file /var/run/tkw/oke-signing.pub /var/run/tkw/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz; then
        echo "Ansible bundle is being signed with artifact-signing-key"
    elif curl -v --fail -L0 https://objectstorage.us-chicago-1.oraclecloud.com/n/odx-oke/b/tkw-cloud-init-prd-0/o/$v2_sha_file -o /var/run/tkw/$v2_sha_file && \
    verifyChecksum /var/run/tkw/$v2_sha_file /var/run/tkw/oke-signing.pub /var/run/tkw/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz; then
        echo "Ansible bundle is being signed with artifact-signing-key-v2"
    else
        return 1
    fi
}

exec &> >(tee -ia /var/run/oke-init.log)
if [ ! -f /var/log/cloud-init-output.log ]; then
    exec &> >(tee -ia /var/log/cloud-init-output.log)
fi
exec 2>&1

if [ -f /etc/.oke_init_complete ]; then
    echo "OKE provisioning already completed... Exiting..."
    exit 0
fi

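# Prefer the installer baked into the image; otherwise download and unpack
# the ansible bundle and run its bootstrap script, retrying on failure.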
if [ -f /etc/oke/oke-install.sh ]; then
    until bash -x "/etc/oke/oke-install.sh" "$@"
    do
        echo "oke-install failed...retrying in 5s..."
        sleep 5
    done
else
    mkdir -p /var/run/tkw
    rm -rf /var/run/tkw/*
    until downloadAnsible
    do
        echo "Failed to download TKW ansible bundle...retrying in 5s"
        rm -rf /var/run/tkw/*
        mkdir -p /var/run/tkw
        sleep 5
    done
    tar -xzvf /var/run/tkw/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz -C /var/run/tkw
    until bash -x "/var/run/tkw/bootstrap.sh" "$@"
    do
        echo "bootstrap failed...retrying in 5s..."
        sleep 5
    done
fi

The content of /etc/oke/oke-install.sh on Oracle-Linux-8.9-2024.01.26-0-OKE-1.28.2-679:

#!/bin/bash
set -xe
set -o pipefail

echo "$(date) Starting OKE bootstrap"

# Load necessary functions that will be used later in the script
source /etc/oke/oke-functions.sh

# Allow user to specify arguments through custom cloud-init
while [[ $# -gt 0 ]]; do
  key="$1"
  case "$key" in
    --kubelet-extra-args)
        export KUBELET_EXTRA_ARGS="$2"
        shift
        shift
        ;;
    --cluster-dns)
        export CLUSTER_DNS="$2"
        shift
        shift
        ;;
    --apiserver-endpoint)
        export APISERVER_ENDPOINT="$2"
        shift
        shift
        ;;
    --kubelet-ca-cert)
        export KUBELET_CA_CERT="$2"
        shift
        shift
        ;;
    *) # Ignore unsupported args
        shift
        ;;
  esac
done

export OKE_BOOTSTRAP_METRICS_FILE_PATH="/etc/oke/metric.py"

# Captures the start time of worker node bootstrapping. Avoid placing any code that can be considered part of node
# bootstrapping above this line.
bootstrap_start_time=$(time_in_ms)

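# Apply defaults for any values not passed as arguments above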
KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS:-}"
CLUSTER_DNS="${CLUSTER_DNS:-}"
APISERVER_ENDPOINT="${APISERVER_ENDPOINT:-$(get_apiserver_host)}"
KUBELET_CA_CERT="${KUBELET_CA_CERT:-}"

# Location of proxymux config and drop-in for proxymux config
PROXYMUX_CONFIG_PATH="/etc/proxymux/config.yaml"
PROXYMUX_CERTS_SERVICE_D_PATH="/etc/systemd/system/proxymux-certs.service.d"
mkdir -p "${PROXYMUX_CERTS_SERVICE_D_PATH}"

# Execute NPWF/BYON-specific logic
if [[ -n "$(get_oke_pool_id)" ]]; then
  service_env_rule "PROXYMUX_ENDPOINT" "certs" > "${PROXYMUX_CERTS_SERVICE_D_PATH}"/10_cloud.conf
  env_rule "PROXYMUX_ARGS" "--config ${PROXYMUX_CONFIG_PATH}" >> "${PROXYMUX_CERTS_SERVICE_D_PATH}"/10_cloud.conf
  get_bootstrap_kubelet_conf >/etc/kubernetes/bootstrap-kubelet.conf
  get_kubelet_client_ca >/etc/kubernetes/ca.crt
elif [[ -n "$APISERVER_ENDPOINT" && -n "$KUBELET_CA_CERT" ]]; then
  # Use the bootstrap endpoint to allow the BYON node to attempt to join the cluster
  service_env_rule "PROXYMUX_ENDPOINT" "bootstrap" > "${PROXYMUX_CERTS_SERVICE_D_PATH}"/10_cloud.conf
  env_rule "PROXYMUX_ARGS" "--server-host ${APISERVER_ENDPOINT}" >> "${PROXYMUX_CERTS_SERVICE_D_PATH}"/10_cloud.conf
  echo "$KUBELET_CA_CERT" | base64 -d > /etc/kubernetes/ca.crt
  get_oke_k8version > /etc/oke/oke-k8s-version
else
  echo "--apiserver-endpoint and/or --kubelet-ca-cert args must be set"
  exit 1
fi

# Get the pause image for the given region/realm/k8s version and populate the crio config
REGION="$(get_region)"
REALM="$(get_realm)"
K8S_VERSION=$(get_oke_k8version | awk -F'v' '{print $NF}')
PAUSE_IMAGE=$(get_pause_image "$REGION" "$REALM" "$K8S_VERSION")
sed -i s,"PAUSE_IMAGE_PLACEHOLDER","$PAUSE_IMAGE",g /etc/crio/crio.conf
export K8S_VERSION PAUSE_IMAGE

# Get instance ocid for proxymux and kubelet configs
INSTANCE_ID="$(get_instance_id)"
export INSTANCE_ID

# Get info needed to populate the proxymux config
PROXYMUX_PORT="$(get_proxymux_port)"
TM_ID="$(get_oke_tm)"
SHORT_CLUSTER_ID="$(get_cluster_label)"
PRIVATE_NODE_IP=$(get_private_ip)
TENANCY_ID="$(get_tenancy_id)"
GPU=false
if [[ "$(get_shape)" == *"GPU"* ]]; then
  NET_INF=eno2
  GPU=true
else
  NET_INF=ens3
fi

# Populate the proxymux config
cat > $PROXYMUX_CONFIG_PATH << EOF

node-id: ${INSTANCE_ID}
net-inf: ${NET_INF}

server-addr: https://${APISERVER_ENDPOINT}:${PROXYMUX_PORT}
tm-id: ${TM_ID}
cluster-id: ${SHORT_CLUSTER_ID}
public-ip-address:
private-ip-address: ${PRIVATE_NODE_IP}
node-name: ${PRIVATE_NODE_IP}
cert-path: /var/lib/kubelet/pki
tenancy-id: ${TENANCY_ID}
bind-addr: 172.16.11.1:80
oci-realm: ${REALM}
EOF

# Get info needed to populate the kubelet config
KUBELET_CONFIG=/etc/kubernetes/kubelet-config.json

# Add kubelet args for ONSRs
IS_ONSR="$(get_oke_is_onsr)"
if [[ $IS_ONSR == "true" ]];then
  if (semantic_version_lt "$K8S_VERSION" "1.24.0");then
    echo "$(jq '. += {"streamingConnectionIdleTimeout": "5m", "featureGates": "DynamicKubeletConfig=false"}' $KUBELET_CONFIG)" > $KUBELET_CONFIG
  else
    echo "$(jq '. += {"streamingConnectionIdleTimeout": "5m"}' $KUBELET_CONFIG)" > $KUBELET_CONFIG
  fi
fi

# Get node labels placed by OKE, including user-specified initial node labels
NODE_LABELS="$(get_node_labels)"
export NODE_LABELS

# Get default kubelet args
KUBELET_DEFAULT_ARGS="$(get_kubelet_default_args)"
export KUBELET_DEFAULT_ARGS

MAX_PODS="$(get_max_pods)"
NATIVE_POD_NETWORKING="$(get_native_pod_networking)"
if [[ -n ${MAX_PODS} && -n ${NATIVE_POD_NETWORKING} ]]; then
  # Append max-pods to kubelet args from oke-max-pods if using OCI VCN IP Native CNI. Kubelet only cares about the last flag if flags repeat.
  KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS} --max-pods ${MAX_PODS}"
fi

IS_PREEMPTIBLE="$(get_is_preemptible)"
GPU_TAINT="nvidia.com/gpu=:NoSchedule"
PREEMPTIBLE_TAINT="oci.oraclecloud.com/oke-is-preemptible=:NoSchedule"
# Add taint for GPU nodes and preemptible instances. If customer specifies additional taints through kubelet-extra-args, then merge the taints
if [[ "$GPU" == "true" || "$IS_PREEMPTIBLE" == "true" ]]; then
  LOCAL_TAINT=""
  if [[ "$GPU" == "true" ]]; then
    LOCAL_TAINT="${GPU_TAINT}"
  fi
  if [[ "$IS_PREEMPTIBLE" == "true" ]]; then
    if [[ -n "$LOCAL_TAINT" ]]; then
      LOCAL_TAINT="${LOCAL_TAINT},${PREEMPTIBLE_TAINT}"
    else
      LOCAL_TAINT="${PREEMPTIBLE_TAINT}"
    fi
  fi
  NODE_TAINTS=$(echo "$KUBELET_EXTRA_ARGS" | { grep -o -E -- '--register-with-taints=[^ ]+' || true; })
  if [[ -n "$NODE_TAINTS" ]]; then
    NODE_TAINTS="${NODE_TAINTS},${LOCAL_TAINT}"
    KUBELET_EXTRA_ARGS=$(echo "$KUBELET_EXTRA_ARGS" | sed 's/--register-with-taints[^ ]\+//')
    KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS} ${NODE_TAINTS}"
  else
    KUBELET_DEFAULT_ARGS="${KUBELET_DEFAULT_ARGS} --register-with-taints=${LOCAL_TAINT}"
  fi
fi

# Path for kubelet drop-in files
KUBELET_SERVICE_D_PATH="/etc/systemd/system/kubelet.service.d"
mkdir -p "$KUBELET_SERVICE_D_PATH"

# Store default kubelet args and extra kubelet args in environment variables to be used by kubelet
service_env_rule "KUBELET_DEFAULT_ARGS" "$KUBELET_DEFAULT_ARGS" > "${KUBELET_SERVICE_D_PATH}"/kubelet-default-args.conf
service_env_rule "KUBELET_EXTRA_ARGS" "$KUBELET_EXTRA_ARGS" > "${KUBELET_SERVICE_D_PATH}"/kubelet-extra-args.conf

# Disable swap volumes
sed -i '/swap/ s/^\(.*\)$/# \1/g' /etc/fstab
swapoff -a

# Enable and restart necessary systemd services
daemon_reload
enable_and_restart "$(get_container_runtime_service)"
proxymux_certs_start_time=$(time_in_ms)
enable_and_restart 'proxymux-certs'
if [[ -n "${proxymux_certs_start_time}" ]]; then
    emit_elapsed_time_metric "oke.workerNode.softwareBootstrap.ProxymuxCertsStart.Time" "${proxymux_certs_start_time}"
fi

# Handle clusterDNS and providerID here. User-specified clusterDNS will have the highest priority. Otherwise, use clusterDNS from ansible
# args for NPWFs or proxymux endpoint for BYON. The proxymux endpoint will place clusterDNS in /etc/oke/oke-cluster-dns by default
if [[ -z "$CLUSTER_DNS" ]]; then
  CLUSTER_DNS_PATH="/etc/oke/oke-cluster-dns"
  export CLUSTER_DNS_PATH
  CLUSTER_DNS="$(get_cluster_dns)"
fi
export CLUSTER_DNS
echo "$(jq --arg CLUSTER_DNS "$CLUSTER_DNS" --arg INSTANCE_ID "$INSTANCE_ID" '. += {"clusterDNS": [$CLUSTER_DNS], "providerID": $INSTANCE_ID}' $KUBELET_CONFIG)" > ${KUBELET_CONFIG}

daemon_reload
enable_and_restart 'kubelet'
enable_and_restart 'systemd-journald'
if [[ "$GPU" == "true" && -f /etc/systemd/system/nvidia-modprobe.service ]]; then
  enable_and_restart 'nvidia-modprobe'
fi
if [[ "$GPU" == "true" && -f /etc/systemd/system/nvidia-persistenced.service ]]; then
  enable_and_restart 'nvidia-persistenced'
fi
enable_and_restart 'kubelet-monitor'
enable_and_restart 'kube-container-runtime-monitor'
sudo systemctl enable oke-node-startup-cmds

# Captures the end time of worker node bootstrapping. Avoid placing any code that can be considered part of node
# bootstrapping below this line.
if [[ -n "$bootstrap_start_time" ]]; then
    emit_elapsed_time_metric "oke.workerNode.softwareBootstrap.Time" "${bootstrap_start_time}"
fi

echo "$(date) Finished OKE bootstrap"