terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws

No default networking add-ons: Terraform waiting for the nodes to be in Ready state (question) #3106

Open mmshin opened 1 month ago

mmshin commented 1 month ago

Hi,

So we're trying to use the newly supported argument to disable the default networking add-ons. Before this, we were doing lots of local-exec to uninstall and clean up vpc-cni, kube-proxy, and coredns because we're using alternative add-ons.

Current testing: the module definition creates a cluster without any default networking add-ons and deploys a managed node group. Terraform gets stuck during apply because the nodes are in a NotReady state, and we assume that's because of the missing CNI add-on. In our current flow, we have a depends_on to wait for the eks module to finish before installing Cilium.

Is there any other way to do this without splitting the EKS cluster creation and managed node group just to insert Cilium installation in the middle?

The second solution we're thinking of is to add a time_sleep trigger like the one you did for the default add-ons. So basically, Cilium wouldn't depend on the whole module but only on a time_sleep, just to allow enough time before the managed node groups get created.
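For reference, a minimal sketch of the setup we're testing (assuming bootstrap_self_managed_addons is the argument in question; names, versions, and sizes are illustrative):

```hcl
# Sketch only - assumes the module's bootstrap_self_managed_addons flag is the
# "disable default networking add-ons" argument being discussed.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "example"
  cluster_version = "1.30"

  # Skip the default vpc-cni / kube-proxy / coredns installation
  bootstrap_self_managed_addons = false

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  eks_managed_node_groups = {
    default = {
      instance_types = ["m6i.large"]
      min_size       = 1
      max_size       = 3
      desired_size   = 2
    }
  }
}
```

With the managed node group defined in the same module, this is exactly where the apply hangs: without a CNI, the nodes never report Ready.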

bryantbiggs commented 1 month ago

Is there any other way to do this without splitting the EKS cluster creation and managed node group just to insert Cilium installation in the middle?

No - you would have to break this up into steps. Roughly:

  1. Deploy control plane without default addons
  2. Deploy Cilium, kube-proxy, and CoreDNS - not via the EKS add-on API, because the API waits for those add-ons to reach a running state and will otherwise mark them as failed/degraded
  3. Deploy nodes/compute

You might be able to get away with:

  1. Deploy control plane without default addons
  2. Deploy Cilium
  3. Deploy nodes/compute + kube-proxy and CoreDNS - they will fail for a while until the nodes come up, but it should succeed eventually
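For illustration, the kube-proxy/CoreDNS half of that second sequence could look roughly like this, using the standalone aws_eks_addon resource rather than the module's cluster_addons input (minimal, untested sketch):

```hcl
# Sketch only: register kube-proxy and CoreDNS through the EKS add-on API
# alongside the compute. Terraform waits on these, and they may sit in a
# degraded state until the nodes join, then converge.
resource "aws_eks_addon" "kube_proxy" {
  cluster_name = module.eks.cluster_name
  addon_name   = "kube-proxy"
}

resource "aws_eks_addon" "coredns" {
  cluster_name = module.eks.cluster_name
  addon_name   = "coredns"
}
```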

mmshin commented 1 month ago

So we're currently deploying Cilium using helm_release with wait set to false. I ended up testing it with time_sleep:

  1. Deploy control plane without default addons
  2. Add time_sleep similar to this
  3. Deploy Cilium (with depends_on time_sleep) and all the other add-ons

In my terraform apply, both Cilium and the managed node groups were created simultaneously. But because Cilium was faster to deploy, the nodes were Ready just in time.
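Roughly what I tested (a sketch; the time_sleep mirrors the module's internal dataplane wait and only references cluster outputs, so it does not wait on the node groups):

```hcl
# Sketch of the tested flow; durations and values are illustrative.
resource "time_sleep" "cluster_ready" {
  create_duration = "30s"

  # Referencing only cluster outputs (not the whole module) means this starts
  # as soon as the control plane exists, while node groups are still creating.
  triggers = {
    cluster_name     = module.eks.cluster_name
    cluster_endpoint = module.eks.cluster_endpoint
  }
}

resource "helm_release" "cilium" {
  name       = "cilium"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  namespace  = "kube-system"

  # Don't block the apply; Cilium finishes coming up while the nodes are created.
  wait = false

  depends_on = [time_sleep.cluster_ready]
}
```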

Just to confirm, is that the same as the 2nd option you mentioned in your previous comment?

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

mmshin commented 2 weeks ago

@mmshin were you able to succeed? I am stuck on same issue, any info or better terraform you can share please?

Hi, yes. So there were 2 options that I tried.

  1. Add a time_sleep similar to this before the Cilium installation.

  2. But then I realized our helm provider definition already has this dependency, host = module.eks.cluster_endpoint, meaning the time_sleep I added is not needed. And it worked, as long as helm_release.cilium doesn't wait for the whole module to finish.

So in general, Cilium just needed to be sure that the cluster is ready. Then both Cilium and the nodes get created at the same time. Our Cilium installation has wait set to false, so from TF's perspective it was done in 9s. By the time the nodes finished creating, Cilium was already up and the nodes became Ready just in time.
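For context, our provider wiring looks roughly like this (helm provider v2 block syntax; exec auth shown as one common option, details simplified):

```hcl
# Sketch: because the provider references cluster outputs, helm_release.cilium
# implicitly waits for the control plane only, not for the node groups.
provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
    }
  }
}
```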

I hope my explanation makes sense.

RazaGR commented 2 weeks ago

Thanks for the explanation @mmshin. I was actually able to make it work without time_sleep by following the above, but I am facing another issue with Cilium. I can nslookup any domain from a pod, but I can't ping or curl any external domain; internal traffic works fine. All my services are up and running and the coredns logs look fine. I don't have this issue with kube-proxy, so it's definitely something in Cilium.

Do you have an idea what it could be, or could you please share your Cilium values?

This is what I used:

kubeProxyReplacement: "true"
ipam:
  mode: "eni"
egressMasqueradeInterfaces: "eth0"
eni:
  enabled: true
k8sServiceHost: "http://xxxxx"
k8sServicePort: 443
tunnel: disabled
endpointRoutes:
  enabled: true
hubble:
  metrics:
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - icmp
      - http
  relay:
    enabled: true
  ui:
    enabled: true
envoy:
  enabled: true
gatewayAPI:
  enabled: true
routingMode: "native"