snowex-hackweek / jupyterhub

JupyterHub configuration for SnowEx Hackweek 2021
https://snowex.hackweek.io
MIT License

Summary of Day1 Hiccup (Limits on Number of IP addresses on EKS Cluster) #18

Open scottyhq opened 3 years ago

scottyhq commented 3 years ago

On day 1 we were caught off guard when only ~40 participants could log onto the hub, and we saw these messages in the autoscaler logs:

Long story short: Make sure your VPC network settings offer enough internal IP addresses for your cluster from the start!

```
Launching a new EC2 instance. Status Reason: Could not launch Spot Instances. InsufficientFreeAddressesInSubnet - There are not enough free addresses in subnet 'subnet-0da0f458c2cb44757' to satisfy the requested number of instances. Launching EC2 instance failed.
```

Turns out this error was due to two things:

  1. Every time a user pod starts it requires several internal IP addresses (for a node, for the pod itself, for an EBS volume, and seemingly for other things that I don't fully understand). Our network settings' CIDR blocks only allowed up to 256 unique internal IPs (see the capacity sketch after this list): https://github.com/snowex-hackweek/jupyterhub/blob/3586e3a5676f9257bde7e9d3b8bb70303851f447/terraform/eks/main.tf#L53-L54

  2. We were forcing everything into a single availability zone instead of spreading everyone across multiple data centers: https://github.com/snowex-hackweek/jupyterhub/blob/3586e3a5676f9257bde7e9d3b8bb70303851f447/terraform/eks/main.tf#L105
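
To put numbers on the first point: a CIDR block's capacity is 2^(32 - prefix length), so a /24 tops out at 256 addresses while a /19 offers 8,192. A minimal, standalone Terraform sketch of that arithmetic (illustrative only, not part of our actual config):

```hcl
# Standalone illustration of CIDR capacity; drop into an empty
# directory and run `terraform apply`, or poke at the values in
# `terraform console`.
locals {
  # A CIDR block holds 2^(32 - prefix_length) addresses
  # (AWS additionally reserves 5 addresses in every subnet).
  slash_24_addresses = pow(2, 32 - 24) # 256  (what we had)
  slash_19_addresses = pow(2, 32 - 19) # 8192 (what we needed)
}

output "cidr_capacity" {
  value = {
    "/24" = local.slash_24_addresses
    "/19" = local.slash_19_addresses
  }
}
```

With nodes, pods, and volumes each consuming addresses, a /24 per subnet leaves very little headroom for 50+ simultaneous users.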

These settings were taken from examples in the terraform module repository we were using: https://github.com/terraform-aws-modules/terraform-aws-eks/search?p=1&q=%2F24&type=code

Over the last couple of months we've only had up to 30 simultaneous hub users, which is fine with these settings, but going to 50+ surfaced these issues.

Unfortunately, it turned out not to be simply a matter of changing these network settings to get more IP addresses. We first tried changing the terraform configuration above to the following, matching the Pangeo hub configuration, with CIDR settings that allow for 8,000+ unique IPs per subnet:

```hcl
public_subnets  = ["172.16.0.0/19", "172.16.32.0/19", "172.16.64.0/19"]
private_subnets = ["172.16.96.0/19", "172.16.128.0/19", "172.16.160.0/19"]
```
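
For context, a sketch of what VPC inputs addressing both problems might look like, assuming the terraform-aws-modules/vpc module; the module name, VPC CIDR, and availability zone names here are illustrative, not our exact configuration:

```hcl
# Hypothetical VPC configuration: /19 subnets (8192 addresses each)
# spread across three availability zones instead of one.
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "hackweek-vpc"  # illustrative name
  cidr = "172.16.0.0/16"

  azs             = ["us-west-2a", "us-west-2b", "us-west-2c"] # illustrative AZs
  public_subnets  = ["172.16.0.0/19", "172.16.32.0/19", "172.16.64.0/19"]
  private_subnets = ["172.16.96.0/19", "172.16.128.0/19", "172.16.160.0/19"]
}
```

The point is to get both settings right from the start, since resizing a VPC's subnets under a live EKS cluster is exactly what didn't work for us.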

That led to terraform errors, and possibly just isn't doable without destroying the existing VPC and the EKS cluster running in that VPC (see https://aws.amazon.com/premiumsupport/knowledge-center/vpc-ip-address-range/). Eventually we ran `terraform apply -target=module.eks` and deleted the cluster (but fortunately not other things being used by the hackweek, such as the EC2 instance with the database and the S3 bucket with tutorial data!).

With the cluster deleted, the helm history was also gone (and with it the mappings of everyone's EBS home directories), as was the JupyterHub configuration we had previously applied, so we had to redeploy JupyterHub and everyone started with new home directories.