On day 1 we were caught off guard when only ~40 participants could log in to the hub, and we saw these messages in the autoscaler logs:
Long story short: Make sure your VPC network settings offer enough internal IP addresses for your cluster from the start!
```
Launching a new EC2 instance. Status Reason: Could not launch Spot Instances. InsufficientFreeAddressesInSubnet - There are not enough free addresses in subnet 'subnet-0da0f458c2cb44757' to satisfy the requested number of instances. Launching EC2 instance failed.
```
It turned out this error came down to two things. First, our subnets were configured with small /24 CIDR blocks, which in AWS leave only 251 usable IP addresses per subnet. Second, we were forcing everything into a single availability zone instead of spreading users across multiple data centers. These settings were taken from examples in the terraform module repository we were using: https://github.com/terraform-aws-modules/terraform-aws-eks/search?p=1&q=%2F24&type=code
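For reference, the settings in question were roughly of the following shape. This is a hypothetical reconstruction, not our actual file: the exact CIDR and availability zone values are assumptions, chosen to match the /24 examples linked above and the single-AZ setup described.

```hcl
# Hypothetical reconstruction of the problematic VPC settings.
# A /24 subnet has 256 addresses, of which AWS reserves 5,
# leaving 251 usable IPs -- and every pod consumes one.
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  cidr            = "10.0.0.0/16"
  azs             = ["us-west-2a"]    # everything in one availability zone
  private_subnets = ["10.0.1.0/24"]   # only ~251 usable IPs per subnet
  public_subnets  = ["10.0.101.0/24"]
}
```

With subnets this small, once a couple hundred pods, nodes, and system addresses are in use, the autoscaler can no longer place new instances, which matches the InsufficientFreeAddressesInSubnet error above.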
Over the last couple of months we'd had at most 30 simultaneous hub users, which these settings can handle. But going to 50+ surfaced these limits.
Unfortunately, it turned out not to be simply a matter of changing these network settings to make more IP addresses available. We first tried changing the terraform configuration above to match the pangeo hub configuration, whose CIDR settings allow for 8000+ unique IPs per subnet.
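The change we attempted looked something like the following. Again a sketch rather than our real configuration: the specific ranges are assumptions, picked so that each /19 subnet offers 8187 usable addresses, consistent with the 8000+ figure.

```hcl
# Hypothetical sketch of the pangeo-style settings we tried to apply.
# Each /19 subnet has 8192 addresses (8187 usable after AWS's 5
# reserved), and multiple AZs spread users across data centers.
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  cidr            = "10.0.0.0/16"
  azs             = ["us-west-2a", "us-west-2b", "us-west-2c"]
  private_subnets = ["10.0.32.0/19", "10.0.64.0/19", "10.0.96.0/19"]
  public_subnets  = ["10.0.128.0/19", "10.0.160.0/19", "10.0.192.0/19"]
}
```

Besides the larger address space, spreading the subnets across several availability zones also addresses the single-data-center problem described above.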
That led to terraform errors, and it may simply not be doable without destroying the existing VPC and the EKS cluster running in it (see https://aws.amazon.com/premiumsupport/knowledge-center/vpc-ip-address-range/). Eventually we ran
```
terraform apply -target=module.eks
```
and deleted the cluster (but fortunately not other resources the hackweek relied on, such as the EC2 instance with the database and the S3 bucket with tutorial data!). With the cluster deleted, the helm history was also gone, along with the mappings of everyone's EBS home directories and the jupyterhub configuration we had previously applied, so we had to redeploy jupyterhub and everyone started with new home directories.