neurohackademy / nh2020-jupyterhub

hub.neurohackademy.org: Deployment config, docker image, documentation.

Actively verify GCP can deliver m1-ultramem-40 nodes #31

Closed: consideRatio closed this issue 4 years ago

consideRatio commented 4 years ago

We have no strict guarantee that m1-ultramem-40 nodes will be available when we need them, so I want to prepare by scaling up and down a few more times before the 27th, when the course starts.

A fallback strategy is to use m1-ultramem-80 nodes or even larger machines, though those would be clumsier to scale down with.
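For reference, preparing such a fallback pool ahead of time could look roughly like the sketch below, here using the google-cloud-container Python client. The project, zone, cluster, and pool names are placeholders rather than the actual values of this deployment, and the same thing can of course be done from the console or the gcloud CLI.

```python
# Sketch: pre-create a fallback node pool of larger machines, scaled to zero,
# so it can absorb users if m1-ultramem-40 capacity runs out.
# All resource names below are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

parent = "projects/my-project/locations/us-central1-b/clusters/my-cluster"

fallback_pool = container_v1.NodePool(
    name="user-m1-ultramem-80",
    initial_node_count=0,
    config=container_v1.NodeConfig(machine_type="m1-ultramem-80"),
    autoscaling=container_v1.NodePoolAutoscaling(
        enabled=True, min_node_count=0, max_node_count=10
    ),
    # Leave auto-repair off, see the lessons learned below.
    management=container_v1.NodeManagement(auto_repair=False),
)

op = client.create_node_pool(request={"parent": parent, "node_pool": fallback_pool})
print(op.status)
```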

Burst scale-up tests

When we scale up to ~1000 cores we could run into a GCP_STOCKOUT error along the way, which means the region/zone we reside in didn't have enough servers available. To get a feel for whether this is a risk, I want to try scaling up ahead of time a few times and gauge how many issues we might run into.

I'll test this on the upcoming Monday, one week ahead of the actual course start.
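The kind of burst test I have in mind could be scripted along these lines with the google-cloud-container client (all names are placeholders); ~1000 cores corresponds to 25 m1-ultramem-40 nodes at 40 vCPUs each.

```python
# Sketch: burst scale-up test of the user node pool. A stockout-style failure
# would surface on the returned operation or as nodes that never appear.
# All resource names are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

node_pool = (
    "projects/my-project/locations/us-central1-b/"
    "clusters/my-cluster/nodePools/user-m1-ultramem-40"
)

# ~1000 cores at 40 vCPUs per m1-ultramem-40 node => 25 nodes.
op = client.set_node_pool_size(request={"name": node_pool, "node_count": 25})
print(op.name, op.status)
# Once the operation has finished and the result has been inspected, the same
# call with node_count=0 scales the pool back down.
```

Resizing the pool directly exercises GCE capacity in the zone, which is what a stockout would hit; it sidesteps the cluster autoscaler, which is fine for this particular check.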

Lessons learned

GCP/GKE's auto-repair feature can be risky to enable on a node pool, as it can block you from managing the cluster.
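For the record, auto-repair can be toggled per node pool; a sketch with the Python client is below (placeholder names again), and the same setting is exposed in the console and via gcloud.

```python
# Sketch: disable auto-repair on an existing node pool so an automatic repair
# can't block cluster management mid-course. Keep in mind this call sets the
# whole management block (auto_repair and auto_upgrade together).
# All resource names are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

node_pool = (
    "projects/my-project/locations/us-central1-b/"
    "clusters/my-cluster/nodePools/user-m1-ultramem-40"
)

op = client.set_node_pool_management(
    request={
        "name": node_pool,
        "management": {"auto_repair": False, "auto_upgrade": False},
    }
)
print(op.status)
```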

consideRatio commented 4 years ago

Seems good so far. I've disabled auto-repair to avoid issues and prepared an m1-ultramem-80 node pool in case we run into trouble finding m1-ultramem-40 nodes.