We have no strict guarantee that m1-ultramem-40 nodes will be available when we need them, so I want to prepare and scale up and down a few more times before the 27th, when the course starts.
Fallback strategies are for example to use m1-ultramem-80 nodes or even larger machines, though those would be clumsier to scale down with.
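As a rough sketch of that fallback, a second node-pool can be prepared ahead of time and kept at zero nodes until it's needed. The cluster, zone, and pool names below are placeholders, not our actual values.

    # Prepare a fallback node-pool of m1-ultramem-80 machines, kept at 0 nodes
    # until we actually need it. CLUSTER, ZONE, and the pool name are placeholders.
    gcloud container node-pools create fallback-ultramem-80 \
        --cluster=CLUSTER --zone=ZONE \
        --machine-type=m1-ultramem-80 \
        --num-nodes=0 \
        --no-enable-autorepair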
Burst scaleup tests
When we scale up to ~1000 cores we could run into a GCP_STOCKOUT error along the way, meaning GCP didn't have enough servers available in the region/zone we reside in. To get a feel for whether this is a risk, I want to try the scaleup ahead of time a few times, just to gauge how many issues we might experience.
I'll test this on the upcoming Monday, one week ahead of the actual course start.
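For reference, a burst test like the ones logged below boils down to resizing the node-pool up and then back down; the cluster, zone, and pool names here are placeholders.

    # Burst scaleup to 24 m1-ultramem-40 nodes, then back down to 0.
    # CLUSTER, ZONE, and POOL are placeholders for our actual names.
    gcloud container clusters resize CLUSTER --zone=ZONE \
        --node-pool=POOL --num-nodes=24 --quiet
    # ... verify the nodes come online, then scale back down ...
    gcloud container clusters resize CLUSTER --zone=ZONE \
        --node-pool=POOL --num-nodes=0 --quiet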
Jul 19 - 07.40 UTC+0 -- Big scaleup and scaledown to 24 m1-ultramem-40 nodes without issues
Jul 20 - 16.20 UTC+0 -- Big scaleup to 24 m1-ultramem-40 nodes. 23 came online and then GCP's auto-repair started for some reason. During such an auto-repair I'm not allowed to cancel the operation or scale down, as the auto-repair operation blocks that (see the sketch after these log entries for inspecting in-flight operations). Reading up more, I concluded it would likely take one hour for the operation to complete, but the auto-repair issue went away after 25 minutes. I estimate this scaleup test cost ~65 USD.
Jul 20 - 21.25 UTC+0 -- Big scaleup and scaledown to 24 m1-ultramem-40 nodes without issues
Jul 20 - 23.26 UTC+0 -- Big scaleup and scaledown to 24 m1-ultramem-40 nodes without issues
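To see what is blocking during such an auto-repair, the ongoing GKE operations can be listed and inspected. This is a sketch assuming the ZONE and OPERATION_ID placeholders are replaced with real values.

    # List in-flight GKE operations (e.g. node auto-repairs) to understand
    # what is currently blocking a scale-down. ZONE is a placeholder.
    gcloud container operations list --zone=ZONE --filter="status=RUNNING"
    # Inspect a specific operation by the id reported above.
    gcloud container operations describe OPERATION_ID --zone=ZONE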
Lessons learned
GCP/GKE's auto-repair feature can be a risk to have enabled on a node-pool, as an ongoing auto-repair can block you from managing the cluster.
Seems good so far. I've disabled auto-repair to avoid issues and prepared an m1-ultramem-80 node-pool in case we run into trouble finding m1-ultramem-40 nodes.
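For completeness, disabling auto-repair on an existing node-pool is a one-liner; the names below are placeholders and this assumes a reasonably recent gcloud.

    # Turn off GKE's auto-repair on the node-pool so an auto-repair can't
    # block manual scale-down during the course. CLUSTER, ZONE, and POOL are placeholders.
    gcloud container node-pools update POOL \
        --cluster=CLUSTER --zone=ZONE \
        --no-enable-autorepair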