We have no strict guarantee that m1-ultramem-40 nodes will be available when we need them, so I want to prepare and scale up and down a few more times before the 27th, when the course starts.
Fallback strategies are for example to use m1-ultramem-80 nodes or even larger machines, though those would be clumsier to scale down with.
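As a rough sketch of that fallback, a second node-pool can be prepared ahead of time and kept at zero nodes until it's needed. The cluster, zone, and pool names below are placeholders, not our actual values.

    # Prepare a fallback node-pool of m1-ultramem-80 machines, kept at 0 nodes
    # until we actually need it. CLUSTER, ZONE, and the pool name are placeholders.
    gcloud container node-pools create fallback-ultramem-80 \
        --cluster=CLUSTER --zone=ZONE \
        --machine-type=m1-ultramem-80 \
        --num-nodes=0 \
        --no-enable-autorepair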
Burst scaleup tests
When we scale up to ~1000 cores we could run into a GCP_STOCKOUT error along the way, meaning GCP didn't have enough servers available in the region/zone we reside in. To get a feel for whether this is a risk, I want to try the scaleup ahead of time a few times, just to gauge how many issues we might experience.
I'll test this on the upcoming Monday, one week ahead of the actual course start.
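For reference, a burst test like the ones logged below boils down to resizing the node-pool up and then back down; the cluster, zone, and pool names here are placeholders.

    # Burst scaleup to 24 m1-ultramem-40 nodes, then back down to 0.
    # CLUSTER, ZONE, and POOL are placeholders for our actual names.
    gcloud container clusters resize CLUSTER --zone=ZONE \
        --node-pool=POOL --num-nodes=24 --quiet
    # ... verify the nodes come online, then scale back down ...
    gcloud container clusters resize CLUSTER --zone=ZONE \
        --node-pool=POOL --num-nodes=0 --quiet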
Jul 19 - 07.40 UTC+0 -- Big scaleup and scaledown to 24 m1-ultramem-40 nodes without issues
Jul 20 - 16.20 UTC+0 -- Big scaleup to 24 m1-ultramem-40 nodes. 23 came online and then GCP's auto-repair started for some reason. During such an auto-repair I'm not allowed to cancel the operation or scale down, as the auto-repair operation blocks that (see the sketch after these log entries for inspecting in-flight operations). Reading up more, I concluded it would likely take one hour for the operation to complete, but the auto-repair issue went away after 25 minutes. I estimate this scaleup test cost ~65 USD.
Jul 20 - 21.25 UTC+0 -- Big scaleup and scaledown to 24 m1-ultramem-40 nodes without issues
Jul 20 - 23.26 UTC+0 -- Big scaleup and scaledown to 24 m1-ultramem-40 nodes without issues
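To see what is blocking during such an auto-repair, the ongoing GKE operations can be listed and inspected. This is a sketch assuming the ZONE and OPERATION_ID placeholders are replaced with real values.

    # List in-flight GKE operations (e.g. node auto-repairs) to understand
    # what is currently blocking a scale-down. ZONE is a placeholder.
    gcloud container operations list --zone=ZONE --filter="status=RUNNING"
    # Inspect a specific operation by the id reported above.
    gcloud container operations describe OPERATION_ID --zone=ZONE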
Lessons learned
GCP/GKE's auto-repair feature can be a risk to have enabled on a node-pool, as an ongoing auto-repair can block you from managing the cluster.
Seems good so far. I've disabled auto-repair to avoid issues and prepared an m1-ultramem-80 node-pool in case we run into trouble finding m1-ultramem-40 nodes.
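For completeness, disabling auto-repair on an existing node-pool is a one-liner; the names below are placeholders and this assumes a reasonably recent gcloud.

    # Turn off GKE's auto-repair on the node-pool so an auto-repair can't
    # block manual scale-down during the course. CLUSTER, ZONE, and POOL are placeholders.
    gcloud container node-pools update POOL \
        --cluster=CLUSTER --zone=ZONE \
        --no-enable-autorepair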