sagemathinc / cocalc

CoCalc: Collaborative Calculation in the Cloud
https://CoCalc.com

issue with multiple static webservers and kubernetes updates #1096

Closed: williamstein closed this issue 8 years ago

williamstein commented 8 years ago

Imagine updating the static website with a new image. The smc-webapp-static kubernetes pods get updated one by one (or maybe several at once), but not all at once. A user could very easily refresh their browser during this update. The way haproxy is configured now, I think it round-robins the static HTTP requests, so index.html might come from one pod while 3.js comes from another (not yet updated!) pod, etc., and the result is a mixed-up disaster.
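To make the failure mode concrete, the relevant haproxy backend is presumably something like the following (a sketch, not our actual config; server names and ports are guesses):

```
backend static
    balance roundrobin                        # each request can land on a different pod
    server static0 smc-webapp-static-0:80 check
    server static1 smc-webapp-static-1:80 check
    server static2 smc-webapp-static-2:80 check
```

With round-robin and no affinity, the requests from a single page load are spread across the pods, so during a rolling update some of them hit old pods and some hit new ones.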

williamstein commented 8 years ago

@haraldschilly, thoughts about this? I did hit this earlier today.

haraldschilly commented 8 years ago

Well, if you ever encounter a file named 3.js in production, we have a serious issue. The files are named n-[hash].js, like 1-9f1d3f9d20d9dc796bde.js.

What was the error exactly?

When we changed this to run in a container, I shared some ideas with you about how this could be done. The key is to untangle updating the index file from updating the asset files; since they're currently tied together, exactly this can happen. One of my ideas was to use a CDN instead of pods, or, more to the point, instead of recreating and replacing the running pods, just updating the files they're serving (assets first, then the index file to conclude the update).
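To make the ordering concrete, a deploy along those lines would be roughly like the sketch below (the bucket name, paths, and use of gsutil are all hypothetical; only the ordering matters):

```bash
#!/usr/bin/env bash
# Sketch of an "assets first, index last" update -- bucket name and paths are made up.
set -euo pipefail

BUILD=./smc-webapp/static           # freshly built output: hashed assets + index.html
DEST=gs://example-static-bucket     # whatever the webservers / CDN actually serve from

# 1) Push the content-hashed asset files. Old filenames stay in place, so a client
#    that still has the old index.html can keep fetching everything it references.
gsutil -m rsync -r -x 'index\.html$' "$BUILD" "$DEST"

# 2) Only after every asset is in place, publish the new index.html, which points
#    new page loads at the new hashed filenames.
gsutil cp "$BUILD/index.html" "$DEST/index.html"
```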

williamstein commented 8 years ago

> What was the error exactly?

It was something like "1-9f1d3f9d20d9dc796bde.js is missing" or whatever. It happened for a few minutes.

> instead of recreating and replacing the running pods, just updating the files they're serving (assets first, then the index file to conclude the update)

That will still be subject to the same sort of "race condition", right? The window of time is smaller, but it's still an issue. Also, if we did that, we would have to be careful that we can still easily and safely roll back to previous versions.

For now I'll try just changing to exactly 1 static server instead of 5. There's never much load, and I think with 1 replica it'll fully create the new pod, then start it, then stop the old one, then direct traffic to the new one.

Alternatively, maybe k8s has a mode where it starts all the new ones, then instantly switches traffic, rather than changing traffic little by little.

In any case, I think this is a problem to solve via k8s.
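For reference, the closest built-in knob seems to be a rolling update that surges a full set of new pods before removing any old ones. A hypothetical manifest sketch (names, image, and apiVersion are placeholders, not our actual config); note it still doesn't switch traffic atomically, since the service sends requests to new pods as soon as they're ready:

```yaml
apiVersion: apps/v1              # may differ depending on the k8s version in use
kind: Deployment
metadata:
  name: smc-webapp-static
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: "100%"           # bring up a full set of new pods first...
      maxUnavailable: 0          # ...and never take an old pod down before a new one is ready
  selector:
    matchLabels:
      app: smc-webapp-static
  template:
    metadata:
      labels:
        app: smc-webapp-static
    spec:
      containers:
        - name: static
          image: gcr.io/EXAMPLE/smc-webapp-static:TAG   # placeholder image
          ports:
            - containerPort: 80
```

A truly atomic switch would be closer to blue/green: run the new pods under a second deployment and flip the service's label selector once they're all ready.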

williamstein commented 8 years ago

Also, before doing this I just realized it's critical that the autolabeling of preemptible and non-preemptible nodes is available, since if there is only one smc-webapp-static pod, and it is served from a preemptible node, then the site could be down whenever that node gets preempted.
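Once that labeling exists, keeping the static pods off preemptible nodes should just be a nodeSelector in the deployment's pod template, roughly like this (the label key and value are assumptions about what the autolabeler sets):

```yaml
# fragment of the smc-webapp-static pod template (hypothetical label)
spec:
  template:
    spec:
      nodeSelector:
        preemptible: "false"   # only schedule onto nodes labeled as non-preemptible
```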

williamstein commented 8 years ago

I just reduced the number of static webservers from 5 to 2, which is fine for our app, and that seems to hide this problem. So lowering priority.

haraldschilly commented 8 years ago

I'm currently looking into how the server is actually set up. It might be that, with some additional rules, we can use Cloudflare's caching to fix this completely.

haraldschilly commented 8 years ago

I've told Cloudflare to cache everything under /static/* and to use a 2 hour TTL for its edge caching. I assume this means that old static files will still be cached and available from the edge even after they're gone from the restarted webservers.
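For completeness, the origin-side counterpart would be to send an explicit Cache-Control header on /static/ responses; assuming the static pods serve via nginx (an assumption on my part; the actual change is only the Cloudflare page rule), that would look roughly like:

```nginx
# sketch: let edge caches keep /static/* responses for 2 hours
location /static/ {
    add_header Cache-Control "public, max-age=7200";
}
```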

williamstein commented 8 years ago

I think your idea solves this problem.