sagemathinc / cocalc-kubernetes

Run CoCalc on a Kubernetes cluster

Cocalc project restarts. How to investigate? #21

Open Debilski opened 4 years ago

Debilski commented 4 years ago

We are experimenting with CoCalc (a slightly slimmed image with fewer kernels and with increased memory defaults) for remote teaching/pair programming. (It works pretty well!) I am currently noticing three different types of crashes and would like a hint on how to find out why each crash occurred, where to see the relevant logs, and how I can fix it.

1) Python kernel crashes. This seems to occur when I allocate too much memory, for example in a NumPy array. The relevant cell gets a red tag with the kernel killed message. All understandable, I can live with that. (Although I wouldn’t mind seeing this somewhere in some project admin/server admin logs.)

2) Project Pod sometimes gets killed. All I see is a Killed event in kubectl get events. Doesn’t happen super often, so it is not too bad, but I’d still like to get an idea why.

3) Project restarts without notice. Sometimes this happens every 10 minutes while people are working on a project, so it doesn’t seem to be some idle timeout. (I figured it’s not the worst thing that can happen for teaching, as it clears all hidden variables and gives the student a clean state. ;) ) This is the nastiest problem as the reason is very unclear to me and I wouldn’t know where to look (and which limit to increase).

Any hints?

williamstein commented 4 years ago

It might be that just updating the images would fix the problem. I don't know. Note that I spent about a month last year creating cocalc-kubernetes based on how cocalc-docker worked, but we've had a grand total of zero customers for cocalc-kubernetes (compared to quite a few for cocalc-docker). Thus development on cocalc-kubernetes has stalled, due to lack of demand from serious customers.

Debilski commented 4 years ago

Thanks for the info. We are already running an updated image (slightly patched, because the /health endpoint wouldn’t work, which caused even earlier crashes). I hadn’t looked into cocalc-docker though. Maybe it would already be sufficient for the next edition of our course.
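
For crash types 2 and 3 above, one possible starting point is the pod's own status rather than the event stream: for a container that was restarted in place, Kubernetes records the previous termination under status.containerStatuses[].lastState.terminated, including a reason (for example OOMKilled when the memory limit was hit) and an exit code (137 means the process received SIGKILL). The following is a minimal sketch, not something taken from cocalc-kubernetes: the function names summarize_restarts and fetch_pod are made up for illustration, and it assumes kubectl is on the PATH and configured for the cluster. If the whole pod was deleted and recreated instead of restarted in place, this field will be empty and the events or the controller logs are the place to look.

```python
import json
import subprocess
import sys


def summarize_restarts(pod: dict) -> list:
    """Return one summary dict per container that has restarted at least once.

    `pod` is the parsed JSON of a single pod, as printed by
    `kubectl get pod <name> -o json`.
    """
    summaries = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated describes the previous run of this container.
        terminated = status.get("lastState", {}).get("terminated")
        if status.get("restartCount", 0) == 0 or not terminated:
            continue
        summaries.append({
            "container": status.get("name"),
            "restarts": status["restartCount"],
            "reason": terminated.get("reason"),       # e.g. "OOMKilled", "Error"
            "exit_code": terminated.get("exitCode"),  # 137 = killed by SIGKILL
            "finished_at": terminated.get("finishedAt"),
        })
    return summaries


def fetch_pod(name: str, namespace: str = "default") -> dict:
    """Hypothetical helper: fetch one pod as JSON via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Usage: python restarts.py <pod-name> [namespace]
    if len(sys.argv) < 2:
        sys.exit("usage: restarts.py <pod-name> [namespace]")
    namespace = sys.argv[2] if len(sys.argv) > 2 else "default"
    for row in summarize_restarts(fetch_pod(sys.argv[1], namespace)):
        print(row)
```

A reason of OOMKilled would point at the project's memory limit. A restart caused by a failing liveness probe (such as a broken /health endpoint) would instead normally show up as Unhealthy and Killing events for the pod in kubectl get events.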