paciorek / future-kubernetes

Instructions for setting up and using a Kubernetes cluster for running R in parallel using the future package.

Recovering from OOMkill pod failures/evictions #10

Open 1beb opened 3 years ago

1beb commented 3 years ago

One critical piece that I think makes this challenging to use at a larger scale is that R is a garbage-collected language.

There are a number of odd situations, especially when reading or writing files, in which memory continues to "grow": it ought to be garbage collected but never is. We were discussing this a little in the future repository. Henrik suggested using the callr plan, which works extremely well when you're working on a single computer but is incompatible with the setup command specified in the future-kubernetes Helm chart.
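For context, here is a minimal sketch of the callr plan on a single machine, assuming the future and future.callr packages are installed. The worker count and the toy task are illustrative assumptions, not taken from the helm chart. Each future is resolved in a fresh background R session that ends once its value is collected, so whatever memory the task accumulated is returned to the operating system.

```r
library(future)

# Resolve each future in a fresh background R session started via callr.
# workers = 2 is an arbitrary choice for this sketch.
plan(future.callr::callr, workers = 2)

# Toy task: report the PID of the R session that evaluated the future.
pids <- vapply(1:4, function(i) {
  f <- future(Sys.getpid())
  value(f)
}, integer(1))

# Every task ran outside the main R session.
print(pids)
print(Sys.getpid())
```

Because no worker session outlives its task, memory that R fails to reclaim cannot build up across tasks; the cost is the startup time of a new R session per future.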

I've been thinking about a number of alternative approaches:

Do you have any thoughts on how one might approach this?

paciorek commented 2 years ago

I'm not sure. This looks like an attempt to work around a flaw in how R handles certain circumstances. I'd be inclined to see if it could be addressed on the R side.

I think that if you managed to kill the R process in a given pod, the pod would restart, restarting R with it. So there's a chance that could somehow be used, though it feels pretty awkward.

Using ssh should be possible, but I didn't go down that path because it seems like working around Kubernetes rather than using Kubernetes the way it was intended.