microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Really hard to recover from interrupted kernel #9

Closed MikeBeller closed 3 years ago

MikeBeller commented 3 years ago

I typically can't get the cluster to stop or shut down if I interrupt my kernel while a graph is running. Calling cluster.shutdown() or cluster.close() just hangs. I do see there is an issue for dask-gateway (https://github.com/dask/dask-gateway/issues/155) that seems to be related.

But when this happens to me on MPC, I have the further problem that when I restart my kernel, I often acquire a cluster which is "lame" and refuses to start executing any tasks. Perhaps this is because there are still old resources allocated to me?

Basically it is very hard to recover from interrupted jobs. Thoughts on how I can work around this?

MikeBeller commented 3 years ago

Adding to this -- now that I have some perspective from the very helpful answer to #8 -- here is one way to approach this issue:

Is there a way to release the resources of a cluster to which I no longer hold any (variable) references? I.e., is there a call like "release the resources of any cluster associated with my machine"? Or, second best, "release all resources associated with the cluster whose dashboard is at this URL" (because I usually do have the URL somewhere)?

Maybe I shouldn't be worrying so much about it, but I hate the idea of my jobs still running, using up cloud resources, when I know I can't get back the results.

TomAugspurger commented 3 years ago

Hmm, I'm having a bit of trouble understanding what's going on, so I'll try to explain what's supposed to happen and what might fail. One question up front, though:

I typically can't get the cluster to stop/shutdown if I interrupt my kernel while a graph was running

By "interrupt" do you mean CTRL-C / hitting the "stop" button? Or do you mean restarting the kernel?

When you interrupt the kernel, that shouldn't affect the cluster at all, except perhaps if you interrupt it while it's calling .compute() on a dask collection. I think in that case Dask will cancel the futures associated with the .compute() and so some stuff will stop running.
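
One way to make sure interrupted work actually stops (just a sketch, not something the Planetary Computer requires; `cluster` here is assumed to be your existing GatewayCluster) is to submit the graph asynchronously and cancel it yourself on interrupt:

import dask.array as da

client = cluster.get_client()        # reuse the cluster you already created
x = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))
future = client.compute(x.sum())     # submit the graph without blocking

try:
    result = future.result()         # "interrupt kernel" lands here while you wait
except KeyboardInterrupt:
    client.cancel(future)            # explicitly tell the scheduler to drop the pending tasks
    raise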

When you restart a kernel, the cluster should shut down gracefully. Starting a cluster registers a handler that closes the cluster when the client Python process shuts down. There are cases where that can't happen (e.g. your Python interpreter segfaults), which are tracked in https://github.com/dask/dask-gateway/issues/260.
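
If you want shutdown to be deterministic in your own code rather than relying on that handler, one pattern (a sketch, not something the hub requires) is to use the cluster as a context manager:

from dask_gateway import GatewayCluster

with GatewayCluster() as cluster:    # shut down when the block exits, even on errors
    client = cluster.get_client()
    cluster.scale(4)
    # ... run your computation ...
# cluster resources are released here (unless the kernel process itself dies mid-block)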

To handle that last case, we also set an --idle-timeout of something like 10-20 minutes on the Dask scheduler. If the scheduler goes that long without any "activity" (e.g. a client connecting or a task being submitted), the cluster will be closed.
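
For reference, that corresponds to Dask's distributed.scheduler.idle-timeout setting. On the hub it's configured server-side, so you don't need to set it yourself, but if you ever run your own scheduler, a minimal sketch of setting it through dask.config looks like:

import dask

# only affects schedulers you start yourself, not ones launched by the gateway
dask.config.set({"distributed.scheduler.idle-timeout": "15 minutes"})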

If you do have cluster resources that you worry are sitting idle, you can check by creating a Gateway client and listing your running clusters, as described at https://gateway.dask.org/usage.html#connect-to-a-dask-gateway-server:

>>> import dask_gateway
>>> gateway = dask_gateway.Gateway()
>>> gateway.list_clusters()

If there are (unexpected) running clusters there, you can stop them:

for cluster in gateway.list_clusters():
    gateway.stop_cluster(cluster.name)
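
And to your earlier question about a cluster you no longer hold a variable reference to: you can also reconnect to a specific cluster by name and shut it down from there. A sketch (the name comes from the listing above; I wouldn't rely on reconstructing it from a dashboard URL):

import dask_gateway

gateway = dask_gateway.Gateway()
for report in gateway.list_clusters():
    print(report.name, report.status)                 # find the cluster you care about

cluster = gateway.connect("<name from the listing>")  # reattach by name
cluster.shutdown()                                    # release its resources
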
MikeBeller commented 3 years ago

Thanks Tom, this is super helpful.

To answer your question: I was interrupting the kernel via the "interrupt kernel" menu item in the Jupyter notebook. I do believe (but am not 100% sure) that I interrupted during a .compute(), and the computation continued on (as evidenced by the dashboard). I don't recall whether it continued past my subsequent kernel restart (which would be unexpected per the above).

But now that I understand how it should work, if I see something like this happen again I will (1) know whether it's actually unexpected behavior, and (2) know how to clean up the resources. Cheers.