mhelleis closed this 2 weeks ago
I'm seeing this too, or a variation of it. If I delete a job the pod keeps running.
Has anyone found a solution for this? The issue still seems to persist. I cannot kill a running job or pod either.
I discovered that you can shut down a running cluster using the usual Gateway interface, but it doesn't remove it from the kbatch list, so maybe not much help.

For instance, on the hub run

```python
from dask_gateway import Gateway

gateway = Gateway()
gateway.list_clusters()
```

which will produce a list of clusters, e.g.

```
[ClusterReport<name=prod.301fdf80f3374fe3a540aaba5dfda115, status=RUNNING>,
 ClusterReport<name=prod.6032a9ee0e7a48af962b7df2a0679121, status=RUNNING>]
```

If you don't want to shut them all down, figure out which one you want by putting `print(client.dashboard_link)` in the code you are running through kbatch and finding the name in the address in the pod logs.

Then you can shut down a cluster using:

```python
gateway.connect(gateway.list_clusters()[0].name).shutdown()
```

Again, while this appears to shut down the cluster, it is still listed as running in the pod logs. (I admit that I don't know the difference between a pod and a cluster.)
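If you just want to clear everything out at once, a small loop over the cluster reports should also work. A minimal sketch, assuming every listed cluster is one you actually want to kill:

```python
from dask_gateway import Gateway

gateway = Gateway()

# Shut down every cluster the gateway reports for your account.
for report in gateway.list_clusters():
    print(f"Shutting down {report.name}")
    gateway.connect(report.name).shutdown()
```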
For me this doesn't show any clusters (empty list). Doesn't seem like "kbatch job submit" starts a cluster.
Are you using the Dask Gateway? Sorry, I assumed you were.
No, I'm not.
Disregard then
Anyone have other suggestions? I now have multiple attempts at kbatch jobs running that are unable to write output data to an Azure blob due to authentication issues (my fault), so no output will be generated, yet I cannot stop the jobs. These jobs would normally take a week or so to complete, so I cannot do anything until they finish. I also cannot stream the log output (`kbatch pod logs --stream` times out), which makes me worried that the jobs may even be stalled (possibly from lack of memory, since I'm running 4 jobs).
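For anyone hitting the same authentication problem, a quick sanity check before submitting a long job might help. A minimal sketch, assuming `azure-storage-blob` is installed and you have a SAS URL for the output container (the account, container, and blob names below are hypothetical placeholders):

```python
from azure.storage.blob import ContainerClient

# Hypothetical SAS URL; replace with the container you intend to write to.
container = ContainerClient.from_container_url(
    "https://<account>.blob.core.windows.net/<container>?<sas-token>"
)

# A tiny test write: if authentication is broken, this raises immediately
# instead of failing silently inside a week-long kbatch job.
container.upload_blob("kbatch-auth-check.txt", b"ok", overwrite=True)
print("Test write succeeded; credentials look OK.")
```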
Most likely there's an issue within kbatch itself, but I unfortunately haven't had the time to work on that. You might be better off setting up compute in your own Azure subscription and using some other mechanism to manage the compute.
Closed due to inactivity, feel free to reopen if you would like to continue this discussion.
Hello,
First of all, thank you for the amazing job you're doing with the PC! It's really an impressive tool.

I'm currently setting up `kbatch` to run a longer training job, and I'm facing the issue that running jobs won't be cancelled correctly (the deletion worked fine with previous jobs that used a CPU flavour). If I try to stop the job with `kbatch job delete <job_name>`, the job status nevertheless remains `active` and all related pods are still `running`. Is there any way to force kill running jobs and pods from the client side?

Additional info:

- `planetary-computer/gpu-pytorch` image (just added `azure-storage-file-share`)
- `kbatch` version from https://github.com/kbatch-dev/kbatch/pull/51
- `kbatch` seems to be significantly slower compared to running the same pipeline in JupyterHub