microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

kbatch job deletion not working properly #188

Closed mhelleis closed 2 weeks ago

mhelleis commented 1 year ago

Hello,

First of all, thank you for the amazing job you're doing with the PC! It's really an impressive tool.

I'm currently setting up kbatch to run a longer training job, and I'm running into an issue where running jobs aren't cancelled correctly (deletion worked fine for previous CPU-flavour jobs).

When I try to stop the job with `kbatch job delete <job_name>`, the response is

```json
{
  "api_version": "batch/v1",
  "code": null,
  "details": null,
  "kind": "Job",
  "message": null,
  "metadata": {
    "_continue": null,
    "remaining_item_count": null,
    "resource_version": "241058839",
    "self_link": null
  },
  "reason": null,
  "status": "{'startTime': '2023-02-16T08:55:25Z', 'active': 1, 'ready': 1}"
}
```

However, the job status remains as active and all related pods are still running. Is there any way to force kill running jobs and pods from the client side?
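One quirk worth noting: the `status` field in that response is a stringified Python dict rather than nested JSON, so checking whether the job still reports active pods takes an extra parse step. A small sketch using the response above (the field name and value are taken directly from it):

```python
import ast

# The "status" field from the kbatch delete response above is a repr of a
# Python dict, not JSON, so json.loads() would choke on the single quotes;
# ast.literal_eval parses the dict literal safely.
status_str = "{'startTime': '2023-02-16T08:55:25Z', 'active': 1, 'ready': 1}"

status = ast.literal_eval(status_str)
print(status.get("active", 0))  # 1 -> the job still reports an active pod
```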

Additional info:

jessjaco commented 1 year ago

I'm seeing this too, or a variation of it. If I delete a job, the pod keeps running.

jmoortgat commented 1 year ago

Has anyone found a solution for this? The issue seems to persist. I cannot kill a running job or pod either.

jessjaco commented 1 year ago

I discovered that you can shut down a running cluster using the usual Gateway interface, but that doesn't remove it from the kbatch list, so maybe not much help.

For instance, on the hub run

```python
from dask_gateway import Gateway

gateway = Gateway()
gateway.list_clusters()
```

which will produce a list of clusters, e.g. `[ClusterReport<name=prod.301fdf80f3374fe3a540aaba5dfda115, status=RUNNING>, ClusterReport<name=prod.6032a9ee0e7a48af962b7df2a0679121, status=RUNNING>]`. If you don't want to shut them all down, identify the one you want by putting `print(client.dashboard_link)` in the code you run through kbatch and finding that name in the address in the pod logs.

Then you can shut down the cluster using:

```python
gateway.connect(gateway.list_clusters()[0].name).shutdown()
```

Again, while this appears to shut down the cluster, it is still listed as running in the pod logs. (I admit that I don't know the difference between a pod and a cluster.)
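As a side note, the dask_gateway client also exposes `Gateway.stop_cluster`, which stops a cluster by name without connecting to it first. A minimal sketch, assuming a reachable gateway (untested here; the loop over all clusters is just an illustration):

```python
from dask_gateway import Gateway

gateway = Gateway()

# Stop every cluster the gateway reports; pick a single report.name
# instead if you only want to kill one of them.
for report in gateway.list_clusters():
    gateway.stop_cluster(report.name)
```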

jmoortgat commented 1 year ago

For me this doesn't show any clusters (empty list). It doesn't seem like `kbatch job submit` starts a cluster.

jessjaco commented 1 year ago

Are you using the dask gateway? Sorry, I assumed you were.

jmoortgat commented 1 year ago

No, I'm not.

jessjaco commented 1 year ago

Disregard then

jmoortgat commented 1 year ago

Anyone have other suggestions? I now have multiple kbatch job attempts running that are unable to write output data to an Azure blob due to authentication issues (my fault), so no output will be generated, yet I cannot stop the jobs. These jobs would normally take a week or so to complete, so I cannot do anything until they finish. I also can't stream the log output (`kbatch pod logs --stream` times out), which makes me worry that the jobs may even be stalled (possibly a lack of memory, since I'm running 4 jobs).

TomAugspurger commented 1 year ago

Most likely there's an issue within kbatch itself, but I unfortunately haven't had the time to work on that. You might be better off setting up compute in your own Azure subscription, and using some other mechanism to manage the compute.
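For anyone running into this on a cluster they control directly (rather than the managed Hub, where users don't have kubectl access), a Kubernetes Job and its Pods can be removed together by making the delete cascade; orphaned pods after a delete are consistent with a delete call that didn't cascade. A sketch with placeholder job/pod names and namespace:

```shell
# Delete the Job and wait for its Pods to be removed too
# ("my-kbatch-job" and "default" are placeholders).
kubectl delete job my-kbatch-job --namespace default --cascade=foreground

# If a Pod still lingers afterwards, it can be force-removed:
kubectl delete pod my-kbatch-job-abc12 --namespace default \
  --grace-period=0 --force
```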

ghidalgo3 commented 2 weeks ago

Closed due to inactivity, feel free to reopen if you would like to continue this discussion.