Open agopinath1205 opened 4 months ago
I have two steps in the Ray job: one runs on GPU to compute embeddings, and the other runs on CPU, which is the one getting stuck.
Hi @agopinath1205,
Can you try the latest Ray (2.33)? We have fixed a few issues related to this. You can also use the Ray State API (ray list tasks --detail) to see which task is not completing and which stage it is in.
I am using KubeRay. Is there a way I can debug this through the KubeRay operator?
Same thing, you can run ray list tasks --detail inside the head pod.
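If it is easier to inspect the tasks from Python than from the CLI, here is a minimal sketch, assuming it is run inside the head pod with Ray installed and the cluster reachable. The helper names (`_field`, `unfinished_tasks`, `fetch_task_records`) are hypothetical and not part of Ray; the only Ray call used is `ray.util.state.list_tasks`. The `node_id` field is assumed to be present when `detail=True`, and the helper falls back to `None` if it is not.

```python
def _field(record, name):
    """Read a field from either a dict or an attribute-style record."""
    if isinstance(record, dict):
        return record.get(name)
    return getattr(record, name, None)


def unfinished_tasks(records):
    """Return (name, state, node_id) for every task whose state is not FINISHED."""
    return [
        (_field(r, "name"), _field(r, "state"), _field(r, "node_id"))
        for r in records
        if _field(r, "state") != "FINISHED"
    ]


def fetch_task_records(limit=1000):
    """Query the running cluster for task records.

    Assumption: this runs where the cluster address can be auto-detected,
    for example inside the head pod.
    """
    from ray.util.state import list_tasks  # imported lazily so the helpers work without Ray

    return list_tasks(detail=True, limit=limit)


if __name__ == "__main__":
    # Print every task that has not finished, with its state and node.
    for name, state, node_id in unfinished_tasks(fetch_task_records()):
        print(f"{name}\t{state}\t{node_id}")
```

The state of the one remaining task (for example, still pending versus running) should help narrow down where it is stuck.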
Sounds good! I am trying to understand why the last task is getting stuck. What could be the root cause?
I cannot tell until I have more information from ray list tasks --detail
What happened + What you expected to happen
I have a batch Ray job where most of the tasks complete, but the last task in an actor gets stuck and never completes.
Ray tasks in batch operation not completing - 144 out of 145 tasks complete but not all
I am not sure whether this is a problem on my end or in Ray, or whether some configuration needs to be added. Any help is appreciated.
Versions / Dependencies
I am using Ray on Kubernetes; the Ray image is 2.24, and I am running a TensorFlow model.
Reproduction script
def process_batch(file_batch, create=False):
    ds_ray_data = ray.data.read_parquet(file_batch)
    exploded_df_source_programs = ds_ray_data.map_batches(
        GenerateEmbeddings,
        fn_constructor_kwargs={
            "model": local_path,  # Local testing
Issue Severity
High: It blocks me from completing my task.