jfecunha opened this issue 2 weeks ago (Open)
I got this working by changing `workers_pool.imap_unordered` to `workers_pool.map`, so I guess this is related to the iterator that `imap_unordered` returns as each task is executed. Does Ray have built-in progress bars to get the status of the jobs? That was the main reason for using `imap_unordered` instead of `map`.
What happened + What you expected to happen
I am using Ray's `multiprocessing.Pool` to run data processing tasks inside a Docker container on GCP Vertex AI (single machine).
The task being processed consists of the following steps (simplified for clarity; we make further transformations on the JSON file before generating the numpy matrix):
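For illustration only, a worker of the kind described (read a JSON file, build a numpy matrix, write the result) might look roughly like this; `process_file`, the `"rows"` key, and the `.npy` output are all hypothetical names, not the actual code:

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def process_file(src: Path, out_dir: Path) -> Path:
    """Hypothetical task: JSON file in, saved numpy matrix out."""
    with open(src) as f:
        payload = json.load(f)
    # Further transformations on the JSON would happen here before
    # the matrix is generated.
    matrix = np.asarray(payload["rows"], dtype=np.float64)
    out_path = out_dir / (src.stem + ".npy")
    np.save(out_path, matrix)
    return out_path

# Example run against a temporary file.
tmp = Path(tempfile.mkdtemp())
src = tmp / "sample.json"
src.write_text(json.dumps({"rows": [[1, 2], [3, 4]]}))
loaded = np.load(process_file(src, tmp))
```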
Everything works as expected until the end of the script, when the pool is shut down and some statistics are computed about the number of files that were processed.
The error that I see is the following:
Looking into the folder where the outputs are expected, I see that all the files were processed, apart from some exceptions for files skipped due to data quality issues.
I am trying to understand why this error is happening and causing the job to fail in Vertex AI.
I ran other jobs with less data and did not see this issue. As a benchmark, 400k files work as expected; above that, I see the reported error. I don't know if this helps.
Versions / Dependencies
I built a Docker image to run on an AMD-based machine in GCP with the following dependencies:
I also tried building the Docker image from Ray's supported images (`FROM rayproject/ray:nightly-py39-cpu`), but the outcome was the same. I also tried Python 3.10.
Reproduction script
This script is executed in the Docker entrypoint.
Issue Severity
High: It blocks me from completing my task.