Open Abhi5h3k opened 7 months ago
Hi @Abhi5h3k, thanks for opening this question! I am afraid it is not possible to use Modin with an engine like Ray, Dask or MPI inside a celery task. Why you are seeing the error is the following. When running a celery app, you create some celery worker processes depending usually on CPUs available, which will execute your tasks. Inside a celery task you use Modin, which spawns its own worker processes depending on the engine chosen to distribute operations on data. Sometimes, it is not possible. Read a more detailed answer on this matter on this post. You could try to initiatilize a Modin engine beforehand and use Modin objects inside your celery tasks.
import ray
ray.init(...)
import modin.pandas as pd
df = pd.DataFrame(...)
...
However, as I said, both celery and Modin will spawn their own worker processes, which may lead to oversubscription. Also, I am not sure if that will work at all. Modin transparently distributes the data and computation so that you can continue using the same pandas API while being able to work with more data faster. You just need to replace the import import pandas as pd
to import modin.pandas as pd
and Modin will take care of execution. So no need to use any other distributed tool (like celery in your case) to process operations in parallel alongside with Modin.
I tried Ray and Dask and both failed to work from celery task.
Is there any way to use Modin inside Django celery task ?
stackoverflow
For Dask the error was: