modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.74k stars 650 forks source link

Modin not working with Django celery #7053

Open Abhi5h3k opened 5 months ago

Abhi5h3k commented 5 months ago

I tried Ray and Dask and both failed to work from celery task.

Is there any way to use Modin inside Django celery task ?

stackoverflow

For Dask the error was:

AssertionError: daemonic processes are not allowed to have children
[2024-03-12 09:13:07,640: INFO/ForkPoolWorker-7] Closing Nanny at 'tcp://127.0.0.1:38131'. Reason: nanny-instantiate-failed
YarShev commented 5 months ago

Hi @Abhi5h3k, thanks for opening this question! I am afraid it is not possible to use Modin with an engine like Ray, Dask or MPI inside a celery task. Why you are seeing the error is the following. When running a celery app, you create some celery worker processes depending usually on CPUs available, which will execute your tasks. Inside a celery task you use Modin, which spawns its own worker processes depending on the engine chosen to distribute operations on data. Sometimes, it is not possible. Read a more detailed answer on this matter on this post. You could try to initiatilize a Modin engine beforehand and use Modin objects inside your celery tasks.

import ray

ray.init(...)

import modin.pandas as pd

df = pd.DataFrame(...)
...

However, as I said, both celery and Modin will spawn their own worker processes, which may lead to oversubscription. Also, I am not sure if that will work at all. Modin transparently distributes the data and computation so that you can continue using the same pandas API while being able to work with more data faster. You just need to replace the import import pandas as pd to import modin.pandas as pd and Modin will take care of execution. So no need to use any other distributed tool (like celery in your case) to process operations in parallel alongside with Modin.