modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

BUG: Apply on axis=1 causes "daemonic processes are not allowed to have children" on some operations on Dask engine, or launches Ray instance #7346

Open data-makerman opened 1 month ago

data-makerman commented 1 month ago

Modin version checks

Reproducible Example

import modin.pandas as pd
import modin.config as cfg_modin

cfg_modin.Engine.put("dask")

def unworking_meaningless_row_operation(row):
    row_to_dict = row.to_dict()
    dict_to_row = pd.Series(row_to_dict)
    return dict_to_row

def fine_meaningless_row_operation(row):
    row = row.str.upper()
    return row

if __name__ == "__main__":
    df = pd.DataFrame(
        {
            "A": ["a", "b", "c", "d"],
            "B": [1, 2, 3, 4],
            "C": [1, 2, 3, 4],
            "D": [1, 2, 3, 4],
        }
    )

    df = df.apply(fine_meaningless_row_operation, axis=1)
    print("Fine worked fine")
    df = df.apply(unworking_meaningless_row_operation, axis=1)

Issue Description

When running a df.apply operation which uses a Series.to_dict() call on a Dask engine, I get a traceback telling me: AssertionError: daemonic processes are not allowed to have children. This was without Ray/modin[ray] installed. Other apply operations succeed.

Installing modin[ray] also causes this operation to succeed. Weirdly (to me), the first operation in the included reproducible example seems to run on the Dask engine before Modin automatically creates a Ray instance to handle the second operation.
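
For context, the assertion at the bottom of the traceback is raised by Python's own `multiprocessing` module, not by Modin or Dask code. Below is a minimal sketch that triggers the same message without Modin or Dask. It assumes a POSIX system where the `fork` start method is available, and the helper names (`_noop`, `_try_to_spawn`, `spawn_from_daemon`) are made up for illustration. It does not show why this particular `apply` call ends up starting processes from inside a daemonic worker.

```python
import multiprocessing as mp


def _noop():
    """Target for the would-be grandchild process; it never actually runs."""


def _try_to_spawn(conn):
    """Runs inside a daemonic process and tries to start a child of its own."""
    try:
        child = mp.Process(target=_noop)
        child.start()  # multiprocessing asserts that the caller is not daemonic
        child.join()
        conn.send("child started")
    except AssertionError as exc:
        conn.send(f"AssertionError: {exc}")
    finally:
        conn.close()


def spawn_from_daemon():
    """Start a daemonic process and report what happens when it spawns a child."""
    ctx = mp.get_context("fork")  # assumption: POSIX, "fork" start method available
    parent_conn, child_conn = ctx.Pipe()
    daemon = ctx.Process(target=_try_to_spawn, args=(child_conn,), daemon=True)
    daemon.start()
    daemon.join(timeout=30)
    return parent_conn.recv() if parent_conn.poll(5) else "no response"


if __name__ == "__main__":
    print(spawn_from_daemon())
```

If this is the mechanism at play here, the failing call would be one in which a Dask worker that is itself a daemonic process tries to start further processes, which matches the `distributed.nanny` "Failed to start process" lines in the logs.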

Expected Behavior

The Dask engine should support all typical Pandas operations.

Error Logs

```python-traceback
python -m rembe.benchmarks.minimal_test
Fine worked fine
UserWarning: Port 8787 is already in use. Perhaps you already have a cluster running? Hosting the HTTP server on port 44471 instead
2024-07-17 22:23:59,043 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/site-packages/distributed/nanny.py", line 752, in start
    await self.process.start()
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/site-packages/distributed/process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/site-packages/distributed/process.py", line 215, in _start
    process.start()
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: daemonic processes are not allowed to have children
2024-07-17 22:23:59,045 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/site-packages/distributed/nanny.py", line 752, in start
    await self.process.start()
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/site-packages/distributed/process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/site-packages/distributed/process.py", line 215, in _start
    process.start()
  File "/z/home/mxak/miniforge3/envs/datamesh_no_dl/lib/python3.12/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: daemonic processes are not allowed to have children
```

Installed Versions

INSTALLED VERSIONS
------------------
commit : 4815bc32a0ec54965962d03303c93b3498adddf4
python : 3.12.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-113-generic
Version : #123~20.04.1-Ubuntu SMP Wed Jun 12 17:33:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_DK.UTF-8
LOCALE : en_DK.UTF-8

Modin dependencies
------------------
modin : 0.31.0+4.g4815bc32
ray : None
dask : 2024.7.0
distributed : 2024.7.0

pandas dependencies
-------------------
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 70.3.0
pip : 24.0
Cython : None
pytest : 8.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.31
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

devin-petersohn commented 1 month ago

Great catch @data-makerman. I can reproduce this locally for Dask, but this type of error is not happening in Ray.