modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

BUG: [RAY] ray initialisation sets _memory and object_store_memory to the same value, leading to crashes and less flexibility #7361

Open Liquidmasl opened 1 month ago

Liquidmasl commented 1 month ago

Modin version checks

Reproducible Example

# set a breakpoint in ...\modin\core\execution\ray\common\utils.py line 138

import modin.pandas as pd
df = pd.DataFrame()

# check the contents of ray_init_kwargs,
# or just read the code there:

            # object_store_memory = _get_object_store_memory()
            # ray_init_kwargs = {
            #     "num_cpus": CpuCount.get(),
            #     "num_gpus": GpuCount.get(),
            #     "include_dashboard": False,
            #     "ignore_reinit_error": True,
            #     "object_store_memory": object_store_memory,
            #     "_redis_password": redis_password,
            #     "_memory": object_store_memory,
            #     "resources": RayInitCustomResources.get(),
            #     **extra_init_kw,
            # }

Issue Description

Modin sets _memory and object_store_memory to the same value. This not only leads to instability and crashes but also reduces flexibility, because _memory can be set higher than the available shared memory while object_store_memory cannot.

Many of the issues I faced over the last few days with read_parquet() (although that one still fills up RAM until my PC crashes), to_parquet(), concat(), etc. stemmed from the same problem: when the object store was full and a spill was attempted, a write violation happened and a raylet died.

I noticed that Modin runs a lot more stably when ray.init() is called manually. This is because in that case the two values are not set to the same value by default.

Also, it would be great if the Ray dashboard were not disabled by default with no way to enable it when initialising through Modin. But I digress.

Expected Behavior

If no manual configuration was done and no env variables were set, the default Ray init should be used. And if not the default, then not something this debilitating.

After initializing Ray manually and just setting _memory to something much larger, things started working. Setting MODIN_MEMORY to something higher while using Modin's initialisation did not work, because it led to a value error from Ray stating that object_store_memory can't be set that high (even though I never cared about object_store_memory).
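For anyone hitting the same problem, here is a minimal sketch of the manual workaround described above. It assumes Modin reuses a Ray instance that is already running, as suggested in this thread. The helper names and the 4 GiB / 16 GiB sizes are made up for illustration and are not Modin defaults or recommendations, and enabling the dashboard may require Ray's optional dashboard dependencies.

```python
GIB = 1024 ** 3


def build_ray_init_kwargs(object_store_gib, heap_gib, dashboard=False):
    """Build ray.init() keyword arguments with two independent memory limits.

    Hypothetical helper: the point is only that _memory and
    object_store_memory are chosen separately instead of being equal.
    """
    if object_store_gib <= 0 or heap_gib <= 0:
        raise ValueError("memory sizes must be positive")
    return {
        "object_store_memory": int(object_store_gib * GIB),
        "_memory": int(heap_gib * GIB),
        "include_dashboard": dashboard,
        "ignore_reinit_error": True,
    }


def start_ray_then_modin(object_store_gib=4, heap_gib=16, dashboard=False):
    """Start Ray first, then import Modin so it finds Ray already running."""
    import ray  # imported lazily so the helper above works without Ray

    ray.init(**build_ray_init_kwargs(object_store_gib, heap_gib, dashboard))

    import modin.pandas as pd

    # Assumption: Modin skips its own ray.init() because Ray is initialized.
    return pd.DataFrame()
```

Call start_ray_then_modin() once at program start, before any other Modin import, and adjust the two sizes to the machine at hand.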


Installed Versions

INSTALLED VERSIONS
------------------
commit : c8bbca8e4e00c681370e3736b2f73bb0352408c3
python : 3.11.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Austria.1252

Modin dependencies
------------------
modin : 0.31.0
ray : 2.34.0
dask : 2024.7.1
distributed : 2024.7.1

pandas dependencies
-------------------
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 68.2.2
pip : 24.1.2
Cython : 0.29.37
pytest : 8.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.8.2
numba : 0.60.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : 2.0.29
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Retribution98 commented 1 month ago

Hi @Liquidmasl

Modin has these default values because it helps to achieve good performance in general. If you have a specific case and Modin's configuration variables don't help you, you can initialize ray yourself.

Liquidmasl commented 1 month ago

I see. I understand my experience does not by any means stand for everyone's. But with these defaults I had numerous bluescreens, freezes and crashes, all in all making debugging and figuring this out a lot more troublesome than necessary.

I did not want to initialize Ray myself precisely because I thought Modin would know best, but it gave me no option to adapt just the two values that led to issues for me (_memory and include_dashboard).

If you think the current defaults work fine most of the time and my situation is an outlier, fair enough! I still think introducing config params or env vars that allow setting _memory, object_store_memory and include_dashboard manually while still relying on Modin's Ray initialisation would be good. As I understand it, it's a relatively new feature of Modin that it initialises Ray itself, so maybe there will be some changes along the way anyway. For now, now that I understand that, it's fine to initialize Ray manually.
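To make the suggestion concrete, here is a rough sketch of what such overrides could look like. The EXAMPLE_* environment variable names and both functions are hypothetical, invented for this illustration. They are not existing Modin options and this is not how Modin is implemented.

```python
import os

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}

# Hypothetical variable names mapped to the ray.init() keyword they would set.
_MEMORY_VARS = (
    ("EXAMPLE_RAY_HEAP_MEMORY", "_memory"),
    ("EXAMPLE_RAY_OBJECT_STORE_MEMORY", "object_store_memory"),
)


def ray_overrides_from_env(env=None):
    """Collect user overrides for ray.init() from (hypothetical) env vars."""
    env = os.environ if env is None else env
    overrides = {}
    for var, key in _MEMORY_VARS:
        raw = env.get(var)
        if raw is not None:
            value = int(raw)  # size in bytes
            if value <= 0:
                raise ValueError(f"{var} must be a positive number of bytes")
            overrides[key] = value
    raw = env.get("EXAMPLE_RAY_DASHBOARD")
    if raw is not None:
        flag = raw.strip().lower()
        if flag in _TRUE:
            overrides["include_dashboard"] = True
        elif flag in _FALSE:
            overrides["include_dashboard"] = False
        else:
            raise ValueError(f"EXAMPLE_RAY_DASHBOARD has invalid value {raw!r}")
    return overrides


def merge_init_kwargs(defaults, overrides):
    """Return a new dict in which user overrides win over the defaults."""
    return {**defaults, **overrides}
```

With something like this, the defaults would stay untouched for everyone who sets nothing, while a single variable would be enough to decouple _memory from object_store_memory or to turn the dashboard on.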