modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.74k stars 651 forks

BUG: Out of memory when reading a big file and dumping to a new one #7363

Open wanghaisheng opened 1 month ago

wanghaisheng commented 1 month ago

Modin version checks

Reproducible Example

import os

import modin.pandas as pd

inputfilepath = "top-domains-1m-in.csv"

# Environment variable values must be strings; set this before Modin starts Ray.
os.environ["RAY_memory_usage_threshold"] = "0.9"

# Read the ~2 GB CSV file.
df = pd.read_csv(inputfilepath, encoding="ISO-8859-1")

Issue Description

My file is almost 2 GB. I tried to set os.environ["RAY_memory_usage_threshold"] = 0.9, but it fails with an error saying a float is not supported. After some filtering, the to_csv dump gives me a memory error.
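
The "float not supported" part can be reproduced without Modin or Ray, because `os.environ` only accepts string values. The following is a minimal sketch of that behavior; whether Ray then honors the threshold depends on the variable being set before Ray starts, which is an assumption here and not something this snippet checks.

```python
import os

# os.environ accepts only str values, so assigning a float raises TypeError.
try:
    os.environ["RAY_memory_usage_threshold"] = 0.9
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)

# Passing the same value as a string is accepted.
os.environ["RAY_memory_usage_threshold"] = "0.9"
print(os.environ["RAY_memory_usage_threshold"])
```

Note that this only fixes how the variable is set; it does not by itself explain the MemoryError raised later in to_csv.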

Expected Behavior

It should use at most 90% of my laptop's memory.

Error Logs

```python-traceback
2024-08-07 13:07:41,906 INFO worker.py:1781 -- Started a local Ray instance.
UserWarning: `read_*` implementation has mismatches with pandas:
Data types of partitions are different! Please refer to the troubleshooting section of the Modin documentation to fix this issue.
UserWarning: is not currently supported by PandasOnRay, defaulting to pandas implementation.
Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.
(_remote_exec_multi_chain pid=20536)
(_remote_exec_multi_chain pid=20536) Traceback (most recent call last):
(_remote_exec_multi_chain pid=20536)   File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 423, in deserialize_objects
(_remote_exec_multi_chain pid=20536)     obj = self._deserialize_object(data, metadata, object_ref)
(_remote_exec_multi_chain pid=20536)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_remote_exec_multi_chain pid=20536)   File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 280, in _deserialize_object
(_remote_exec_multi_chain pid=20536)     return self._deserialize_msgpack_data(data, metadata_fields)
(_remote_exec_multi_chain pid=20536)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_remote_exec_multi_chain pid=20536)   File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 235, in _deserialize_msgpack_data
(_remote_exec_multi_chain pid=20536)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(_remote_exec_multi_chain pid=20536)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_remote_exec_multi_chain pid=20536)   File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 225, in _deserialize_pickle5_data
(_remote_exec_multi_chain pid=20536)     obj = pickle.loads(in_band)
(_remote_exec_multi_chain pid=20536)           ^^^^^^^^^^^^^^^^^^^^^
(_remote_exec_multi_chain pid=20536) MemoryError
(_remote_exec_multi_chain pid=6340)
(_remote_exec_multi_chain pid=6340)     obj = pickle.loads(in_band, buffers=buffers)
(_remote_exec_multi_chain pid=6340)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---------------------------------------------------------------------------
RayTaskError(RaySystemError)              Traceback (most recent call last)
Cell In[1], line 41
     33 # filtered_df = df[df['indexdate'] != 'unk']
     34 # filtered_df = df[df['indexdate'].str.contains('month', case=False, na=False)]
     35 # filtered_df = df[df['indexdate'].str.contains('1 year', case=False, na=False)]
   (...)
     38 # filtered_df = df[df['indexdate'].str.contains('2 years', case=False, na=False)]
     39 # filtered_df = df[df['domain'].str.contains('ai', case=False, na=False)]
     40 filtered_df = df[df['Intheirownwords'].str.contains(' ai ', case=False, na=False)]
---> 41 filtered_df.to_csv('domain-ai-in-title.csv')
     43 filtered_df = filtered_df[filtered_df['domain'].isin(rankdomains)]
     44 filtered_df.to_csv('top-4m-domain-ai-in-title.csv')

File d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\modin\logging\logger_decorator.py:144, in enable_logging..decorator..run_and_log(*args, **kwargs)
    129 """
    130 Compute function with logging if Modin logging is enabled.
    131
   (...)
    141 Any
    142 """
    143 if LogMode.get() == "disable":
--> 144     return obj(*args, **kwargs)
    146 logger = get_logger()
    147 logger.log(log_level, start_line)
...
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 225, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
          ^^^^^^^^^^^^^^^^^^^^^
MemoryError
```

Installed Versions

INSTALLED VERSIONS
------------------
commit : c8bbca8e4e00c681370e3736b2f73bb0352408c3
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : AMD64 Family 23 Model 96 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936

Modin dependencies
------------------
modin : 0.31.0
ray : 2.34.0
dask : 2024.7.1
distributed : None

pandas dependencies
-------------------
...
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Retribution98 commented 1 month ago

Hi @wanghaisheng

Sorry, I'm not sure I understood you correctly. Modin is not responsible for the Ray parameter RAY_memory_usage_threshold.

Your reproducer seems to be correct. Please contact Ray for more information on this.

I might also suggest that you use the Modin configuration variable to limit the memory used:

import modin.config as cfg

cfg.Memory.put(2 * 2**30)

or

import os

os.environ["MODIN_MEMORY"] = "2147483648"
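
Both snippets above request the same limit: 2 * 2**30 bytes is 2 GiB, which is 2147483648. Below is a small sketch of that conversion. The helper `gib_to_bytes` is hypothetical (it is not part of Modin), and the snippet assumes the environment variable is set before Modin starts its execution engine.

```python
import os


def gib_to_bytes(gib: float) -> int:
    """Convert GiB to bytes (1 GiB = 2**30 bytes). Hypothetical helper, not a Modin API."""
    return int(gib * 2**30)


limit = gib_to_bytes(2)

# Environment variable values must be strings, so convert the byte count first.
os.environ["MODIN_MEMORY"] = str(limit)

print(limit)  # 2147483648
```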