modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

ValueError: The 'nrows' option is not supported with the 'pyarrow' engine #7321

Closed: azhuvath closed this issue 1 week ago

azhuvath commented 1 week ago

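The pandas.read_csv method supports the 'c', 'python', and 'pyarrow' engines. It looks like Modin's read_csv fails when the 'pyarrow' engine is used: the call below does not pass nrows, yet it raises a ValueError about the 'nrows' option.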

Intel(R) Extension for Scikit-learn enabled (https://github.com/intel/scikit-learn-intelex)
2024-06-18 05:01:34,908 INFO worker.py:1753 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/home/ad/anomaly_detection.py", line 41, in <module>
    model_fitting()
  File "/home/ad/anomaly_detection.py", line 37, in model_fitting
    raise e
  File "/home/ad/anomaly_detection.py", line 14, in model_fitting
    data_csv = pd.read_csv('./data.csv', engine='pyarrow')
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/utils.py", line 511, in wrapped
    return func(*params.args, **params.kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 125, in run_and_log
    return obj(*args, **kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/pandas/io.py", line 227, in read_csv
    return _read(**kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/pandas/io.py", line 117, in _read
    pd_obj = FactoryDispatcher.read_csv(**kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/core/execution/dispatching/factories/dispatcher.py", line 207, in read_csv
    return cls.get_factory()._read_csv(**kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/core/execution/dispatching/factories/factories.py", line 268, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 125, in run_and_log
    return obj(*args, **kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/core/io/file_dispatcher.py", line 159, in read
    query_compiler = cls._read(*args, **kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 125, in run_and_log
    return obj(*args, **kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/core/io/text/text_file_dispatcher.py", line 1068, in _read
    pd_df_metadata = cls.read_callback(
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 125, in run_and_log
    return obj(*args, **kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/modin/core/storage_formats/pandas/parsers.py", line 381, in read_callback
    return pandas.read_csv(*args, **kwargs)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1607, in __init__
    options = self._get_options_with_defaults(engine)
  File "/home/ad/analytics_env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1643, in _get_options_with_defaults
    raise ValueError(
ValueError: The 'nrows' option is not supported with the 'pyarrow' engine

YarShev commented 1 week ago

@anmyachev, could you take a look at this?

anmyachev commented 1 week ago

Hi @azhuvath! Could you provide your modin, pandas, and pyarrow versions?

anmyachev commented 1 week ago

The current status of read_csv parameters that the pyarrow engine does not support is tracked in https://github.com/pandas-dev/pandas/issues/38872. Upstream pandas does not support the nrows parameter with the pyarrow engine.
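
Until this is fixed, one possible workaround is to retry the read without the engine argument when the requested engine rejects an option. The sketch below is only an illustration: `read_csv_with_engine_fallback` is a hypothetical helper, not part of Modin or pandas, and it assumes the error message keeps the wording shown in the traceback above. The reader is passed in as an argument, so either `pandas.read_csv` or `modin.pandas.read_csv` could be used.

```python
from typing import Any, Callable


def read_csv_with_engine_fallback(
    read_csv: Callable[..., Any],
    path: str,
    engine: str = "pyarrow",
    **kwargs: Any,
) -> Any:
    """Hypothetical helper: try `engine` first, then the reader's default engine.

    `read_csv` is any callable with a pandas-like signature, for example
    `pandas.read_csv` or `modin.pandas.read_csv`.
    """
    try:
        return read_csv(path, engine=engine, **kwargs)
    except ValueError as err:
        # Assumption: the message matches the one in the traceback, e.g.
        # "The 'nrows' option is not supported with the 'pyarrow' engine".
        if "not supported with the" not in str(err):
            raise  # unrelated ValueError, do not hide it
        # Retry without `engine`, so the reader picks its default parser.
        return read_csv(path, **kwargs)
```

With Modin this might be called as `read_csv_with_engine_fallback(pd.read_csv, './data.csv')` after `import modin.pandas as pd`. The fallback gives up whatever the pyarrow parser would have offered, so it is a stopgap rather than a fix.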

anmyachev commented 1 week ago

I found an issue in Modin.