modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

read_excel function fails when header is set to None #4924

Closed toan-quach closed 1 year ago

toan-quach commented 2 years ago

System information

Describe the problem

The read_excel function works fine when I don't pass the header parameter, but it fails when I set it to None with header=None. I have tested this with one Excel file that contains a column of only numbers, and with another Excel file whose column has a string in the 1st row and numbers in the rest (the 1st row should be treated as a normal data row along with the rest).

Source code / logs

Source code:

Case 1

import modin.pandas as pd

data = pd.read_excel('example.xlsx', header=None)
data

Case 2

import modin.pandas as pd

data = pd.read_excel('example_2.xlsx', header=None)
data

Excel files to reproduce: example.xlsx, example_2.xlsx

Log:

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/pandas/dataframe.py", line 215, in __repr__
    result = repr(self._build_repr_df(num_rows, num_cols))
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/pandas/base.py", line 203, in _build_repr_df
    return self.iloc[indexer]._query_compiler.to_pandas()
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 259, in to_pandas
    return self._modin_frame.to_pandas()
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 115, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2840, in to_pandas
    ErrorMessage.catch_bugs_and_request_email(
  File "/Users/shiro/.local/share/virtualenvs/taipy-core-fdyg53sb/lib/python3.9/site-packages/modin/error_message.py", line 70, in catch_bugs_and_request_email
    raise Exception(
Exception: Internal Error. Please visit https://github.com/modin-project/modin/issues to file an issue with the traceback and the command that caused this error. If you can't file a GitHub issue, please email bug_reports@modin.org.
Internal and external indices on axis 1 do not match.

data = pd.read_excel('tests/data_sample/example.xlsx', header=None)
UserWarning: Parallel read_excel is a new feature! If you run into any problems, please visit https://github.com/modin-project/modin/issues. If you find a new issue and can't file it on GitHub, please email bug_reports@modin.org.

mvashishtha commented 2 years ago

@toan-quach thank you for reporting this issue. I can reproduce it at version 14edd1ca99e7b9e836285dfac901b2e89ed93644. If I can see a quick fix, I'll make it now.

mvashishtha commented 2 years ago

I can't tell what's going wrong. We'll have to start by understanding how pandas interprets header=None.
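As a point of reference for that question, here is a minimal sketch of what header=None means in pandas itself. It uses pandas.read_csv on an in-memory string as a stand-in, on the assumption that read_excel treats the header parameter the same way (this avoids needing an Excel engine and a file on disk). The sample data is made up to mirror the second case in the report, where the first row holds a string and the rest hold numbers. It says nothing about where the Modin error comes from.

```python
import io

import pandas

# Hypothetical sample data: the first row is a string, the rest are numbers.
RAW = "label,x\n1,2\n3,4\n"

# Default behaviour: the first row is consumed as the column labels.
with_header = pandas.read_csv(io.StringIO(RAW))

# header=None: no row is used for labels. pandas assigns integer column
# labels 0..n-1 and keeps the first row as an ordinary data row.
no_header = pandas.read_csv(io.StringIO(RAW), header=None)

print(list(with_header.columns), with_header.shape)
print(list(no_header.columns), no_header.shape)
```

With header=None the frame has one more row than the default read, and its column labels are the integers 0 and 1 rather than the strings from the first row.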

@toan-quach we will try to fix this soon! Until the bug is fixed, you can work around it by reading the data into a pandas dataframe pdf with pandas.read_excel, then converting that dataframe to a Modin dataframe with modin.pandas.DataFrame(pdf).
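The workaround described above can be sketched roughly as follows. The helper name read_excel_via_pandas is hypothetical (it is not part of Modin or pandas); it only chains the two calls named in the comment, pandas.read_excel and modin.pandas.DataFrame.

```python
import pandas


def read_excel_via_pandas(path, **read_excel_kwargs):
    """Read an Excel file with plain pandas, then wrap it in a Modin frame.

    Hypothetical helper for illustration; not part of Modin or pandas.
    """
    # Imported here so the helper can be defined where only pandas is installed.
    import modin.pandas as mpd

    # Step 1: let pandas parse the file (for example with header=None).
    pdf = pandas.read_excel(path, **read_excel_kwargs)
    # Step 2: convert the pandas dataframe to a Modin dataframe.
    return mpd.DataFrame(pdf)
```

Usage would look like data = read_excel_via_pandas('example.xlsx', header=None). The read itself goes through pandas rather than Modin's parallel read_excel, so this trades read speed for correctness until the bug is fixed.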

toan-quach commented 2 years ago

Awesome news, thanks for the quick response! Please let me know if I can help in any way 😄

toan-quach commented 1 year ago

Hi @mvashishtha, may I ask if there are any updates on this issue? Thanks!!! 😄

mvashishtha commented 1 year ago

@toan-quach unfortunately no one has taken on this issue yet, so I think it hasn't been fixed. You can watch this issue for updates.

toan-quach commented 1 year ago

@mvashishtha Thanks for the quick response!!! That's unfortunate 😞