pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_csv with `memory_map=True` on BytesIO object fails #45630

Open RehanSD opened 2 years ago

RehanSD commented 2 years ago

Pandas version checks

Reproducible Example

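import pandas as pd
from io import BytesIO
df = pd.DataFrame([[1, 2]])
bio = BytesIO()
df.to_csv(bio)  # write the CSV into the in-memory buffer
bio.seek(0)  # rewind so read_csv starts at the beginning
pd.read_csv(bio, memory_map=True)  # raises UnsupportedOperation on pandas 1.4.0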

Issue Description

The read_csv call fails with this error:

UnsupportedOperation                      Traceback (most recent call last)
<ipython-input-7-c255f3bb77b8> in <module>
----> 1 pd.read_csv(bio, memory_map=True)

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312
    313         return wrapper

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    678     kwds.update(kwds_defaults)
    679
--> 680     return _read(filepath_or_buffer, kwds)
    681
    682

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    573
    574     # Create the parser.
--> 575     parser = TextFileReader(filepath_or_buffer, **kwds)
    576
    577     if chunksize or iterator:

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    931
    932         self.handles: IOHandles | None = None
--> 933         self._engine = self._make_engine(f, self.engine)
    934
    935     def close(self):

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
   1215             # "Union[str, PathLike[str], ReadCsvBuffer[bytes], ReadCsvBuffer[str]]"
   1216             # , "str", "bool", "Any", "Any", "Any", "Any", "Any"
-> 1217             self.handles = get_handle(  # type: ignore[call-overload]
   1218                 f,
   1219                 mode,

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    680
    681     # memory mapping needs to be the first step
--> 682     handle, memory_map, handles = _maybe_memory_map(
    683         handle,
    684         memory_map,

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py in _maybe_memory_map(handle, memory_map, encoding, mode, errors, decode)
   1085         wrapped = cast(
   1086             BaseBuffer,
-> 1087             _MMapWrapper(handle, encoding, errors, decode),  # type: ignore[arg-type]
   1088         )
   1089     finally:

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py in __init__(self, f, encoding, errors, decode)
    959                 continue
    960             self.attributes[attribute] = getattr(f, attribute)()
--> 961         self.mmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    962
    963     def __getattr__(self, name: str):

UnsupportedOperation: fileno

The error appears to occur because fileno is called on a BytesIO object. This code works in pandas 1.3.4, so I compared the sources and noticed that, in _maybe_memory_map in common.py, the except clause around instantiating the _MMapWrapper was removed. (Old common.py for reference).
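The underlying failure can be reproduced without pandas. The sketch below uses only the standard library and assumes nothing about pandas internals beyond the `mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)` call shown in the traceback; `try_mmap` is a hypothetical helper name used for illustration.

```python
import io
import mmap

# A small in-memory CSV, similar to what df.to_csv(bio) would produce.
buf = io.BytesIO(b",0,1\n0,1,2\n")


def try_mmap(handle):
    """Return an mmap over `handle`, or the exception raised while trying."""
    try:
        # Same call as in the traceback: mmap needs an OS-level file descriptor.
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
    except io.UnsupportedOperation as exc:
        return exc


result = try_mmap(buf)
print(type(result).__name__, result)
```

BytesIO holds its data in process memory and has no file descriptor, so `fileno()` raises before `mmap.mmap` is ever reached.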

Expected Behavior

I would expect the read_csv to succeed and the DataFrame to be read.

Installed Versions

INSTALLED VERSIONS
------------------
commit : bb1f651536508cdfef8550f93ace7849b00046ee
python : 3.8.12.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.0
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.0.4
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : 4.3.1
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.6.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.30.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : 2021.11.1
gcsfs : None
matplotlib : 3.2.2
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.16.0
pyarrow : 3.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2021.11.1
scipy : 1.7.3
sqlalchemy : 1.4.27
tables : 3.6.1
tabulate : None
xarray : 0.20.1
xlrd : 2.0.1
xlwt : None
zstandard : None

twoertwein commented 2 years ago

I don't think there is a point in using memory_map with BytesIO. The data is already in memory.

This was purposefully changed in #44777 and is mentioned in the whatsnew for 1.4.0. Silently failing (previous behavior) gives the illusion that it worked, even though it was never using mmap for BytesIO.
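For callers that accept both paths and in-memory buffers, one possible workaround is to request memory mapping only when the source is a path. This is a minimal sketch, not part of pandas: `should_memory_map` is a hypothetical helper, and it assumes any string passed in is a local file path rather than a URL.

```python
import io
import os


def should_memory_map(source) -> bool:
    """Hypothetical helper: True only for filesystem paths.

    In-memory buffers such as BytesIO already hold their data in memory,
    so memory mapping them has no benefit. Assumes strings are local paths.
    """
    return isinstance(source, (str, os.PathLike))


path_flag = should_memory_map("data.csv")
buffer_flag = should_memory_map(io.BytesIO(b",0,1\n0,1,2\n"))
print(path_flag, buffer_flag)
```

The result could then be passed through as `pd.read_csv(source, memory_map=should_memory_map(source))`. The check is deliberately conservative: open file objects backed by a real file would also return False here.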

RehanSD commented 2 years ago

Thank you @twoertwein! I just wanted to confirm that this was expected behavior! I'm wondering though if it makes sense to include a more descriptive error message so people know what's going on?

twoertwein commented 2 years ago

Catching UnsupportedOperation (and maybe AttributeError) and then re-raising with a clearer message might be worth it.
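A rough sketch of that idea, written against the standard library rather than pandas internals, might look like the following. The helper name `mmap_or_explain`, the choice of `ValueError`, and the message wording are all assumptions made for illustration; they are not what pandas does or will do.

```python
import io
import mmap
import tempfile


def mmap_or_explain(handle):
    """Hypothetical helper: memory-map `handle` or raise a descriptive error."""
    try:
        fd = handle.fileno()
    except (io.UnsupportedOperation, AttributeError) as err:
        # In-memory buffers such as BytesIO have no file descriptor.
        raise ValueError(
            "memory_map=True requires a handle backed by a file on disk; "
            f"got {type(handle).__name__}, which has no file descriptor"
        ) from err
    return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)


# An in-memory buffer now fails with an explanatory message.
try:
    mmap_or_explain(io.BytesIO(b"a,b\n1,2\n"))
except ValueError as exc:
    message = str(exc)
    print(message)

# A handle backed by a real file still maps as before.
with tempfile.TemporaryFile() as tmp:
    tmp.write(b"a,b\n1,2\n")
    tmp.flush()
    with mmap_or_explain(tmp) as mapped:
        first_line = mapped.readline()
```

Chaining with `from err` keeps the original UnsupportedOperation visible in the traceback, so the clearer message does not hide the root cause.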