pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

pandas.read_fwf doesn't work with skiprows=callable #20603

Open ghost opened 6 years ago

ghost commented 6 years ago

Code Sample, a copy-pastable example if possible

import pandas

table = """\
id8141    360.242940   149.910199   11950.7
id1594    444.953632   166.985655   11788.4
id1849    364.136849   183.628767   11806.2
id1230    413.836124   184.375703   11916.8
id1948    502.953953   173.237159   12468.3"""

fwf_path = "fwf.dat"
with open(fwf_path, "wb") as fh:
    fh.write(table.encode('utf8'))

def should_skip(row):
    return True  # placeholder; normally a real condition on the row index

table = pandas.read_fwf(fwf_path, skiprows=should_skip)

Problem description

Traceback (most recent call last):
  File "fwf_bug.py", line 19, in <module>
    table = pandas.read_fwf(fwf_path, skiprows=should_skip)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 741, in read_fwf
    return _read(filepath_or_buffer, kwds)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 449, in _read
    parser = TextFileReader(filepath_or_buffer, kwds)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 818, in __init__
    self._make_engine(self.engine)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 1059, in _make_engine
    self._engine = klass(self.f, self.options)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 3412, in __init__
    PythonParser.__init__(self, f, **kwds)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 2079, in __init__
    self._make_reader(f)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 3416, in _make_reader
    self.comment, self.skiprows)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 3316, in __init__
    self.colspecs = self.detect_colspecs(skiprows=skiprows)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 3373, in detect_colspecs
    rows = self.get_rows(n, skiprows)
  File "some-path/_venv/lib/python3.5/site-packages/pandas/io/parsers.py", line 3361, in get_rows
    if i not in skiprows:
TypeError: argument of type 'function' is not iterable
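
The last frame of the traceback can be reproduced without pandas. The snippet below is only an illustration of the failing expression, not pandas code: it applies the same membership test to a plain function, which defines neither `__contains__` nor `__iter__`. The exact wording of the error message may vary between Python versions, so only the exception type is checked.

```python
def should_skip(row_index):
    return True  # placeholder condition, as in the report above


raised = False
try:
    # Same shape as the `if i not in skiprows:` check in the traceback,
    # with a function standing in for `skiprows`.
    0 not in should_skip
except TypeError:
    raised = True

print(raised)                 # the membership test raised TypeError
print(callable(should_skip))  # a check the reader could use to branch
```

A `callable()` check before the membership test is one way a reader could tell the two kinds of `skiprows` apart.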

Expected Output

Script runs successfully

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.3.0
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
WillAyd commented 6 years ago

Hmm, yeah, it looks like each of the parsers evaluates the callable at a different point in time. Unfortunately, the reader being used here doesn't look like it ever calls it.
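
Until the reader handles callables, one possible workaround is to evaluate the callable up front and pass explicit row indices, since the membership check in the traceback works on any container of integers. The helper below is a sketch: `resolve_skiprows` is a hypothetical name, not part of pandas, and it assumes that row indices correspond to physical lines in the file.

```python
import os
import tempfile


def resolve_skiprows(path, should_skip):
    """Hypothetical helper: turn a skiprows callable into a list of indices."""
    with open(path, "rb") as fh:
        # Enumerate physical lines and keep the indices the callable rejects.
        return [i for i, _ in enumerate(fh) if should_skip(i)]


# Small demo on a throwaway file with four lines.
with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as fh:
    fh.write(b"row0\nrow1\nrow2\nrow3\n")
    demo_path = fh.name

skip = resolve_skiprows(demo_path, lambda i: i < 2)
os.remove(demo_path)
print(skip)  # [0, 1]

# Intended use (not executed here):
#   pandas.read_fwf(fwf_path, skiprows=resolve_skiprows(fwf_path, should_skip))
```

This reads the file once more than strictly necessary, which is usually acceptable for small fixed-width files but may not be for very large ones.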

I suppose a simple fix here would be to evaluate the callable in the FixedWidthReader class. It could probably be done in get_rows as there is already some modification of the object being done there (linked below for ref) - want to submit a PR (with tests) for this?

https://github.com/pandas-dev/pandas/blob/6d610a4d9393c0d0335267dc3252ccabb9e51e43/pandas/io/parsers.py#L3378
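
A rough sketch of what evaluating the callable in `get_rows` could look like. This is a standalone function written for illustration, not the actual `FixedWidthReader.get_rows` method: the `lines` parameter and the `is_skipped` name are assumptions, and the real method also manages state on the reader that is left out here.

```python
def get_rows(lines, n, skiprows=None):
    """Return up to `n` rows from `lines` whose index is not skipped.

    `skiprows` may be None, a collection of row indices, or a callable
    that takes a row index and returns True when the row should be skipped.
    """
    if skiprows is None:
        skiprows = set()

    if callable(skiprows):
        is_skipped = skiprows               # call it with each row index
    else:
        is_skipped = set(skiprows).__contains__  # plain membership test

    detect_rows = []
    for i, row in enumerate(lines):
        if not is_skipped(i):
            detect_rows.append(row)
        if len(detect_rows) >= n:
            break
    return detect_rows


rows = ["a", "b", "c", "d"]
print(get_rows(rows, 2, skiprows=lambda i: i % 2 == 0))  # ['b', 'd']
print(get_rows(rows, 2, skiprows=[0]))                   # ['b', 'c']
```

Normalizing both cases to a single predicate keeps the loop body identical for callables and list-like values, so tests for the PR could cover each form with the same expectations.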