pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Strict parsing for read_fwf #17245

Open eoghanmurray opened 7 years ago

eoghanmurray commented 7 years ago

I'm importing a fixed width file which has 2 types of records (each with their own definitions).

>>> print "good:\n", pandas.read_fwf(StringIO('T1001\nT1020'), 
    widths=[2,1,1,1], names=['TYPE', 'A', 'B', 'C'])
good:
  TYPE  A  B  C
0   T1  0  0  1
1   T1  0  2  0
>>> print "good:\n", pandas.read_fwf(StringIO('T2XY\nT2XZ'),  
    widths=[2,1,1], names=['TYPE', 'D', 'E'])
good:
  TYPE  D  E
0   T2  X  Y
1   T2  X  Z
>>> print "silently dropped data from first 2 rows:\n", pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1], names=['TYPE', 'D', 'E'])
silently dropped data from first 2 rows:
  TYPE  D  E
0   T1  0  0
1   T1  0  2
2   T2  X  Y
3   T2  X  Z
>>> print "unexpected NaN fields:\n", pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1,1], names=['TYPE', 'A', 'B', 'C'])
unexpected NaN fields:
  TYPE  A  B    C
0   T1  0  0  1.0
1   T1  0  2  0.0
2   T2  X  Y  NaN
3   T2  X  Z  NaN

Problem description

I expected that lines not matching the passed-in spec would result in a 'bad line' error for that line, and those lines could be ignored.

Expected Output

I expected these lines to raise an error, with the error_bad_lines option available to ignore the lines and show warnings instead (which could be turned off with warn_bad_lines).

>>> pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1], names=['TYPE', 'D', 'E'])
ParserError: Error tokenizing data. Expected 4 characters in line 1, saw 5

>>> pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1,1], names=['TYPE', 'A', 'B', 'C'])
ParserError: Error tokenizing data. Expected 5 characters in line 3, saw 4

>>> pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1], names=['TYPE', 'D', 'E'], error_bad_lines=False)
Skipping Line 1: expected 4 characters, saw 5
Skipping Line 2: expected 4 characters, saw 5

  TYPE  D  E
0   T2  X  Y
1   T2  X  Z

>>> pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1,1], names=['TYPE', 'A', 'B', 'C'], error_bad_lines=False)
Skipping Line 3: expected 5 characters, saw 4
Skipping Line 4: expected 5 characters, saw 4

  TYPE  A  B  C
0   T1  0  0  1
1   T1  0  2  0
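
Until read_fwf has a strict mode, the behaviour requested above can be approximated by checking line lengths before handing the text to pandas. The sketch below is only an illustration. read_fwf_strict and its skip_bad_lines flag are hypothetical names, not part of pandas, and it assumes that a valid line is exactly sum(widths) characters long (so trailing padding must not have been stripped). It is written for Python 3.

```python
from io import StringIO

import pandas as pd


def read_fwf_strict(text, widths, names, skip_bad_lines=False):
    """Parse fixed-width text, rejecting lines whose length != sum(widths).

    Hypothetical helper, not part of pandas. Returns the parsed frame and a
    list of (line_number, line) tuples for the lines that were skipped.
    """
    expected = sum(widths)
    good, skipped = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) == expected:
            good.append(line)
        elif skip_bad_lines:
            skipped.append((number, line))
        else:
            raise ValueError(
                f"Expected {expected} characters in line {number}, saw {len(line)}"
            )
    # Note: if every line was rejected, read_fwf is given empty input and
    # will raise its own error; that case is not handled in this sketch.
    frame = pd.read_fwf(
        StringIO("\n".join(good)), widths=widths, names=names, header=None
    )
    return frame, skipped


data = "T1001\nT1020\nT2XY\nT2XZ"
frame, skipped = read_fwf_strict(
    data, widths=[2, 1, 1], names=["TYPE", "D", "E"], skip_bad_lines=True
)
print(frame)
for number, line in skipped:
    print(f"Skipping line {number}: expected 4 characters, saw {len(line)}")
```

With skip_bad_lines left at False, the same call raises ValueError on line 1 instead of returning a frame, which mirrors the error proposed above.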

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-87-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
LOCALE: None.None
pandas: 0.20.3
pytest: 2.9.1
pip: 9.0.1
setuptools: 27.3.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2014.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.1.0b1
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.7.2
s3fs: None
pandas_gbq: None
pandas_datareader: None

gfyoung commented 7 years ago

I agree that the case of too many fields is a bug and should explicitly raise. As for cases with too few fields, I'm not sure I would agree. It's being interpreted as empty data (or NULL), which is perfectly acceptable.

eoghanmurray commented 7 years ago

Thanks! On the 'too few fields' point, maybe the example I gave was lacking. Instead of two sets of widths like [2,1,1,1] and [2,1,1], the real problem arises when the widths (and their meanings) are incompatible, e.g. [2,23,14,12,10] and [2,10,11,10].

(In the data I'm looking at, the difference between these two types is signified by the value of the first two characters of the line. Obviously, the different record types should have been exported to different files, but that is not the case for the legacy data I'm working with.)
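
Since the record type is carried in the first two characters of each line, one possible workaround is to group the lines by that prefix and parse each group with its own layout. The following is a minimal sketch under that assumption. LAYOUTS and read_mixed_fwf are hypothetical names, and the widths are the toy ones from the original example rather than the real legacy layouts.

```python
from io import StringIO

import pandas as pd

# Hypothetical layouts keyed by the two-character record type.
LAYOUTS = {
    "T1": {"widths": [2, 1, 1, 1], "names": ["TYPE", "A", "B", "C"]},
    "T2": {"widths": [2, 1, 1], "names": ["TYPE", "D", "E"]},
}


def read_mixed_fwf(text, layouts):
    """Group lines by their first two characters and parse each group separately."""
    groups = {}
    for line in text.splitlines():
        groups.setdefault(line[:2], []).append(line)

    frames = {}
    for record_type, lines in groups.items():
        layout = layouts[record_type]  # raises KeyError for an unknown record type
        frames[record_type] = pd.read_fwf(
            StringIO("\n".join(lines)),
            widths=layout["widths"],
            names=layout["names"],
            header=None,
        )
    return frames


frames = read_mixed_fwf("T1001\nT1020\nT2XY\nT2XZ", LAYOUTS)
print(frames["T1"])
print(frames["T2"])
```

This yields one DataFrame per record type, so no row is truncated or padded with NaN to fit the other type's columns.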

mechif commented 2 years ago

I tried the warn_bad_lines and error_bad_lines options - and received a warning that they are being deprecated:

 >>>  FutureWarning: The warn_bad_lines argument has been deprecated and will be removed in a future version

How can I read in a file (these are log files with set column widths) that may sometimes have badly formatted rows? I'd like to skip/delete these rows (but know that they exist) - because the other thousands of lines must be analyzed.

phofl commented 2 years ago

The warning tells you what to do:

FutureWarning: The warn_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future.

mechif commented 2 years ago

Thanks. This wasn't output in PyCharm 2021.3.1. I'll learn about on_bad_lines (which isn't documented in the read_fwf documentation).