pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.25k stars 17.79k forks source link

BUG: read_fwf modifies / corrupts object (string) whitespace data #51569

Open RonaldBarnes opened 1 year ago

RonaldBarnes commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
from io import StringIO

data = """
CompanyName Alice        7500.00 5  1 3 5 D F 
OrganizationBob          8790.00 6 1 4 A 9 BC 
"""

df = pd.read_fwf(StringIO(data),
  widths=[12,12,8,2,12],
  header=None,
  names=["Company", "Contact", "Pay_sum", "Pay_count", "Credit_score"],
  )
print(df.values)

[['CompanyName' 'Alice' 7500.0 5 '1 3 5 D F']
 ['Organization' 'Bob' 8790.0 6 '1 4 A 9 BC']]

Issue Description

Note in code sample that the Credit_score has had leading (and trailing) spaces removed. This is now irretrievably corrupted data.

Expected Behavior

Expected: The leading (at minimum) spaces are preserved, as they have significance.

The field width should match the widths or colspecs specified unless explicitly requesting otherwise.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 2e218d10984e9919f0296931d92ea851c6a6faf5 python : 3.10.6.final.0 python-bits : 64 OS : Linux OS-release : 5.19.0-32-generic Version : #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 30 17:03:34 UTC 2 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_CA.UTF-8 LOCALE : en_CA.UTF-8 pandas : 1.5.3 numpy : 1.23.5 pytz : 2022.1 dateutil : 2.8.2 setuptools : 59.6.0 pip : 22.0.2 Cython : 0.29.33 pytest : 7.2.1 hypothesis : 6.65.1 sphinx : 6.1.3 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.0.3 IPython : 8.8.0 pandas_datareader: None bs4 : 4.11.1 bottleneck : None brotli : 1.0.9 fastparquet : None fsspec : None gcsfs : None matplotlib : 3.6.3 numba : 0.56.4 numexpr : None odfpy : None openpyxl : 3.0.10 pandas_gbq : None pyarrow : 10.0.1 pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None
RonaldBarnes commented 1 year ago

Consider the following code:

import pandas as pd
from io import StringIO

data = ["left      ","  centre  ","     right"]

print(f"data: {data}")

## Preserves whitespace:
df = pd.DataFrame([data], columns=["One", "Two", "Three"])

## Preserves whitespace:
df_csv = pd.read_csv(
    StringIO(",".join(data)),
    header=None,
    names=["One", "Two", "Three"],
    )

## Corrupts whitespace:
df_fwf = pd.read_fwf(
    StringIO("".join(data)),
    header=None,
    widths=[10,10,10],
    names=["One", "Two", "Three"],
    )

print("DataFrame values:")
print(df.values)
print("read_csv values:")
print(df_csv.values)
print("read_fwf values:")
print(df_fwf.values)

data: ['left      ', '  centre  ', '     right']
DataFrame values:
[['left      ' '  centre  ' '     right']]
read_csv values:
[['left      ' '  centre  ' '     right']]
read_fwf values:
[['left' 'centre' 'right']]

The one dataframe creation method where I explicitly specify the column widths is the only method that changes my data when reading it.

RonaldBarnes commented 1 year ago

Currently, there is a bit of a work-around for this issue, but it's quite an "anti-pattern", and very counter-intuitive.

Adding a delimiter argument to the read_fwf call produces different output, despite there being no delimiters in the fixed-width file (pretty much by definition):

Using the code snippet from previous comment, but adding a delimiter (must be something that does not exist at the boundaries of fields!):

df_fwf_delim = pd.read_fwf(
    StringIO("".join(data)),
    header=None,
    widths=[10,10,10],
    names=["One", "Two", "Three"],
    delimiter="?",
    )

print("read_fwf DELIMITER values:")
print(df_fwf_delim.values))

read_fwf DELIMITER values:
[['left      ' '  centre  ' '     right']]

The source of this behaviour is found at:

https://github.com/pandas-dev/pandas/blob/3f3102b55458959481f4337e699ccd3b90460544/pandas/io/parsers/python_parser.py#L1182

https://github.com/pandas-dev/pandas/blob/3f3102b55458959481f4337e699ccd3b90460544/pandas/io/parsers/python_parser.py#L1281

It seems that read_fwf has been made to work with tables to the detriment of actual fixed-width files (a format predating YAML, JSON, CSV, etc.).

A read_table already exists, so I question whether this behaviour belongs in this function.

RonaldBarnes commented 1 year ago

Final note (for now):

There is no mention of delimiters for read_fwf in the API reference at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html

There is a confusing mention of them in the user_guide at https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-fwf-reader

delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., ‘~’).

It can be argued that trailing spaces are "filler characters", but not leading spaces.

Like in a Python script file, they have significance and should not be removed unless explicitely requested.

RonaldBarnes commented 1 year ago

take

d3sbar commented 1 year ago

I am using read_fwf function for data processing and I ran into a different behavior where:

Without specifying delimiter or lineterminator keywords -- the function works if file has more than 1 line (except it fails to read last line in a file), but fails when only 1 line in the text file (by default file has no header row): pd.read_fwf(my_file_path, widths= _widths)

Only when I specify delimiter AND lineterminator -- does the function work properly to read all lines in the file: pd.read_fwf(cclf_path, widths= _widths, header=None, delimiter='?', lineterminator='\n', on_bad_lines='Bad Line Error')

Seems to me that the "required TextReader kwargs" for read_fwf should either be documented (easy path to avoid regression) or the defaults need to re-visited for the purpose of processing a "fixed width file" (by definition no delimiters, typically no headers and many times, conditional column count based on a prefix value on the line, etc).

RonaldBarnes commented 1 year ago

Re-opening.

Still hoping to hear feedback on current behaviour & effectiveness of this patch in addressing it.