Open RonaldBarnes opened 1 year ago
Consider the following code:
import pandas as pd
from io import StringIO

data = ["left ", " centre ", " right"]
print(f"data: {data}")

## Preserves whitespace:
df = pd.DataFrame([data], columns=["One", "Two", "Three"])

## Preserves whitespace:
df_csv = pd.read_csv(
    StringIO(",".join(data)),
    header=None,
    names=["One", "Two", "Three"],
)

## Corrupts whitespace:
df_fwf = pd.read_fwf(
    StringIO("".join(data)),
    header=None,
    widths=[10, 10, 10],
    names=["One", "Two", "Three"],
)

print("DataFrame values:")
print(df.values)
print("read_csv values:")
print(df_csv.values)
print("read_fwf values:")
print(df_fwf.values)
data: ['left ', ' centre ', ' right']
DataFrame values:
[['left ' ' centre ' ' right']]
read_csv values:
[['left ' ' centre ' ' right']]
read_fwf values:
[['left' 'centre' 'right']]
The one dataframe creation method where I explicitly specify the column widths is the only method that changes my data when reading it.
Currently, there is a bit of a work-around for this issue, but it's quite an "anti-pattern", and very counter-intuitive.
Adding a delimiter argument to the read_fwf call produces different output, despite there being no delimiters in a fixed-width file (pretty much by definition):
Using the code snippet from the previous comment, but adding a delimiter (it must be something that does not exist at the boundaries of fields!):
df_fwf_delim = pd.read_fwf(
    StringIO("".join(data)),
    header=None,
    widths=[10, 10, 10],
    names=["One", "Two", "Three"],
    delimiter="?",
)
print("read_fwf DELIMITER values:")
print(df_fwf_delim.values)
read_fwf DELIMITER values:
[['left ' ' centre ' ' right']]
The source of this behaviour is found at:
It seems that read_fwf has been made to work with tables to the detriment of actual fixed-width files (a format predating YAML, JSON, CSV, etc.).
A read_table already exists, so I question whether this behaviour belongs in this function.
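The effect of the delimiter argument is consistent with each field being sliced out and then stripped of a set of filler characters at both ends. Below is a plain-Python sketch of that idea, not the pandas implementation: the name `strip_field` and the default filler set are assumptions made for illustration only.

```python
# Hypothetical stand-in for per-field stripping; this is not pandas code.
DEFAULT_FILLERS = "\n\r\t "  # assumption: whitespace is the default filler


def strip_field(raw, delimiter=None):
    """Strip filler characters from both ends of one fixed-width slice."""
    # Assumption: a user-supplied delimiter replaces space/tab as filler,
    # so only line endings and the delimiter itself get stripped.
    fillers = "\r\n" + delimiter if delimiter else DEFAULT_FILLERS
    return raw.strip(fillers)


fields = ["left ", " centre ", " right"]

default_result = [strip_field(f) for f in fields]
print(default_result)

delim_result = [strip_field(f, delimiter="?") for f in fields]
print(delim_result)
```

Under these assumptions, the default call loses the surrounding spaces, while passing a character that never occurs at a field boundary leaves the fields untouched, which matches the outputs reported above.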
Final note (for now):
There is no mention of delimiters for read_fwf in the API reference at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html
There is a confusing mention of them in the user_guide at https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-fwf-reader
delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., ‘~’).
It can be argued that trailing spaces are "filler characters", but not leading spaces.
As in a Python script file, leading spaces have significance and should not be removed unless explicitly requested.
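To make the distinction between trailing filler and leading spaces concrete, here is a minimal plain-Python sketch that slices a record by explicit widths and optionally removes only the right-hand padding. The function `split_fixed_width`, its `strip_trailing` flag, and the 10-character sample record are all hypothetical; nothing here is part of pandas.

```python
from itertools import accumulate


def split_fixed_width(line, widths, strip_trailing=True):
    """Slice one record by column widths, always keeping leading spaces."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    fields = [line[start:end] for start, end in zip(starts, ends)]
    if strip_trailing:
        # Remove right-hand padding only; leading spaces are treated as data.
        fields = [field.rstrip(" ") for field in fields]
    return fields


# Assumed sample record: three fields, each padded to 10 characters.
record = "left      " + "  centre  " + "     right"

trimmed = split_fixed_width(record, [10, 10, 10])
print(trimmed)

untouched = split_fixed_width(record, [10, 10, 10], strip_trailing=False)
print(untouched)
```

With `strip_trailing=False` every field comes back at exactly the requested width, which is the behaviour a fixed-width reader could offer as an explicit opt-out.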
take
I am using the read_fwf function for data processing and I ran into a different behavior:
Without the delimiter or lineterminator keywords, the function works if the file has more than one line (except that it fails to read the last line in the file), but fails when the text file has only one line (by default, the file has no header row):
pd.read_fwf(my_file_path, widths= _widths)
Only when I specify both delimiter AND lineterminator does the function read all lines in the file properly:
pd.read_fwf(cclf_path, widths= _widths, header=None, delimiter='?', lineterminator='\n', on_bad_lines='Bad Line Error')
It seems to me that the "required TextReader kwargs" for read_fwf should either be documented (the easy path to avoid regression) or the defaults need to be revisited for the purpose of processing a "fixed width file" (by definition no delimiters, typically no headers, and often a conditional column count based on a prefix value on the line, etc.).
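As an illustration of the "conditional column count based on a prefix value" case, here is a small plain-Python sketch that picks a layout per line and slices without stripping anything. The record types `H` and `D`, their widths, and the sample lines are hypothetical and only serve to show the shape of such files.

```python
from io import StringIO
from itertools import accumulate

# Hypothetical layouts keyed by the first character of each line.
LAYOUTS = {
    "H": [1, 8],      # header record: type, date
    "D": [1, 10, 6],  # detail record: type, name, amount
}


def parse_line(line):
    """Slice a line using the layout chosen by its one-character prefix."""
    widths = LAYOUTS[line[0]]
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    # No stripping: every character inside a slice is kept.
    return [line[start:end] for start, end in zip(starts, ends)]


# Sample file content with no header row and no trailing newline.
content = "H20240131\n" + "D" + "Alice".ljust(10) + "42".rjust(6)

rows = [parse_line(line) for line in StringIO(content).read().splitlines()]
print(rows)
```

Every line, including the last one without a newline, yields one row, and the number of columns follows the prefix rather than a single global width list.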
Re-opening.
Still hoping to hear feedback on current behaviour & effectiveness of this patch in addressing it.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Note in code sample that the Credit_score has had leading (and trailing) spaces removed. This is now irretrievably corrupted data.
Expected Behavior
Expected: The leading (at minimum) spaces are preserved, as they have significance.
The field width should match the widths or colspecs specified unless explicitly requested otherwise.
Installed Versions