Open DrMaphuse opened 2 years ago
As a workaround, I am using this code to read a fixed-width file. It's more than twice as fast as pandas' "read_fwf" for me when reading large files.
import polars as pl

# Read the whole file as a single string column, then peel one
# fixed-width field off the front per iteration.
df = pl.read_csv(file_path)
df.columns = ["full_str"]
for w, col in zip(fwidths, column_names):
    # with_column/str.rstrip were removed/renamed in newer polars;
    # with_columns and str.strip_chars_end are the current equivalents.
    df = df.with_columns(pl.col("full_str").str.slice(0, w).str.strip_chars_end().alias(col))
    df = df.with_columns(pl.col("full_str").str.slice(w, None).alias("full_str"))
df = df.drop("full_str")
$ cat fwf_test.txt
NAME STATE TELEPHONE
John Smith WA 418-Y11-4111
Mary Hartford CA 319-Z19-4341
Evan Nolan IL 219-532-c301
This will probably perform faster:
df = pl.read_csv(
    "fwf_test.txt",
    has_header=False,
    skip_rows=1,
    new_columns=["full_str"],
)

column_names = ["NAME", "STATE", "TELEPHONE"]
widths = [20, 10, 12]

# Calculate (offset, width) slice values from the widths.
slice_tuples = []
offset = 0
for w in widths:
    slice_tuples.append((offset, w))
    offset += w

df.with_columns(
    [
        # str.strip was renamed to str.strip_chars in newer polars.
        pl.col("full_str").str.slice(slice_tuple[0], slice_tuple[1]).str.strip_chars().alias(col)
        for slice_tuple, col in zip(slice_tuples, column_names)
    ]
).drop("full_str")
shape: (3, 3)
┌───────────────┬───────┬──────────────┐
│ NAME ┆ STATE ┆ TELEPHONE │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════════╪═══════╪══════════════╡
│ John Smith ┆ WA ┆ 418-Y11-4111 │
│ Mary Hartford ┆ CA ┆ 319-Z19-4341 │
│ Evan Nolan ┆ IL ┆ 219-532-c301 │
└───────────────┴───────┴──────────────┘
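As an aside, the (offset, width) bookkeeping in the snippet above can also be expressed with itertools.accumulate from the standard library; this is a sketch, not part of the original comment:

```python
from itertools import accumulate

widths = [20, 10, 12]

# accumulate(..., initial=0) yields the running offsets 0, 20, 30, 42;
# zip truncates to the three (offset, width) pairs we need.
offsets = accumulate(widths, initial=0)
slice_tuples = list(zip(offsets, widths))

print(slice_tuples)  # [(0, 20), (20, 10), (30, 12)]
```

This produces the same slice_tuples list as the explicit loop, without mutable offset state.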
Very nice! Opening a 500 MiB text file:
- pandas' read_fwf: ~48 secs
- my code: ~24 secs
- your code: ~5.5 secs
Here is my preferred implementation for the time being:
from pathlib import Path
from typing import Dict, Tuple, Union

import polars as pl


def read_fixed_width_file_as_strs(
    file_path: Union[Path, str],
    col_names_and_widths: Dict[str, int],
    *,
    skip_rows: int = 0,
) -> pl.DataFrame:
    """
    Reads a fixed-width file into a dataframe.

    Reads all values as strings (as indicated by the function name).
    Strips all values of leading/trailing whitespace.

    Args:
        file_path: Path to the fixed-width file.
        col_names_and_widths: A dictionary where the keys are the column names
            and the values are the widths of the columns.
        skip_rows: Number of leading rows (e.g. a header line) to skip.
    """
    # Source: adapted from https://github.com/pola-rs/polars/issues/3151#issuecomment-1397354684
    df = pl.read_csv(
        file_path,
        has_header=False,
        skip_rows=skip_rows,
        new_columns=["full_str"],
    )

    # Transform col_names_and_widths into Dict[column name, Tuple[start, width]].
    slices: Dict[str, Tuple[int, int]] = {}
    start = 0
    for col_name, width in col_names_and_widths.items():
        slices[col_name] = (start, width)
        start += width

    df = df.with_columns(
        [
            pl.col("full_str").str.slice(slice_tuple[0], slice_tuple[1]).str.strip_chars().alias(col)
            for col, slice_tuple in slices.items()
        ]
    ).drop("full_str")
    return df
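For reference, the per-line slicing that function delegates to polars is equivalent to this plain-Python parse (a stdlib-only sketch, no polars required; parse_fwf_line is a hypothetical helper, not part of the comment above):

```python
from typing import Dict


def parse_fwf_line(line: str, col_names_and_widths: Dict[str, int]) -> Dict[str, str]:
    """Slice one fixed-width line into stripped column values."""
    out = {}
    start = 0
    for name, width in col_names_and_widths.items():
        out[name] = line[start:start + width].strip()
        start += width
    return out


# Build a sample line matching the widths from fwf_test.txt above.
line = "John Smith".ljust(20) + "WA".ljust(10) + "418-Y11-4111"
row = parse_fwf_line(line, {"NAME": 20, "STATE": 10, "TELEPHONE": 12})
print(row)  # {'NAME': 'John Smith', 'STATE': 'WA', 'TELEPHONE': '418-Y11-4111'}
```

The dict-based signature relies on dictionaries preserving insertion order, which is guaranteed since Python 3.7.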
FWIW, I am using str.extract_groups to read/parse a fwf file: https://stackoverflow.com/a/78545671/2383070
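The idea behind that approach is one named capture group per fixed-width field. The same technique can be illustrated with Python's stdlib re module (a sketch using the sample file's widths, not the polars code from the linked answer):

```python
import re

# One named group per fixed-width field; widths taken from fwf_test.txt above.
FWF_PATTERN = re.compile(r"(?P<NAME>.{20})(?P<STATE>.{10})(?P<TELEPHONE>.{12})")


def parse_with_regex(line: str) -> dict:
    """Parse one fixed-width line via named capture groups."""
    m = FWF_PATTERN.match(line)
    if m is None:
        raise ValueError(f"line does not match fixed-width layout: {line!r}")
    return {name: value.strip() for name, value in m.groupdict().items()}


line = "Mary Hartford".ljust(20) + "CA".ljust(10) + "319-Z19-4341"
result = parse_with_regex(line)
print(result)  # {'NAME': 'Mary Hartford', 'STATE': 'CA', 'TELEPHONE': '319-Z19-4341'}
```

In polars the analogous expression would pass a pattern like this to str.extract_groups and then unnest the resulting struct column into one column per group.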
Describe your feature request
Pandas implements a function for reading fixed-width text files, which are produced, for example, by some SQL queries. It would be neat to have this in polars as well.