pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Fixed-width text file reader #3151

Open DrMaphuse opened 2 years ago

DrMaphuse commented 2 years ago

Describe your feature request

Pandas implements a function for reading fixed-width text files, which are produced, for example, by some SQL queries. It would be neat to have this in polars as well.
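For reference, the pandas API being requested looks roughly like this (a sketch; the sample rows and the 20/10/12 field widths mirror the fwf_test.txt example later in this thread):

```python
import io

import pandas as pd

# Fixed-width sample: 20-char NAME, 10-char STATE, 12-char TELEPHONE.
data = (
    "NAME                STATE     TELEPHONE  \n"
    "John Smith          WA        418-Y11-4111\n"
    "Mary Hartford       CA        319-Z19-4341\n"
)

# read_fwf slices each line by the given widths and strips the padding.
df = pd.read_fwf(io.StringIO(data), widths=[20, 10, 12])
print(df.loc[0, "NAME"])  # -> John Smith
```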

jicart commented 1 year ago

As a workaround, I am using this code to read a fixed-width file. It's more than twice as fast as pandas' read_fwf for me when reading large files. (fwidths and column_names hold the field widths and target column names.)

df = pl.read_csv(file_path)
df.columns = ["full_str"]

for w, col in zip(fwidths, column_names):
    # Peel off the first w characters as the next column, then drop them from the remainder.
    df = df.with_columns(pl.col("full_str").str.slice(0, w).str.strip_chars_end().alias(col))
    df = df.with_columns(pl.col("full_str").str.slice(w).alias("full_str"))
df = df.drop("full_str")

ghuls commented 1 year ago
$ cat fwf_test.txt
NAME                STATE     TELEPHONE  
John Smith          WA        418-Y11-4111
Mary Hartford       CA        319-Z19-4341
Evan Nolan          IL        219-532-c301

This will probably perform faster:

df = pl.read_csv(
    "fwf_test.txt",
    has_header=False,
    skip_rows=1,
    new_columns=["full_str"]
)

column_names = [ "NAME", "STATE", "TELEPHONE" ]
widths = [20, 10, 12]

# Calculate slice values from widths.
slice_tuples = []
offset = 0

for i in widths:
    slice_tuples.append((offset, i))
    offset += i

df.with_columns(
    [
       pl.col("full_str").str.slice(slice_tuple[0], slice_tuple[1]).str.strip_chars().alias(col)
       for slice_tuple, col in zip(slice_tuples, column_names)
    ]
).drop("full_str")
shape: (3, 3)
┌───────────────┬───────┬──────────────┐
│ NAME          ┆ STATE ┆ TELEPHONE    │
│ ---           ┆ ---   ┆ ---          │
│ str           ┆ str   ┆ str          │
╞═══════════════╪═══════╪══════════════╡
│ John Smith    ┆ WA    ┆ 418-Y11-4111 │
│ Mary Hartford ┆ CA    ┆ 319-Z19-4341 │
│ Evan Nolan    ┆ IL    ┆ 219-532-c301 │
└───────────────┴───────┴──────────────┘
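The offset loop above can also be written with itertools.accumulate (a minor variation, not from the thread):

```python
from itertools import accumulate

widths = [20, 10, 12]

# The running sum of widths gives each field's end position; prepending 0 and
# dropping the final total yields the start offsets.
offsets = [0, *accumulate(widths)][:-1]
slice_tuples = list(zip(offsets, widths))
print(slice_tuples)  # -> [(0, 20), (20, 10), (30, 12)]
```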
jicart commented 1 year ago

This will probably perform faster:

Very nice! Opening a 500 MiB text file:
- pandas' read_fwf: ~48 secs
- my code: ~24 secs
- your code: ~5.5 secs

DeflateAwning commented 7 months ago

Here is my preferred implementation for the time being:


from pathlib import Path
from typing import Dict, Tuple, Union

import polars as pl


def read_fixed_width_file_as_strs(file_path: Union[Path, str], col_names_and_widths: Dict[str, int], *, skip_rows: int = 0) -> pl.DataFrame:
    """
    Reads a fixed-width file into a dataframe.
    Reads all values as strings (as indicated by function name).
    Strips all values of leading/trailing whitespaces.

    Args:
        col_names_and_widths: A dictionary where the keys are the column names and the values are the widths of the columns.
    """

    # Source: adapted from https://github.com/pola-rs/polars/issues/3151#issuecomment-1397354684

    df = pl.read_csv(
        file_path,
        has_header=False,
        skip_rows=skip_rows,
        new_columns=["full_str"],
    )

    # transform col_names_and_widths into a Dict[cols name, Tuple[start, width]]
    slices: Dict[str, Tuple[int, int]] = {}
    start = 0
    for col_name, width in col_names_and_widths.items():
        slices[col_name] = (start, width)
        start += width

    df = df.with_columns(
        [
            pl.col("full_str").str.slice(slice_tuple[0], slice_tuple[1]).str.strip_chars().alias(col)
            for col, slice_tuple in slices.items()
        ]
    ).drop(["full_str"])

    return df
blaylockbk commented 5 months ago

FWIW, I am using str.extract_groups to read/parse a fixed-width file: https://stackoverflow.com/a/78545671/2383070