pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Can the separator of the read csv function support regular splitting? #16044

Closed Esword618 closed 1 week ago

Esword618 commented 2 weeks ago

Description

Here is the content of my data:

     11.50225    34.62792   341.48861    60.23845    33.86916   340.52216
     16.08011    46.36068   112.74108    82.09562    45.90745   112.68871
      5.44448    64.20202    84.74526    92.26079    63.48149    84.83877
    154.21007    40.30874   284.20968   248.08102    40.32464   284.05453
     44.78606    81.08370   306.90320   207.53215    80.58101   307.01056
    187.79354    52.18742   348.14328   254.43741    52.35809   348.16040
      3.19632    58.35471   336.89014    83.53841    59.67276   335.88022
      4.53459    54.00255    23.75481    66.02106    51.58699    23.86702
JulianCologne commented 2 weeks ago

AFAIK this is not possible with polars currently, because the separator must be a single character.

What you are looking for is the equivalent of pandas' read_fwf for reading "fixed-width formatted" data (https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html).

There are already a few issues about this, but it is not yet supported:

#8312 #3151

Esword618 commented 2 weeks ago

In pandas, I use the read_csv function with the regex separator parameter sep='\s+' to split the data:

df = pd.read_csv(filename, header=None, skiprows=6, sep='\s+')
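As an aside, the regex '\s+' matches any run of whitespace, which is exactly what Python's plain str.split() with no argument does. A minimal stdlib sketch of the same parsing, without pandas (the inline DATA string is a two-row excerpt of the data above):

```python
# Parse whitespace-separated numeric rows without pandas.
# str.split() with no argument splits on any run of whitespace and
# ignores leading/trailing blanks -- the same effect as sep='\s+'.
DATA = """\
     11.50225    34.62792   341.48861    60.23845    33.86916   340.52216
     16.08011    46.36068   112.74108    82.09562    45.90745   112.68871
"""

rows = [[float(tok) for tok in line.split()] for line in DATA.splitlines()]
print(rows[0])  # first parsed row: six floats
```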
JulianCologne commented 2 weeks ago

Yeah, this also works in pandas, but as I said, polars currently does not support a regex or multi-character string separator, only a single char.

there are workarounds but they are not very nice 😆

DATA = """\
     11.50225    34.62792   341.48861    60.23845    33.86916   340.52216
     16.08011    46.36068   112.74108    82.09562    45.90745   112.68871
      5.44448    64.20202    84.74526    92.26079    63.48149    84.83877
    154.21007    40.30874   284.20968   248.08102    40.32464   284.05453
     44.78606    81.08370   306.90320   207.53215    80.58101   307.01056
    187.79354    52.18742   348.14328   254.43741    52.35809   348.16040
      3.19632    58.35471   336.89014    83.53841    59.67276   335.88022
      4.53459    54.00255    23.75481    66.02106    51.58699    23.86702
"""

pl.read_csv(DATA.encode(), has_header=False, new_columns=["data"]).with_columns(
    pl.col("data")
    .str.strip_chars(" ")        # drop leading/trailing spaces
    .str.replace_all(" +", " ")  # collapse runs of spaces to one
    .str.split(" ")              # split each row into a list of tokens
    .list.to_struct()            # list -> struct so it can be unnested
).unnest(columns="data").with_columns(pl.all().cast(pl.Float64))

shape: (8, 6)
┌───────────┬──────────┬───────────┬───────────┬──────────┬───────────┐
│ field_0   ┆ field_1  ┆ field_2   ┆ field_3   ┆ field_4  ┆ field_5   │
│ ---       ┆ ---      ┆ ---       ┆ ---       ┆ ---      ┆ ---       │
│ f64       ┆ f64      ┆ f64       ┆ f64       ┆ f64      ┆ f64       │
╞═══════════╪══════════╪═══════════╪═══════════╪══════════╪═══════════╡
│ 11.50225  ┆ 34.62792 ┆ 341.48861 ┆ 60.23845  ┆ 33.86916 ┆ 340.52216 │
│ 16.08011  ┆ 46.36068 ┆ 112.74108 ┆ 82.09562  ┆ 45.90745 ┆ 112.68871 │
│ 5.44448   ┆ 64.20202 ┆ 84.74526  ┆ 92.26079  ┆ 63.48149 ┆ 84.83877  │
│ 154.21007 ┆ 40.30874 ┆ 284.20968 ┆ 248.08102 ┆ 40.32464 ┆ 284.05453 │
│ 44.78606  ┆ 81.0837  ┆ 306.9032  ┆ 207.53215 ┆ 80.58101 ┆ 307.01056 │
│ 187.79354 ┆ 52.18742 ┆ 348.14328 ┆ 254.43741 ┆ 52.35809 ┆ 348.1604  │
│ 3.19632   ┆ 58.35471 ┆ 336.89014 ┆ 83.53841  ┆ 59.67276 ┆ 335.88022 │
│ 4.53459   ┆ 54.00255 ┆ 23.75481  ┆ 66.02106  ┆ 51.58699 ┆ 23.86702  │
└───────────┴──────────┴───────────┴───────────┴──────────┴───────────┘

However, if the file is not huge, the best way is probably to read the raw text, replace every `\s+` run with ',', and then read_csv the "clean" CSV with polars.
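The clean-up step can be sketched with the stdlib re module; the pl.read_csv call on the result is shown only as a comment, since it mirrors the snippet above (RAW is a two-row excerpt of the data):

```python
import re

RAW = """\
     11.50225    34.62792   341.48861    60.23845    33.86916   340.52216
     16.08011    46.36068   112.74108    82.09562    45.90745   112.68871
"""

# Strip each line first so leading blanks do not create an empty first
# field, then collapse every internal whitespace run into one comma.
clean = "\n".join(re.sub(r"\s+", ",", line.strip()) for line in RAW.splitlines())
print(clean.splitlines()[0])

# The result is plain CSV, which polars can parse directly, e.g.:
#   pl.read_csv(clean.encode(), has_header=False)
```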

IsmaelMousa commented 2 weeks ago

No, because the implementation of the separator parameter in the read_csv method only accepts a single-byte character.

stinodego commented 1 week ago

As answered above: this is not possible.