pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Improve string split API and DataTypes (`split`, `splitn`, `split_exact`) #16565

Open Julian-J-S opened 1 month ago

Julian-J-S commented 1 month ago

Description

I want to raise this once more before the API is "locked in" with v1.0 (see also #11640, #13649).

Example

import polars as pl

pl.DataFrame({"str": ["hello world !", "a b c d e"]}).with_columns(
    split=pl.col("str").str.split(" "),
    split_exact=pl.col("str").str.split_exact(" ", n=2),
    splitn=pl.col("str").str.splitn(" ", n=2),
)

# shape: (2, 4)
# ┌───────────────┬───────────────────────────┬───────────────────────┬─────────────────────┐
# │ str           ┆ split                     ┆ split_exact           ┆ splitn              │
# │ ---           ┆ ---                       ┆ ---                   ┆ ---                 │
# │ str           ┆ list[str]                 ┆ struct[3]             ┆ struct[2]           │
# ╞═══════════════╪═══════════════════════════╪═══════════════════════╪═════════════════════╡
# │ hello world ! ┆ ["hello", "world", "!"]   ┆ {"hello","world","!"} ┆ {"hello","world !"} │
# │ a b c d e     ┆ ["a", "b", "c", "d", "e"] ┆ {"a","b","c"}         ┆ {"a","b c d e"}     │
# └───────────────┴───────────────────────────┴───────────────────────┴─────────────────────┘
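
To make the difference between the three return shapes concrete, here is a minimal pure-Python sketch of the semantics shown in the table above. It is not how Polars implements them: the helper functions are hypothetical stand-ins that borrow the method names, tuples stand in for the fixed-width structs, and a non-empty separator is assumed.

```python
from __future__ import annotations


def split(s: str, by: str) -> list[str]:
    # Variable-length result: every piece is kept (models list[str]).
    return s.split(by)


def split_exact(s: str, by: str, n: int) -> tuple[str | None, ...]:
    # Fixed width of n + 1 fields: pieces beyond the first n + 1 are dropped,
    # missing pieces are padded with None (models struct[n + 1]).
    parts = s.split(by)[: n + 1]
    return tuple(parts + [None] * (n + 1 - len(parts)))


def splitn(s: str, by: str, n: int) -> tuple[str | None, ...]:
    # Fixed width of n fields: at most n - 1 splits are made, so the last
    # field keeps the unsplit remainder (models struct[n]).
    parts = s.split(by, n - 1)
    return tuple(parts + [None] * (n - len(parts)))


for text in ["hello world !", "a b c d e"]:
    print(split(text, " "), split_exact(text, " ", 2), splitn(text, " ", 2))
```

Note that `n` means "number of splits" in `split_exact` but "number of resulting fields" in `splitn`, which is why the same `n=2` yields three fields in one case and two in the other.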

Problems

Suggested Improvement

justcodingandy commented 2 weeks ago

Would changing the return type of splitn / split_exact to Array allow the following?

.with_columns(pl.all().str.split_exact('=', 1).list.get(1))

I am facing an issue like the one in "Convert struct to list".

cmdlineluser commented 2 weeks ago

@justcodingandy You can extract struct fields by index.

>>> pl.select(pl.lit('foo=bar').str.split_exact('=', 1).struct[1])
shape: (1, 1)
┌─────────┐
│ field_1 │
│ ---     │
│ str     │
╞═════════╡
│ bar     │
└─────────┘