pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

feat(python): Add new `alpha`, `alphanumeric` and `digit` selectors #16310

Closed alexander-beedie closed 1 week ago

alexander-beedie commented 2 weeks ago

New selectors that make it even easier to classify column names by character type: `cs.alpha`, `cs.alphanumeric`, and `cs.digit`.

One of the nice bonuses in having separate selectors for this is making sure that non-ASCII letters are handled automatically, eg: accented characters in words such as "tweeëntwintig", kanji such as "東京", hangul, etc.

(I also suspect that, even amongst users familiar with regular expressions, a reasonable number wouldn't immediately know the equivalent `cs.matches` pattern `^[\p{Alphabetic}]+$`, which is also a little more cryptic in a codebase 🤔)

There is an optional flag ascii_only if you want to limit the definition of "alphabetic" to ASCII, but having Unicode letters recognised by default is a good out-of-the-box experience for more languages.
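As a rough plain-Python analogy for what the `ascii_only` flag changes (not how polars implements it), the sketch below classifies names with `str.isalpha` and `str.isascii`. The helper `is_alpha_name` is hypothetical, and Python's notion of "alphabetic" may differ from the regex `\p{Alphabetic}` class in edge cases.

```python
def is_alpha_name(name: str, ascii_only: bool = False) -> bool:
    """Hypothetical helper: True if every character in `name` is a letter."""
    if ascii_only:
        # Restrict to ASCII letters only (a-z, A-Z).
        return name.isascii() and name.isalpha()
    # str.isalpha is Unicode-aware, so accented letters and kanji count.
    return name.isalpha()


names = ["no1", "café", "t/f", "hmm", "都市"]

unicode_alpha = [n for n in names if is_alpha_name(n)]
ascii_alpha = [n for n in names if is_alpha_name(n, ascii_only=True)]

print(unicode_alpha)  # ['café', 'hmm', '都市']
print(ascii_alpha)    # ['hmm']
```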

Examples

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({
    "no1":  [100, 200, 300],
    "café": ["espresso", "latte", "mocha"],
    "t/f":  [True, False, None],
    "hmm":  ["aaa", "bbb", "ccc"],
    "都市":  ["東京", "大阪", "京都"],
})

Select columns with alphabetic names; note that accented characters and kanji are recognised as valid:

df.select(cs.alpha())
# shape: (3, 3)
# ┌──────────┬─────┬──────┐
# │ café     ┆ hmm ┆ 都市 │
# │ ---      ┆ --- ┆ ---  │
# │ str      ┆ str ┆ str  │
# ╞══════════╪═════╪══════╡
# │ espresso ┆ aaa ┆ 東京 │
# │ latte    ┆ bbb ┆ 大阪 │
# │ mocha    ┆ ccc ┆ 京都 │
# └──────────┴─────┴──────┘

Constrain the definition of "alphabetic" to ASCII characters:

df.select(cs.alpha(ascii_only=True))
# shape: (3, 1)
# ┌─────┐
# │ hmm │
# │ --- │
# │ str │
# ╞═════╡
# │ aaa │
# │ bbb │
# │ ccc │
# └─────┘

Select columns with non-ASCII alphabetic names by taking the difference of the two selectors:

df.select(cs.alpha() - cs.alpha(ascii_only=True))
# shape: (3, 2)
# ┌──────────┬──────┐
# │ café     ┆ 都市 │
# │ ---      ┆ ---  │
# │ str      ┆ str  │
# ╞══════════╪══════╡
# │ espresso ┆ 東京 │
# │ latte    ┆ 大阪 │
# │ mocha    ┆ 京都 │
# └──────────┴──────┘

Select all columns except for those with alphabetic names:

df.select(~cs.alpha())
# shape: (3, 2)
# ┌─────┬───────┐
# │ no1 ┆ t/f   │
# │ --- ┆ ---   │
# │ i64 ┆ bool  │
# ╞═════╪═══════╡
# │ 100 ┆ true  │
# │ 200 ┆ false │
# │ 300 ┆ null  │
# └─────┴───────┘

Select alphanumeric names:

df.select(cs.alphanumeric())
# shape: (3, 4)
# ┌─────┬──────────┬─────┬──────┐
# │ no1 ┆ café     ┆ hmm ┆ 都市 │
# │ --- ┆ ---      ┆ --- ┆ ---  │
# │ i64 ┆ str      ┆ str ┆ str  │
# ╞═════╪══════════╪═════╪══════╡
# │ 100 ┆ espresso ┆ aaa ┆ 東京 │
# │ 200 ┆ latte    ┆ bbb ┆ 大阪 │
# │ 300 ┆ mocha    ┆ ccc ┆ 京都 │
# └─────┴──────────┴─────┴──────┘

Select alphanumeric names, constraining the definition to ASCII characters:

df.select(cs.alphanumeric(ascii_only=True))
# shape: (3, 2)
# ┌─────┬─────┐
# │ no1 ┆ hmm │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 100 ┆ aaa │
# │ 200 ┆ bbb │
# │ 300 ┆ ccc │
# └─────┴─────┘
codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 80.76%. Comparing base (6804f33) to head (e0f3d8b).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main   #16310   +/-   ##
=======================================
  Coverage   80.75%   80.76%
=======================================
  Files        1393     1393
  Lines      179423   179431    +8
  Branches     2922     2922
=======================================
+ Hits       144891   144912   +21
+ Misses      34029    34016   -13
  Partials      503      503
```
