feat(python): Add new `alpha`, `alphanumeric` and `digit` selectors

Adds three new selectors that make it easier to classify column names by character type:

cs.alphanumeric(): only names composed of letters and digits.
cs.alpha(): only names composed of letters.
cs.digit(): only names composed of digits.

A nice bonus of having dedicated selectors for this is that non-ASCII letters are handled automatically, e.g. accented characters in words such as "tweeëntwintig", kanji such as "東京", hangul, etc.

(I also suspect that, even amongst users familiar with regular expressions, a reasonable number wouldn't immediately know the equivalent cs.matches pattern ^[\p{Alphabetic}]+$, which is also a little more cryptic in a codebase 🤔)

An optional ascii_only flag limits the definition of "alphabetic" to ASCII, but recognising Unicode letters by default gives a better out-of-the-box experience across more languages.
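
To make the Unicode versus ASCII distinction concrete, here is a rough pure-Python sketch of the classification. It is only an approximation and not how the selectors are implemented: it assumes that Python's `str.isalpha` and `str.isalnum` are close enough stand-ins for the `\p{Alphabetic}` class (they do not agree on every code point), and the helper names `is_alpha` and `is_alphanumeric` are hypothetical.

```python
def is_alpha(name: str, ascii_only: bool = False) -> bool:
    """Approximate check: is the whole name made of letters?"""
    if ascii_only and not name.isascii():
        return False
    # str.isalpha() is False for the empty string, like a `+` quantifier.
    return name.isalpha()


def is_alphanumeric(name: str, ascii_only: bool = False) -> bool:
    """Approximate check: is the whole name made of letters and/or digits?"""
    if ascii_only and not name.isascii():
        return False
    return name.isalnum()


# Same column names as the example frame; "caf\u00e9" is "café" written
# with a precomposed é so the result does not depend on normalisation.
names = ["no1", "caf\u00e9", "t/f", "hmm", "都市"]

alpha_names = [n for n in names if is_alpha(n)]
ascii_alpha_names = [n for n in names if is_alpha(n, ascii_only=True)]
alnum_names = [n for n in names if is_alphanumeric(n)]
ascii_alnum_names = [n for n in names if is_alphanumeric(n, ascii_only=True)]

print(alpha_names)        # ['café', 'hmm', '都市']
print(ascii_alpha_names)  # ['hmm']
print(alnum_names)        # ['no1', 'café', 'hmm', '都市']
print(ascii_alnum_names)  # ['no1', 'hmm']
```

The names picked out here line up with the columns returned in the examples that follow.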

Examples

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({
    "no1":  [100, 200, 300],
    "café": ["espresso", "latte", "mocha"],
    "t/f":  [True, False, None],
    "hmm":  ["aaa", "bbb", "ccc"],
    "都市":  ["東京", "大阪", "京都"],
})

Select columns with alphabetic names; note that accented characters and kanji are recognised as valid:

df.select(cs.alpha())
# shape: (3, 3)
# ┌──────────┬─────┬──────┐
# │ café     ┆ hmm ┆ 都市 │
# │ ---      ┆ --- ┆ ---  │
# │ str      ┆ str ┆ str  │
# ╞══════════╪═════╪══════╡
# │ espresso ┆ aaa ┆ 東京 │
# │ latte    ┆ bbb ┆ 大阪 │
# │ mocha    ┆ ccc ┆ 京都 │
# └──────────┴─────┴──────┘

Constrain the definition of "alphabetic" to ASCII characters:

df.select(cs.alpha(ascii_only=True))
# shape: (3, 1)
# ┌─────┐
# │ hmm │
# │ --- │
# │ str │
# ╞═════╡
# │ aaa │
# │ bbb │
# │ ccc │
# └─────┘

Select columns with non-ASCII alphabetic names :)

df.select(cs.alpha() - cs.alpha(ascii_only=True))
# shape: (3, 2)
# ┌──────────┬──────┐
# │ café     ┆ 都市 │
# │ ---      ┆ ---  │
# │ str      ┆ str  │
# ╞══════════╪══════╡
# │ espresso ┆ 東京 │
# │ latte    ┆ 大阪 │
# │ mocha    ┆ 京都 │
# └──────────┴──────┘

Select all columns except for those with alphabetic names:

df.select(~cs.alpha())
# shape: (3, 2)
# ┌─────┬───────┐
# │ no1 ┆ t/f   │
# │ --- ┆ ---   │
# │ i64 ┆ bool  │
# ╞═════╪═══════╡
# │ 100 ┆ true  │
# │ 200 ┆ false │
# │ 300 ┆ null  │
# └─────┴───────┘

Select alphanumeric names:

df.select(cs.alphanumeric())
# shape: (3, 4)
# ┌─────┬──────────┬─────┬──────┐
# │ no1 ┆ café     ┆ hmm ┆ 都市 │
# │ --- ┆ ---      ┆ --- ┆ ---  │
# │ i64 ┆ str      ┆ str ┆ str  │
# ╞═════╪══════════╪═════╪══════╡
# │ 100 ┆ espresso ┆ aaa ┆ 東京 │
# │ 200 ┆ latte    ┆ bbb ┆ 大阪 │
# │ 300 ┆ mocha    ┆ ccc ┆ 京都 │
# └─────┴──────────┴─────┴──────┘

Select alphanumeric names, constraining the definition to ASCII characters:

df.select(cs.alphanumeric(ascii_only=True))
# shape: (3, 2)
# ┌─────┬─────┐
# │ no1 ┆ hmm │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 100 ┆ aaa │
# │ 200 ┆ bbb │
# │ 300 ┆ ccc │
# └─────┴─────┘

feat(python): Add new `alpha`, `alphanumeric` and `digit` selectors #16310
