pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

The bin function of character variable #16601

Open cfkstat opened 4 months ago

cfkstat commented 4 months ago

Description

When developing risk scorecards, character (string) variables need to be grouped into bins, which is usually implemented with if-else statements. Could Polars consider adding a function, similar to `Expr.cut`, that outputs the corresponding group for each value given a list of groups? For example: `pl.col("var").strcut([["a", "b"], ["c", "d"]], labels=[str(["a", "b"]), str(["c", "d"]), "others"])`. The example below shows how this is done today; I would like to call a single function directly instead of chaining multiple `pl.when` statements.

```python
import polars as pl

in_plf = pl.DataFrame(
    {
        "var1": ["a", "b", "c", "d", "e", "f", "g", "h"],
        "var2": ["a", "b", "c", "d", "e", "f", "g", "h"],
        "var3": ["a", "b", "c", "d", "e", "f", "g", "h"],
    }
)

# Current approach: one when/then/otherwise chain per column.
in_plf.with_columns(
    pl.when(pl.col("var1").is_in(["a", "b"]))
    .then(pl.lit(str(["a", "b"])))
    .when(pl.col("var1").is_in(["c", "d"]))
    .then(pl.lit(str(["c", "d"])))
    .otherwise(pl.lit("others"))
    .alias("var1_bin"),
    pl.when(pl.col("var2").is_in(["a", "b"]))
    .then(pl.lit(str(["a", "b"])))
    .when(pl.col("var2").is_in(["c", "d"]))
    .then(pl.lit(str(["c", "d"])))
    .otherwise(pl.lit("others"))
    .alias("var2_bin"),
    pl.when(pl.col("var3").is_in(["a", "b"]))
    .then(pl.lit(str(["a", "b"])))
    .when(pl.col("var3").is_in(["c", "d"]))
    .then(pl.lit(str(["c", "d"])))
    .otherwise(pl.lit("others"))
    .alias("var3_bin"),
)
```


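With the proposed function, the same binning could be written as:

```python
# Groups of values per column; each group becomes one bin.
var_split = {
    "var1": [["a", "b"], ["c", "d"]],
    "var2": [["a", "b"], ["c", "d"]],
    "var3": [["a", "b"], ["c", "d"]],
}

# `strcut` is the requested function; it does not exist in Polars yet.
in_plf.with_columns(
    *[
        pl.col(var)
        .strcut(split, labels=[str(item) for item in split] + ["others"])
        .alias(var)
        for var, split in var_split.items()
    ]
)
```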

cmdlineluser commented 4 months ago

If it helps, your example can also be written as:

```python
df.with_columns(
    pl.when(pl.all().is_in(["a", "b"]))
      .then(pl.lit(str(["a", "b"])))
      .when(pl.all().is_in(["c", "d"]))
      .then(pl.lit(str(["c", "d"])))
      .otherwise(pl.lit("others"))
      .name.suffix("_bin")
)
```

It may also be related to the general issue of a simpler way to provide whens/thens e.g. https://github.com/pola-rs/polars/issues/5823