paddymul / buckaroo

Buckaroo - the data wrangling assistant for pandas. Quickly explore dataframes, and run pandas commands via a GUI. Works inside the jupyter notebook.
https://paddymul.github.io/buckaroo/
BSD 3-Clause "New" or "Revised" License
228 stars 10 forks

Figure out better autocleaning comparison #220

Open paddymul opened 9 months ago

paddymul commented 9 months ago

Checks

How would you categorize this request? You can select multiple if not sure.

Auto Cleaning, Performance

Enhancement Description

Polars makes some autocleaning functionality very difficult, particularly comparing original values to modified values across different dtypes. This makes it much harder to color and add tooltips to the resulting dataframe based on modifications.

import polars as pl
df = pl.DataFrame({'a_raw': ["not_parseable", "30"], 'a_cleaned': [None, 30]})
df.select(pl.col("a_raw").eq(pl.col("a_cleaned")))  # string column compared to integer column

The two columns shouldn't compare equal because they have different dtypes, but you can't do this either:

df = pl.DataFrame({'a_raw': pl.Series(["not_parseable", 30], dtype=pl.Object), 'a_cleaned': [None, 30]})
df.select(pl.col("a_raw").eq(pl.col("a_cleaned")))  # Object column compared to integer column

You can't even do this:

df = pl.DataFrame({'a_raw': pl.Series(["not_parseable", 30], dtype=pl.Object), 'a_cleaned': [None, 30]})
df.select(pl.struct(["a_raw", "a_cleaned"]).map_elements(lambda x: x["a_raw"] == x["a_cleaned"]))

This fails because you can't put an Object column into a struct.

Pseudo Code Implementation

This might require writing some custom expressions, particularly a version of cast that returns a struct holding the original value alongside the cast result.

Prior Art

N/A