markfairbanks / tidypolars

Tidy interface to polars
http://tidypolars.readthedocs.io
MIT License

Using mutate() and columns in polars dataframe #217

Closed dataning closed 1 year ago

dataning commented 1 year ago

Big fan of your tidypolars, especially for people coding in both tidyverse and polars.

I am trying to use tidypolars together with plain polars on the same polars DataFrames. I thought it would work nicely because they look the same and are all just polars DataFrames. However, I sometimes get an error.

Starting from a simple one:

With the same dataframe output:
pl.read_csv().columns - I can get the column names
tp.read_csv().columns - I cannot get the column names

More interestingly, I was trying to use mutate() because the tidyverse style is nicer. However, I cannot apply mutate() to a polars DataFrame that did not come from tidypolars. It does work if I first convert the polars DataFrame to a pandas DataFrame and then convert that to a tidypolars DataFrame. I suspect it has something to do with the Tibble formatting behind the scenes, but the output looks identical to a regular polars DataFrame, so it confused me.

Would it be possible to use tidypolars alongside polars DataFrames?

markfairbanks commented 1 year ago

> With the same dataframe output:
> pl.read_csv().columns - I can get the column names
> tp.read_csv().columns - I cannot get the column names

You can extract column names using tidypolars_df.names (much like using names(df) in R).

> However, I cannot apply the mutate() directly to a polars dataframe if it's not directly coming through tidypolars

You can convert data to/from polars DataFrames with tp.from_polars()/.to_polars():

# Option 1
tp.from_polars(polars_df).mutate().to_polars().agg()

# Option 2 (using `.pipe()` method)
polars_df.pipe(tp.from_polars).mutate().to_polars().agg()

But regarding the bigger overall question:

Short answer: I think I'm going to build this functionality, in a way (see #208).

Long answer: In Python, methods (accessed using .method() syntax) belong to a specific class. .mutate() is built for the Tibble class, and therefore won't work on polars DataFrames, just like polars' .with_columns() won't work on a pandas DataFrame or a tidypolars Tibble. I can't safely add a method to the polars DataFrame class since I don't own that code (there is technically a hacky way, but it is almost guaranteed to break internal polars code). That's why the Tibble class is necessary.
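
To make the point about methods belonging to classes concrete, here is a minimal sketch in plain Python. Frame, Tibble, from_frame and to_frame are hypothetical stand-ins invented for this illustration. They are not the polars or tidypolars implementations, and the real classes are far more involved.

```python
class Frame:
    """Hypothetical stand-in for a data frame class owned by another library."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def with_columns(self, **new_columns):
        # Return a new Frame with the extra columns added.
        return Frame({**self.columns, **new_columns})

    def pipe(self, func, *args, **kwargs):
        # Pass this object to a function, which keeps method chains readable.
        return func(self, *args, **kwargs)


class Tibble:
    """Hypothetical wrapper class that carries the tidy-style methods."""

    def __init__(self, frame):
        self._frame = frame

    @property
    def names(self):
        return list(self._frame.columns)

    def mutate(self, **new_columns):
        # Delegate to the wrapped Frame, then wrap the result again.
        return Tibble(self._frame.with_columns(**new_columns))

    def to_frame(self):
        return self._frame


def from_frame(frame):
    # Conversion helper: wrap a Frame so that .mutate() becomes available.
    return Tibble(frame)


frame = Frame({"x": [1, 2, 3]})

# .mutate() is defined on Tibble only, so a bare Frame does not have it.
has_mutate = hasattr(frame, "mutate")

# Wrapping, mutating and unwrapping round-trips back to a Frame.
result = frame.pipe(from_frame).mutate(y=[2, 4, 6]).to_frame()
```

In this sketch has_mutate is False, and result is an ordinary Frame again, which mirrors the from_polars()/.to_polars() round trip shown above.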

This is one massive disadvantage of building tools in Python that try to extend the functionality of an existing data frame library. Python's object-oriented structure causes this limitation. In R, all data frame libraries (dplyr, data.table) are built on top of R's base data.frame class, and functions can be made that operate differently depending on the type of object that is fed into them. This is what the S3 object-oriented system allows. It's also more or less what the Julia language implements for its OOP system.
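
For readers unfamiliar with S3, the idea of a function that behaves differently depending on the type it is given can be sketched in Python with functools.singledispatch from the standard library. This is only an illustration of the generic-function idea, not how tidypolars is built. BaseFrame and this mutate function are hypothetical names invented for the example.

```python
from functools import singledispatch


class BaseFrame:
    """Hypothetical data frame type owned by another library."""

    def __init__(self, columns):
        self.columns = dict(columns)


@singledispatch
def mutate(data, **new_columns):
    # Fallback for types that have no registered implementation.
    raise TypeError(f"mutate() does not support {type(data).__name__}")


@mutate.register
def _(data: BaseFrame, **new_columns):
    # Implementation chosen when the first argument is a BaseFrame.
    return BaseFrame({**data.columns, **new_columns})


@mutate.register
def _(data: dict, **new_columns):
    # Implementation chosen when the first argument is a plain dict.
    return {**data, **new_columns}


frame_result = mutate(BaseFrame({"x": [1, 2]}), y=[3, 4])
dict_result = mutate({"x": 1}, y=2)
```

The catch is the call syntax: this gives mutate(df, ...) as a free function rather than df.mutate(...), so it does not remove the need for a wrapper class when method chaining is the goal.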

Even the solution proposed in #208 is sort of hacky, but it will allow people to work directly on polars DataFrames.

If you have any further questions or need something clarified, feel free to ask in this issue.