pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/
Other
405 stars 35 forks source link

`DataFrame`: Simple translations #105

Closed vincentarelbundock closed 1 year ago

vincentarelbundock commented 1 year ago

Translated and tested:

TODO:

sorhawell commented 1 year ago

Interesting that DataFrame.first and DataFrame.last is not implemented for py-polars, but they are in rust-polars just as you wrap them.

sorhawell commented 1 year ago

How should I re-write the slice function below to make length argument optional (None)?

you can do like this (inspired by py-polars wrapper )

fn slice(&self, offset: Robj, length: Robj) -> Result<LazyFrame, String> {
        Ok(LazyFrame(self.0.clone().slice(
            robj_to!(i64, offset)?,
            robj_to!(Option, u32, length)?.unwrap_or(u32::MAX),
        )))
    }

when I get lost in type conversions, I find it helpful to break up the lines and let rust-analyzer explain what types I have currently made

image
sorhawell commented 1 year ago

Missing robj_to_u8 makes rob_to fail when I try to implement DataFrame.std()

you can cut a corner and use the extendr conversion like I did here in the beginning.

we can make this PR into a co-op and then I can add the u8 conversion OR just use extendr in this PR, and I file an issue to upgrade u8 conversions.

sorhawell commented 1 year ago

Could you supply a minimal example of implementing something like .max() on DataFrameGroupBy?

here's a minimal example + tidying up the GroupBy class

after zzz.R is updated you can do like this

GroupBy_max = function() {
  self$agg(pl$all()$max())
}

and use case from py-polars

> df = pl$DataFrame(
+         a = c(1, 2, 2, 3, 4, 5),
+         b = c(0.5, 0.5, 4, 10, 13, 14),
+         c = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
+         d = c("Apple", "Orange", "Apple", "Apple", "Banana", "Banana")
+ )
> df$groupby("d", maintain_order=TRUE)$max()
polars DataFrame: shape: (3, 4)
┌────────┬─────┬──────┬──────┐
│ d      ┆ a   ┆ b    ┆ c    │
│ ---    ┆ --- ┆ ---  ┆ ---  │
│ str    ┆ f64 ┆ f64  ┆ bool │
╞════════╪═════╪══════╪══════╡
│ Apple  ┆ 3.0 ┆ 10.0 ┆ true │
│ Orange ┆ 2.0 ┆ 0.5  ┆ true │
│ Banana ┆ 5.0 ┆ 14.0 ┆ true │
└────────┴─────┴──────┴──────┘
vincentarelbundock commented 1 year ago

Thanks for all the extra info! Everything seems to have worked without a hitch.

I used the u8 shortcut for now.

I'll look at a few more candidates for translation and let you know when I think this PR is ready for a review.

vincentarelbundock commented 1 year ago

@sorhawell This PR now covers most of the trivial translations for DataFrame, LazyFrame, and GroupBy. I think it's a good place to stop for now.

Let me know when you've had time to look at this and I can implement any change you need.

Once this is merged, I'll turn to the S3 methods PR https://github.com/pola-rs/r-polars/pull/107.

Depending on time, energy, etc., I might eventually come back to translate some of the trickier bits, but those will have to wait.

vincentarelbundock commented 1 year ago

Sorry for the dirty commit history. Made some stupid mistakes with patrick, but it should work now.

sorhawell commented 1 year ago

Sorry for the dirty commit history. Made some stupid mistakes with patrick, but it should work now.

no problemo, any PR gets squashed anyways :)

sorhawell commented 1 year ago

looks very good :) I will add to this PR some minor style changes later, I was just derping around in how to commit to a PR

sorhawell commented 1 year ago

I added some changes to the non-fallible rust methods. These do not need to return a Result, and if default arg value on R side, the extendr wrapper can be used directly as is with flag "use_extendr_wrapper".

with a few tests for null_count and estimated_size, this looks like a wrap :)

sorhawell commented 1 year ago

patrick tests look beautiful

vincentarelbundock commented 1 year ago

Great! I will study your changes and comments, and will make changes as appropriate. It may take me a few days because of work + family stuff.

vincentarelbundock commented 1 year ago

I pushed a simple test for .estimated_size().

Here’s something weird: .as_data_frame() changes the values in the .null_count() data frame.

Install the PR branch:

remotes::install_github("vincentarelbundock/r-polars@translation")
library(rpolars)
tmp = mtcars
tmp[1:2, 1:2] = NA
tmp[5, 3] = NA
a = pl$DataFrame(tmp)$null_count()

# correct
a
# polars DataFrame: shape: (1, 11)
# ┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
# │ mpg ┆ cyl ┆ disp ┆ hp  ┆ ... ┆ vs  ┆ am  ┆ gear ┆ carb │
# │ --- ┆ --- ┆ ---  ┆ --- ┆     ┆ --- ┆ --- ┆ ---  ┆ ---  │
# │ u32 ┆ u32 ┆ u32  ┆ u32 ┆     ┆ u32 ┆ u32 ┆ u32  ┆ u32  │
# ╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
# │ 2   ┆ 2   ┆ 1    ┆ 0   ┆ ... ┆ 0   ┆ 0   ┆ 0    ┆ 0    │
# └─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘

# incorrect
a$as_data_frame()
#             mpg           cyl          disp hp drat wt qsec vs am gear carb
# 1 9.881313e-324 9.881313e-324 4.940656e-324  0    0  0    0  0  0    0    0

Any ideas why we get this?

eitsupi commented 1 year ago

Sorry, I merged #112 because NEWS was not updated correctly. Could you please move the news item to the correct position?

vincentarelbundock commented 1 year ago

I just reverted the NEWS commit. Not sure what's the best process to add it back, but for the record, here is what I had written:

- New methods implemented for DataFrame, LazyFrame, and GroupBy objects: min, max, mean, median, sum, std, var, first, last, head, tail, reverse, slice, null_count, estimated_size (#105 @vincentarelbundock)

## New Contributors

- @grantmcdermott made their first contribution in #81
- @vincentarelbundock made their first contribution in #105
vincentarelbundock commented 1 year ago

Ah I see what you mean @eitsupi . Thanks for the merge. I updated NEWS in the right position.

vincentarelbundock commented 1 year ago

nope, still broken. Giving up on this now. We can update the NEWS at the very end.

sorhawell commented 1 year ago

Here’s something weird: .as_data_frame() changes the values in the .null_count() data frame.

Currently r-polars maps arrow-u32 to R bit64::nteger64 which again is a hack to get i64 in R. Under the hood bit64::integer64 is just an i64 which is tagged with "SexpReal", if bit64 is not loaded the user is in for a surprise.

rpostgresql on other packages does the same. I prefer in future to map to f64 and let the connoisseurs actively opt for bit64 package


> library(bit64)
> library(rpolars)
> tmp = mtcars
> tmp[1:2, 1:2] = NA
> tmp[5, 3] = NA
> a = pl$DataFrame(tmp)$null_count()
> a
polars DataFrame: shape: (1, 11)
┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
│ mpg ┆ cyl ┆ disp ┆ hp  ┆ ... ┆ vs  ┆ am  ┆ gear ┆ carb │
│ --- ┆ --- ┆ ---  ┆ --- ┆     ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ u32 ┆ u32 ┆ u32  ┆ u32 ┆     ┆ u32 ┆ u32 ┆ u32  ┆ u32  │
╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
│ 2   ┆ 2   ┆ 1    ┆ 0   ┆ ... ┆ 0   ┆ 0   ┆ 0    ┆ 0    │
└─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘
> as.data.frame(a)
  mpg cyl disp hp drat wt qsec vs am gear carb
1   2   2    1  0    0  0    0  0  0    0    0
> str(as.data.frame(a))
'data.frame':   1 obs. of  11 variables:
 $ mpg :integer64 2 
 $ cyl :integer64 2 
 $ disp:integer64 1 
 $ hp  :integer64 0 
 $ drat:integer64 0 
 $ wt  :integer64 0 
 $ qsec:integer64 0 
 $ vs  :integer64 0 
 $ am  :integer64 0 
 $ gear:integer64 0 
 $ carb:integer64 0 
vincentarelbundock commented 1 year ago

Got it, thanks. I added very simple .null_count() and .estimated_size() tests.

AFAICT, this completes my TODO list. Feel free to let me know; I'm happy to make more changes to this PR if helpful.

sorhawell commented 1 year ago

excellent many thanks @vincentarelbundock

vincentarelbundock commented 1 year ago

Thanks for your guidance, edits, and prompt responses. This was a great contribution experience :)