Closed vincentarelbundock closed 1 year ago
Interesting that DataFrame.first and DataFrame.last is not implemented for py-polars, but they are in rust-polars just as you wrap them.
How should I re-write the slice function below to make length argument optional (None)?
you can do like this (inspired by py-polars wrapper )
fn slice(&self, offset: Robj, length: Robj) -> Result<LazyFrame, String> {
Ok(LazyFrame(self.0.clone().slice(
robj_to!(i64, offset)?,
robj_to!(Option, u32, length)?.unwrap_or(u32::MAX),
)))
}
when I get lost in type conversions, I find it helpful to break up the lines and let rust-analyzer explain what types I have currently made
Missing robj_to_u8 makes rob_to fail when I try to implement DataFrame.std()
you can cut a corner and use the extendr conversion like I did here in the beginning.
we can make this PR into a co-op and then I can add the u8 conversion OR just use extendr in this PR, and I file an issue to upgrade u8 conversions.
Could you supply a minimal example of implementing something like .max() on DataFrameGroupBy?
here's a minimal example + tidying up the GroupBy class
after zzz.R is updated you can do like this
GroupBy_max = function() {
self$agg(pl$all()$max())
}
and use case from py-polars
> df = pl$DataFrame(
+ a = c(1, 2, 2, 3, 4, 5),
+ b = c(0.5, 0.5, 4, 10, 13, 14),
+ c = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
+ d = c("Apple", "Orange", "Apple", "Apple", "Banana", "Banana")
+ )
> df$groupby("d", maintain_order=TRUE)$max()
polars DataFrame: shape: (3, 4)
┌────────┬─────┬──────┬──────┐
│ d ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ bool │
╞════════╪═════╪══════╪══════╡
│ Apple ┆ 3.0 ┆ 10.0 ┆ true │
│ Orange ┆ 2.0 ┆ 0.5 ┆ true │
│ Banana ┆ 5.0 ┆ 14.0 ┆ true │
└────────┴─────┴──────┴──────┘
Thanks for all the extra info! Everything seems to have worked without a hitch.
I used the u8
shortcut for now.
I'll look at a few more candidates for translation and let you know when I think this PR is ready for a review.
@sorhawell This PR now covers most of the trivial translations for DataFrame
, LazyFrame
, and GroupBy
. I think it's a good place to stop for now.
Let me know when you've had time to look at this and I can implement any change you need.
Once this is merged, I'll turn to the S3 methods PR https://github.com/pola-rs/r-polars/pull/107.
Depending on time, energy, etc., I might eventually come back to translate some of the trickier bits, but those will have to wait.
Sorry for the dirty commit history. Made some stupid mistakes with patrick
, but it should work now.
Sorry for the dirty commit history. Made some stupid mistakes with
patrick
, but it should work now.
no problemo, any PR gets squashed anyways :)
looks very good :) I will add to this PR some minor style changes later, I was just derping around in how to commit to a PR
I added some changes to the non-fallible rust methods. These do not need to return a Result, and if default arg value on R side, the extendr wrapper can be used directly as is with flag "use_extendr_wrapper".
with a few tests for null_count and estimated_size, this looks like a wrap :)
patrick tests look beautiful
Great! I will study your changes and comments, and will make changes as appropriate. It may take me a few days because of work + family stuff.
I pushed a simple test for .estimated_size()
.
Here’s something weird: .as_data_frame()
changes the values in the .null_count()
data frame.
Install the PR branch:
remotes::install_github("vincentarelbundock/r-polars@translation")
library(rpolars)
tmp = mtcars
tmp[1:2, 1:2] = NA
tmp[5, 3] = NA
a = pl$DataFrame(tmp)$null_count()
# correct
a
# polars DataFrame: shape: (1, 11)
# ┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
# │ mpg ┆ cyl ┆ disp ┆ hp ┆ ... ┆ vs ┆ am ┆ gear ┆ carb │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
# │ 2 ┆ 2 ┆ 1 ┆ 0 ┆ ... ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
# └─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘
# incorrect
a$as_data_frame()
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 9.881313e-324 9.881313e-324 4.940656e-324 0 0 0 0 0 0 0 0
Any ideas why we get this?
Sorry, I merged #112 because NEWS was not updated correctly. Could you please move the news item to the correct position?
I just reverted the NEWS commit. Not sure what's the best process to add it back, but for the record, here is what I had written:
- New methods implemented for DataFrame, LazyFrame, and GroupBy objects: min, max, mean, median, sum, std, var, first, last, head, tail, reverse, slice, null_count, estimated_size (#105 @vincentarelbundock)
## New Contributors
- @grantmcdermott made their first contribution in #81
- @vincentarelbundock made their first contribution in #105
Ah I see what you mean @eitsupi . Thanks for the merge. I updated NEWS in the right position.
nope, still broken. Giving up on this now. We can update the NEWS at the very end.
Here’s something weird: .as_data_frame() changes the values in the .null_count() data frame.
Currently r-polars maps arrow-u32 to R bit64::nteger64 which again is a hack to get i64 in R. Under the hood bit64::integer64 is just an i64 which is tagged with "SexpReal", if bit64 is not loaded the user is in for a surprise.
rpostgresql on other packages does the same. I prefer in future to map to f64 and let the connoisseurs actively opt for bit64 package
> library(bit64)
> library(rpolars)
> tmp = mtcars
> tmp[1:2, 1:2] = NA
> tmp[5, 3] = NA
> a = pl$DataFrame(tmp)$null_count()
> a
polars DataFrame: shape: (1, 11)
┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
│ mpg ┆ cyl ┆ disp ┆ hp ┆ ... ┆ vs ┆ am ┆ gear ┆ carb │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
│ 2 ┆ 2 ┆ 1 ┆ 0 ┆ ... ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘
> as.data.frame(a)
mpg cyl disp hp drat wt qsec vs am gear carb
1 2 2 1 0 0 0 0 0 0 0 0
> str(as.data.frame(a))
'data.frame': 1 obs. of 11 variables:
$ mpg :integer64 2
$ cyl :integer64 2
$ disp:integer64 1
$ hp :integer64 0
$ drat:integer64 0
$ wt :integer64 0
$ qsec:integer64 0
$ vs :integer64 0
$ am :integer64 0
$ gear:integer64 0
$ carb:integer64 0
Got it, thanks. I added very simple .null_count() and .estimated_size() tests.
AFAICT, this completes my TODO list. Feel free to let me know; I'm happy to make more changes to this PR if helpful.
excellent many thanks @vincentarelbundock
Thanks for your guidance, edits, and prompt responses. This was a great contribution experience :)
Translated and tested:
DataFrame
:max
,mean
,median
,min
,sum
,var
,std
,first
,last
,head
,tail
,reverse
,slice
,null_count
,estimated_size
LazyFrame
:max
,mean
,median
,min
,sum
,var
,std
,first
,last
,head
,tail
,reverse
,slice
GroupBy
:max
,mean
,median
,min
,sum
,var
,std
,first
,last
,null_count
TODO:
.quantile
:QuantileInterpolOptions
inputis_duplicated()
: How do we return aChunkedArray
to R output?