Closed eitsupi closed 5 months ago
Not familiar with savvy
so I don't see how it would simplify the code while keeping the same functionalities, can you add some examples of simplification?
@etiennebacher It would be quicker to look at a real example. The following two R packages are based on savvy and DataFusion^1. https://github.com/yutannihilation/datafusion-r https://github.com/eitsupi/r-glaredb
For example, implementing the Arrow C stream interface is much easier than extendr. The following few lines will pull nanoarrow_array_stream, etc. into the Rust side.
use arrow::ffi_stream::{ArrowArrayStreamReader, FFI_ArrowArrayStream};
let stream_reader = unsafe {
let stream = savvy::ExternalPointerSexp::try_from(stream_ptr)?
.cast_mut_unchecked::<FFI_ArrowArrayStream>();
ArrowArrayStreamReader::from_raw(stream).map_err(|e| e.to_string())?
};
It would also allow for more general Rust rules for file structure, such as the ability to split impl blocks into multiple files. (extendr/extendr#739)
Currently, the file structure on the Rust side is quite different from that of py-polars and uses a uniquely defined error type, so it is necessary to write functions according to completely different rules than for general packages using extendr. This is fine for this package alone (?), but if we are creating something like pyo3-polars, I expect that similar code will be required for that package as well, making maintenance quite cumbersome. So I think it might be worthwhile to completely rewrite r-polars in savvy to be a copy of py-polars.
Another thing I am thinking is that just as Python Polars have, for example, the DataFrame class (defined in Python) and the (internal) PyDataFrame class (defined in Rust), it would be better to have a dual structure for each class in R Polars. Note: extendr and savvy have different ways of representing the associated functions of Struct on the R side (I feel that savvy is more appropriate).
@etiennebacher Although only basic type conversion from R to Polars has been implemented yet, I have created a polars binding using savvy.
Note that the directory structure matches that of py-polars. This is not possible with extendr, which cannot have multiple impl blocks (extendr/extendr#538). https://github.com/eitsupi/neo-r-polars/blob/f2b726f9493285b33fe75c42b8bc2026722b40eb/src/rust/src/series/construction.rs#L7-L8 (ref. https://github.com/pola-rs/polars/blob/9af88f1fdee6f3af5b4e5ac646ecd13fce80bca0/py-polars/src/series/construction.rs#L58-L59) https://github.com/eitsupi/neo-r-polars/blob/f2b726f9493285b33fe75c42b8bc2026722b40eb/src/rust/src/series/mod.rs#L30-L31 (ref. https://github.com/pola-rs/polars/blob/9af88f1fdee6f3af5b4e5ac646ecd13fce80bca0/py-polars/src/series/mod.rs#L70-L71)
Matching py-polars will make it easier to follow upstream and contribute. The current implementation of r-polars is too complex.
Thanks @eitsupi, that looks great and much easier to follow and understand than the current Rust implementation (so far), thanks! I won't have much time (or skill anyway) to explore this in the next few days but I have a few questions coming to my mind (all of this supposing we move to savvy
):
impl
for Expr, DataFrame, etc. need to change? In other words can we just copy-paste the big impl
blocks we already have? Edit: just saw in https://github.com/eitsupi/neo-r-polars/blob/f2b726f9493285b33fe75c42b8bc2026722b40eb/src/rust/src/series/mod.rs#L30-L31 that apparently we would just need to replace RResult<Self>
by savvy::Result<Self>
r-polars
extension that relies on extendr
while we rely on savvy
? Are they interoperable?unwrap()
thing in R would still be necessary)extendr_concurrent.rs
to savvy
? That's the biggest black box to meThe current implementation of r-polars is too complex.
100% agree
Thanks for checking that.
- does the structure of
impl
for Expr, DataFrame, etc. need to change? In other words can we just copy-paste the bigimpl
blocks we already have? Edit: just saw in eitsupi/neo-r-polars@f2b726f
/src/rust/src/series/mod.rs#L30-L31 that apparently we would just need to replaceRResult<Self>
bysavvy::Result<Self>
Basically, yes, but the type of input is different.
Currently r-polars takes input from R as Robj
, but savvy requires the use of different arguments for each type.
https://yutannihilation.github.io/savvy/guide/key_ideas.html
For example, in the following, you will see that the integer and double vectors are converted to Series in separate functions (I tried to see if this could be handled in a single function, but it seemed impossible. But I think separating the functions makes the correspondence between one S3 method and one Rust function clearer and easier to understand). https://github.com/eitsupi/neo-r-polars/blob/f2b726f9493285b33fe75c42b8bc2026722b40eb/src/rust/src/series/construction.rs#L14-L28
- should we expect anything problematic if someone wants to make an
r-polars
extension that relies onextendr
while we rely onsavvy
? Are they interoperable?
As long as the Arrow C interface is used, there should be no problem (currently the C interface is intended to be used via nanoarrow, see #1138)
- would the transition be in several steps or one massive PR to change all the Rust part?
I don't think a phased transition is possible. Almost everything needs to be rewritten in a huge PR / switch the branch. Or I was imagining something like having the two exist in parallel in completely separate branches (like v1 and v2 of Docker Compose).
- would the structure of errors change? (I mean by that whether the whole
unwrap()
thing in R would still be necessary)
This is both an advantage and a disadvantage, but the error structure is completely different. That is, savvy, unlike extendr, has the ability from the start to handle errors so we can use it to abandon our own Result type, but it does not provide anything like the current stack trace. https://yutannihilation.github.io/savvy/guide/error.html The current implementation should be better with respect to error messages.
- is it possible to translate
extendr_concurrent.rs
tosavvy
? That's the biggest black box to me
This seems to be related to the Rust code to bypass R's single-threaded limitation to execute R code? I think the root cause is that R is single-threaded but there is no such thing as a Python GIL, so perhaps something like this is needed everywhere (I don't understand it either)
I don't think there is anything extendr-specific that can be translated, but I don't know if this is the right implementation to begin with (like the segmentation fault we recently got on Linux).
Compared to the above example, the error message would look like this:
> as_polars_series(geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))
Error in as_polars_series.default(geos::as_geos_geometry("LINESTRING (0 1, 3 9)")) :
Unsupported class: geos_geometry
> rlang::global_entrace()
> as_polars_series(geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))
Error in `as_polars_series.default()`:
! Unsupported class: geos_geometry
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error in `as_polars_series.default()`:
! Unsupported class: geos_geometry
---
Backtrace:
▆
1. ├─neopolars::as_polars_series(geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))
2. └─neopolars:::as_polars_series.default(geos::as_geos_geometry("LINESTRING (0 1, 3 9)")) at neo-r-polars/R/as_polars_series.R:3:3
> as_polars_df(list(x = geos::as_geos_geometry("LINESTRING (0 1, 3 9)")))
Error in `as_polars_series.default()`:
! Unsupported class: geos_geometry
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error in `as_polars_series.default()`:
! Unsupported class: geos_geometry
---
Backtrace:
▆
1. ├─neopolars::as_polars_df(list(x = geos::as_geos_geometry("LINESTRING (0 1, 3 9)")))
2. └─neopolars:::as_polars_df.list(list(x = geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))) at neo-r-polars/R/as_polars_df.R:3:3
3. ├─neopolars:::wrap(PlRDataFrame$init(lapply(x, function(column) as_polars_series(column)$`_s`))) at neo-r-polars/R/as_polars_df.R:13:3
4. ├─PlRDataFrame$init(lapply(x, function(column) as_polars_series(column)$`_s`)) at neo-r-polars/R/series-series.R:4:3
5. │ └─neopolars:::.savvy_wrap_PlRDataFrame(...) at neo-r-polars/R/000-wrappers.R:58:3
6. └─base::lapply(x, function(column) as_polars_series(column)$`_s`) at neo-r-polars/R/000-wrappers.R:43:3
7. └─neopolars (local) FUN(X[[i]], ...)
8. ├─neopolars::as_polars_series(column) at neo-r-polars/R/as_polars_df.R:13:13
9. └─neopolars:::as_polars_series.default(column) at neo-r-polars/R/as_polars_series.R:3:3
(This is not an appropriate example since it seems unlikely that we can create an error-prone situation within Rust at the moment, but anyway, the Result type on the R side is not necessary.)
I implemented <Series>$struct$unnest()
to see how the error looks from the Rust side (https://github.com/eitsupi/neo-r-polars/commit/43efe43c1c3ebbe6dbdf0c34357f10f72aed0919).
> as_polars_series(mtcars)$struct$unnest()
shape: (32, 11)
┌──────┬─────┬───────┬───────┬───┬─────┬─────┬──────┬──────┐
│ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪═════╪═══════╪═══════╪═══╪═════╪═════╪══════╪══════╡
│ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0 ┆ 4.0 │
│ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0 ┆ 4.0 │
│ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0 ┆ 1.0 │
│ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 0.0 ┆ 3.0 ┆ 1.0 │
│ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 3.0 ┆ 2.0 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 5.0 ┆ 2.0 │
│ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 4.0 │
│ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 6.0 │
│ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 8.0 │
│ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0 ┆ 2.0 │
└──────┴─────┴───────┴───────┴───┴─────┴─────┴──────┴──────┘
> as_polars_series(1)$struct$unnest()
Error: SchemaMismatch(ErrString("invalid series dtype: expected `Struct`, got `f64`"))
> rlang::global_entrace()
> as_polars_series(1)$struct$unnest()
Error:
! SchemaMismatch(ErrString("invalid series dtype: expected `Struct`, got `f64`"))
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error:
! SchemaMismatch(ErrString("invalid series dtype: expected `Struct`, got `f64`"))
---
Backtrace:
▆
1. └─as_polars_series(1)$struct$unnest()
2. └─neopolars:::series_struct_unnest(.self) at neo-r-polars/R/series-struct.R:5:19
3. ├─neopolars:::wrap(self$`_s`$struct_unnest()) at neo-r-polars/R/series-struct.R:12:3
4. └─self$`_s`$struct_unnest() at neo-r-polars/R/series-series.R:4:3
5. └─neopolars:::.savvy_wrap_PlRDataFrame(...) at neo-r-polars/R/000-wrappers.R:72:5
I took a look at the code in your neo-r-polars repo but didn't play a lot with it yet. I'm trying to see how this transition to savvy
would work and how we should organize that.
I think it'd be nice to have a doc that summarizes why we want to move to savvy
(your comments above could be a good basis for that). I get there are some advantages to do so, e.g. better file organization, but it's not obvious why this can't be improved while keeping the extendr
framework. You mentioned that
Note that the directory structure matches that of py-polars. This is not possible with extendr, which cannot have multiple impl blocks
but maybe this could be implemented in extendr
upstream instead? (I see there's some discussion going on in https://github.com/extendr/extendr/issues/538 so there might be some development there soon).
Also, I'm not confident I can adapt some more complex code (e.g. extendr_concurrent.rs
) in savvy
.
This is not to say that this isn't worth it, but it's quite a massive rewrite so we should be clear why we do it.
We could list the type of changes that would be necessary. You already started making a proof-of-concept so you know more what is needed. Feel free to edit this list.
Currently r-polars takes input from R as Robj, but savvy requires the use of different arguments for each type.
I suppose this means we need input checking on the R side. It shouldn't be too hard to implement if we use rlang
standalone functions to check types, but it will be annoying to do so given the size of the code base.
extendr_concurrent.rs
stuffSupposing we move to savvy
, making PRs to a separate branch for savvy
and then merging to main
when the transition is complete makes the most sense to me. Still, this means that while we focus on savvy
we don't add new features since we would have to implement them both in main
and in savvy
(not sure this is clear).
It would be nice if neo-r-polars
could be a minimal PoC with 1-2 methods for Expr
, DataFrame
, etc. (maybe this is already what you had in mind). Basically, we want to be sure that:
Once all of this is guaranteed in this minimal PoC, then we can move on and add all the Expr/DataFrame/... stuff that requires a lot of tedious (but relatively easy) work. Until we have this, I'm not sure I'll spend much time on neo-r-polars
given my limited time.
What do you think about all this?
Sidenote about errors: would it make sense to return a string with a specific "error" attribute? This would allow more customization of the error message in R. For instance, if we use rlang
then we can make the error message prettier and reformulate some stuff:
foo_unwrap <- function(x) {
error <- attributes(x)$error_type
if (is.null(error)) {
return(x)
}
err_type <- switch(
error,
"ColumnNotFound" = "Column not found",
"ComputeError" = "Operation failed"
)
err_type <- paste0(err_type, ":")
rlang::abort(
c(
err_type,
" " = x
)
)
}
# Function worked: no msg
foo_unwrap(1)
#> [1] 1
# Function failed
error_output <- c("a")
attr(error_output, "error_type") <- "ColumnNotFound"
foo_unwrap(error_output)
#> Error in `foo_unwrap()`:
#> ! Column not found:
#> a
# Function failed
error_output <- c("arithmetic on string and numeric not allowed, try an explicit cast first")
attr(error_output, "error_type") <- "ComputeError"
foo_unwrap(error_output)
#> Error in `foo_unwrap()`:
#> ! Operation failed:
#> arithmetic on string and numeric not allowed, try an explicit cast first
Thanks for taking a look at this. I will write about motivation in the README once the minimum functionality (e.g. the examples in the README of this repository) is implemented.
I suppose this means we need input checking on the R side.
Yes. So basically we need to implement the branch using the S3 method.
For example, the default method of as_polars_series()
raises an error.
https://github.com/eitsupi/neo-r-polars/blob/437680d2573f09ee0b08e301937d1b480b85b54f/R/as_polars_series.R
I think it is clearer compared to the current implementation which has a huge match arm on the Rust side.
@etiennebacher Thanks for considering the error messages. I obviously lack experience with rlang and am struggling to understand the basic strategy.
I added rlang imports because I obviously needed the power of rlang's list2 when implementing GroubBy's methods (https://github.com/eitsupi/neo-r-polars/commit/27d015e9b72b9ec0f0aeadfc96083485f494f983).
Using try_fetch within the wrap function, which is always used when wrapping a Rust object to an R object, I found that I could supplement the error with abort as follows.
> list(a = geos::as_geos_geometry("LINESTRING (0 1, 3 9)")) |> as_polars_df()
Error in `as_polars_df()`:
! Evaluation failed.
Caused by error in `FUN()`:
! Unsupported class for `as_polars_series()`: geos_geometry
Run `rlang::last_trace()` to see where the error occurred.
> as_polars_df(mtcars)$group_by("cyl")$head(-1)
Error:
! Evaluation failed.
Caused by error:
! n should not be negative
Run `rlang::last_trace()` to see where the error occurred.
Note that "n should not be negative" here is an error that occurred on the Rust side. https://github.com/eitsupi/neo-r-polars/blob/5ca4a19dc81fddf46a7a4ee9cc9137d8a6a0a599/src/rust/src/conversion/mod.rs#L116-L130
> as_polars_df(mtcars)$group_by("cyl")$head(-Inf)
Error:
! Evaluation failed.
Caused by error:
! n should not be negative
Run `rlang::last_trace()` to see where the error occurred.
> as_polars_df(mtcars)$group_by("cyl")$head(Inf)
Error:
! Evaluation failed.
Caused by error:
! n should not be greater than 18446744073709551615
Run `rlang::last_trace()` to see where the error occurred.
> as_polars_df(mtcars)$group_by("cyl")$head(NaN)
Error:
! Evaluation failed.
Caused by error:
! n should not be NaN
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/rlang_error>
Error:
! Evaluation failed.
Caused by error:
! n should not be NaN
---
Backtrace:
▆
1. └─as_polars_df(mtcars)$group_by("cyl")$head(NaN)
2. └─self$df$lazy()$group_by(!!!self$by, maintain_order = self$maintain_order)$head(n) at neo-r-polars/R/dataframe-group_by.R:28:3
3. ├─neopolars:::wrap(self$lgb$head(n)) at neo-r-polars/R/lazyframe-group_by.R:27:3
4. │ └─rlang::try_fetch(...) at neo-r-polars/R/utils-wrap.R:2:3
5. │ ├─base::tryCatch(...)
6. │ │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
7. │ │ └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
8. │ │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
9. │ └─base::withCallingHandlers(...)
10. └─self$lgb$head(n)
11. └─neopolars:::.savvy_wrap_PlRLazyFrame(...) at neo-r-polars/R/000-wrappers.R:433:5
Run rlang::last_trace(drop = FALSE) to see 4 hidden frames.
I think it would be great to have a dedicated error message for each method, but given that there is a standard rlang backtrace, this seems sufficient to me.
Thanks to @etiennebacher, error handling based on rlang has been implemented, and I explained Motivation and current status in the README. https://github.com/eitsupi/neo-r-polars
@sorhawell @vincentarelbundock @Sicheng-Pan @grantmcdermott Could you take a look at that? If we switch to savvy, the most distressing thing is that we will lose the "collect in background" feature that @Sicheng-Pan implemented in GSoC 2023 in the short term. (Now that polars can pass query plans to other computers without actively evaluating them, I suppose it is possible to implement this in a way that passes the query plan directly to other processes.)
@eitsupi Thanks for the prototype, and it looks very promising. Probably we can implement collect_in_background
with the new polars feature.
Even if we stick to the current implementation, it seems possible to migrate without too much effort, if we disable the support for the usage of R functions in queries (just for this feature)
I haven’t been able to test it in-depth, but my high-level feedback is that this all sounds good to me. Ensuring a healthy dev experience is something that we should be prioritising first and foremost. If the switch to savvy and the addition of the rlang dependency make life easier for you @eitsupi (and you @etiennebacher), then that’s all to the good. It’s infinitely preferable to make some dependency changes than overwhelm the primary maintainer(s). I’ll try to test neo-polars on some of my existing codebases when I get a sec. But if the test suite is already passing then I say go for it :-)
Thanks all for your feedback. That prototype seems to be working generally well and I hope to push it as a new branch in this repository within the next few days.
I have opened a new issue #1152 because this rewrite is not just a migration to savvy, but also complete change of structure on the R side.
I don't think this can be accomplished in a short term, but it would be far easier to maintain by eliminating the large amount of code written to return a Result type to R with extendr. I also think we can simplify the source code in several other points.