Rewrite based on savvy and rlang

eitsupi commented 5 months ago

Continued from #942 and #1126

(The following text is copied from https://github.com/eitsupi/neo-r-polars's README)

Motivation

I have been developing r-polars for over a year, and I felt that a significant rewrite was necessary. r-polars is a clone of py-polars, but the package structure is currently quite different. Therefore, it was difficult to keep up with frequent updates.

I thought that now, around the release of Python Polars 1.0.0, is a good time for a complete rewrite, so I decided to try it.

There are several reasons to rewrite r-polars on both the Rust and R sides.

Rust side

Appropriate file division. Due to the limitations of extendr, it is not possible to place multiple impl blocks. (extendr/extendr#538)
Error handling. There is a lot of custom code to use the Result type with extendr, which is quite different from other packages based on extendr. (extendr/extendr#650)
Simplify type conversion. The code is difficult to follow because it uses a macro called robj_to for type conversion (at least in rust-analyzer).

About 1 and 2, I expect that switching from extendr to savvy will improve the situation.

For 3, in py-polars and nodejs-polars, a thin Wrap struct wraps other types and processes them with standard From traits etc., which I think makes the code cleaner.

R side

The structure of classes. In py-polars, the strategy is that classes defined on the Rust side (e.g., PyDataFrame) are wrapped by classes defined on the Python side (e.g., DataFrame). In r-polars, a complex strategy is adopted to update classes created by Rust side/extendr (e.g., RPolarsDataFrame) with a lot of custom code. (This is also related to the fact that extendr makes associated functions of Rust structs members of R classes. savvy does not mix associated functions and methods.)
S3 methods first. This is also related to the Rust side, in the current r-polars, generic functions like as_polars_series were added later, so there are several places where type conversion from R to Polars is done on the Rust side, making it difficult to understand where the type conversion is done. If type conversion from R to Polars is done with two generic functions, as_polars_series and as_polars_expr, the code will be much simpler and customization from the R side will be possible.
Error handling. Currently, r-polars has its own Result type on the R side, and error handling is done through it. The backtrace generated that is quite easy to understand, but it is not necessarily easy to use when using polars internally in other packages, such as testthat::expect_error().
Based on rlang. Currently, r-polars has no R package dependencies. This is great, but that includes a degraded copy of list2() instead of the convenient functions in the rlang package. rlang is a lightweight R package, and I feel that it is more beneficial to depend on the convenient functions of rlang than to stick to no dependencies.

1 and 3 are also related to the fact that it is built with extendr, and it seems that switching to savvy is appropriate here as well. If we abandon the current Result type on the R side, it is natural to use rlang for error handling, so from that perspective, it is reasonable to depend on rlang in 4.

etiennebacher commented 5 months ago

Thanks for all the work, I'll take a look this weekend after #1147

eitsupi commented 5 months ago

Significant changes that may affect users are noted in the NEWS file. https://github.com/eitsupi/neo-r-polars/blob/60502b13e9c21ef38c733abc24477671e0e13a2a/NEWS.md

eitsupi commented 4 months ago

I pushed the neo-r-polars' main branch to this repository so that we can start the migration work. https://github.com/pola-rs/r-polars/commit/f7316d3da98cac1b373cdeb586f4033435a88bc9

For now, the new next branch lacks developer tools and other settings, and switching to this branch in the local repository of r-polars should detect a large number of diffs due to differences in .gitignore, so I recommend that do not switch to the branch.

eitsupi commented 4 months ago

Documentation for translation from Python Polars has been pushed. I hope it comes across as fairly simple. https://github.com/pola-rs/r-polars/blob/63790cd069cd736e650a5aabf7900f197490ece9/DEVELOPMENT.md

etiennebacher commented 4 months ago

Thanks, that is very helpful. I noticed that the class names have changed, e.g from RPolarsExpr to polars_expr. Is there any reason for this? I think the current class names are already clear and unique enough to avoid collisions with classes from other packages.

Similarly, the current Struct is called RPolarsExpr but in the rewrite it is PlRExpr. I like the fact that those names are identical on the R and Rust sides, is it possible to keep them like this?

etiennebacher commented 4 months ago

Also, related to the post on Mastodon asking for help, I think it would be helpful to have a checklist (either at the top of this issue or in a new one) with more details on the progress of the rewrite. It doesn't have to be super detailed but simply have an idea of how far we are, where should a new contributor focus, etc. For instance something like this:

[ ] DataFrame
- [ ] translation
- [ ] tests
[ ] LazyFrame
- [ ] translation
- [ ] tests
[ ] Expr
- [ ] str
- [ ] translation
- [ ] tests
- [ ] arr
- [ ] ...

eitsupi commented 4 months ago

Thanks for taking a look at this. I will create the task list at another epic issue soon as I can, as it is definitely needed.

I noticed that the class names have changed, e.g from RPolarsExpr to polars_expr. Is there any reason for this? I think the current class names are already clear and unique enough to avoid collisions with classes from other packages.

There are two reasons.

To avoid conflicts between the current polars package and the neopolars package. This is also why the names of the structs on the Rust side is something like PlRExpr. This justification disappears when there is no longer a need for coexistence (i.e., when the neopolars package is renamed to polars).
Because the class name of CamelCase is not considered to be very common in R. In other words, the current class names such as RPolarsExpr are a compromise due to the restrictions of the extendr that Rust's struct and R's classes cannot be named separately, and the need to avoid class name conflicts with other packages. But if class names can be given only on the R side I feel that the snake_case is more appropriate than the CamelCase. Of course, this is debatable.

Similarly, the current Struct is called RPolarsExpr but in the rewrite it is PlRExpr. I like the fact that those names are identical on the R and Rust sides, is it possible to keep them like this?

No, these are actually different things. That is, polars_expr and PlRExpr already exist on the R side, and PlRExpr is an internal class that is not normally touched by the user. This is the same as Python Polars having two classes: the Expr class and the PyExpr class (from the Rust struct).

In Python:

>>> import polars as pl
>>> pl.lit(1).__class__.__name__
'Expr'
>>> pl.lit(1)._pyexpr.__class__.__name__
'PyExpr'

In R:

> pl$lit(1) |> class()
[1] "polars_expr"   "polars_object"

> pl$lit(1)$`_rexpr` |> class()
[1] "PlRExpr"

Unlike the current RPolarsExpr, PlRExpr is an internal class, so I have tentatively named it PlRExpr because we think it is safe even if it does not contain the word "polars" (although it is an internal class, it is not without risk of collision, so I have not made it RExpr).

etiennebacher commented 4 months ago

I see, thanks for the explanations.

There are two reasons.

To avoid conflicts between the current polars package and the neopolars package. This is also why the names of the structs on the Rust side is something like PlRExpr. This justification disappears when there is no longer a need for coexistence (i.e., when the neopolars package is renamed to polars).

Because the class name of CamelCase is not considered to be very common in R. In other words, the current class names such as RPolarsExpr are a compromise due to the restrictions of the extendr that Rust's struct and R's classes cannot be named separately, and the need to avoid class name conflicts with other packages. But if class names can be given only on the R side I feel that the snake_case is more appropriate than the CamelCase. Of course, this is debatable.

I don't think the style of class names is important. To me, the two main requirements is that they must be clear and unique, and I think RPolarsExpr and variants match those requirements. We already break a lot of code at each release, I'd rather not break the class names in addition to that.

Bottom line, I'd rather keep the current RPolars* names. Open to discuss too ofc.

Sidenote: I didn't find conventions for class names (in general, I feel R code lacks conventions anyway). I also didn't find packages with CamelCase class names (can't say I searched a lot though) but some prominent packages do not use snake_case:

class(Matrix::Matrix(0, 3, 2))
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

Packages using S4 (like those on BioConductor) also usually have camel case.

eitsupi commented 4 months ago

I didn't find conventions for class names (in general, I feel R code lacks conventions anyway).

I think the most commonly used style guides are the Tidyverse style guide and Google's Style guide (of course Google publishes Style guides for many languages and I believe they are very popular).

https://style.tidyverse.org/syntax.html#object-names https://google.github.io/styleguide/Rguide.html

Both of these encourage the use of snake_case for class names.

ju6ge commented 3 months ago

Hi @eitsupi @etiennebacher ,

I am not convinced that switching from extendr to savy will solve the maintenance problems. First of I have taken a quick look at savvy and neo-r-polars. A rewire in of it's own is already a quite daunting task, and generating the bindings is the simple part. The hard part is doing integrating the code such that it feels "natural" in the R world. This is where most of the complexity of r-polars comes from. Generating the bindings with savvy is a lot more manual that doing so with extendr. Let me explain:

// savvy taken from `neo-r-polars`
#[savvy]
#[derive(Clone)]
pub struct PlRDataFrame {
    pub df: DataFrame,
}

impl From<DataFrame> for PlRDataFrame {
    fn from(df: DataFrame) -> Self {
        Self { df }
    }
}

impl TryFrom<EnvironmentSexp> for &PlRDataFrame {
    type Error = String;

    fn try_from(env: EnvironmentSexp) -> Result<Self, String> {
        let ptr = env
            .get(".ptr")
            .expect("Failed to get `.ptr` from the object")
            .ok_or("The object is not a valid polars data frame")?;
        <&PlRDataFrame>::try_from(ptr).map_err(|e| e.to_string())
    }
}

// extendr from current `r-polars`
#[derive(Debug, Clone)]
pub struct RPolarsDataFrame(pub pl::DataFrame);

impl From<pl::DataFrame> for RPolarsDataFrame {
    fn from(item: pl::DataFrame) -> Self {
        RPolarsDataFrame(item)
    }
}

The difference here is that for savvy we have to covert from R to Rust datatypes manually. While in extendr this is done for us. This is not just the case for self defined structs but for all other datatypes as well. The rust part of the code needs to expect Sexp and R specific datatypes in function signatures and convert them to rust native types manually. This increases the complexity on the rust side while also polluting the function interface on the rust side. In extendr I can just define a function to accept a rust String or u64 and extendr will handle the conversion.

Yes there are limitations to extendr but I the people from extendr are interested in the pain points of r-polars and are willing to improve their library to our needs. I think it is a lot easier to work with them. That is what @CGMossa is suggesting. Switching to savvy does not provide a clear benefit.

Rust side

1. Appropriate file division. Due to the limitations of [extendr](https://github.com/extendr/extendr),
   it is not possible to place multiple impl blocks.
   ([extendr/extendr#538](https://github.com/extendr/extendr/issues/538))

2. Error handling.
   There is a lot of custom code to use the Result type with extendr, which is quite different from other packages based on extendr.
   ([extendr/extendr#650](https://github.com/extendr/extendr/issues/650))

Both of these things are things where extendr is working on solutions that we can test and improve. extendr wants the feedback of r-polars for their implementation so they can move forward merging the proposals confidently.

The rest of the point are not issues with extendr but with the current implementation on r-polars side. There is no need for a different bindings backend to improve the state of affairs here.

3. Simplify type conversion.
   The code is difficult to follow because it uses a macro called `robj_to` for type conversion (at least in rust-analyzer).

Then the solution is to remove the macro, it is the same task no matter the bindings backend.

For 3, in py-polars and nodejs-polars, a thin Wrap struct wraps other types and processes them with standard From traits etc., which I think makes the code cleaner.

I don't get your point here. extendr is very similar to pyo3. I know because I am maintaining an internal library that exports rust types to python and r at the same time.

All of the points below, are things that need to be improved on the r-polars side of things anyways. I think it would be a lot simpler to address them step by step using extendr without switching to savvy. Switching to savvy and addressing all of the points below is a lot more work that just changing things step by step. The switch to savvy will like involve replacing everything in extendr to savvy first and then addressing the things below, so why not just address the things below directly instead. I don't think that anything on that list is easier with savvy. It is the same amount of work either way.

R side

1. The structure of classes.
   In py-polars, the strategy is that classes defined on the Rust side (e.g., `PyDataFrame`) are wrapped by classes defined on the Python side (e.g., `DataFrame`).
   In r-polars, a complex strategy is adopted to update classes created by Rust side/extendr (e.g., `RPolarsDataFrame`) with a lot of custom code.
   (This is also related to the fact that extendr makes associated functions of Rust structs members of R classes. savvy does not mix associated functions and methods.)

2. S3 methods first.
   This is also related to the Rust side, in the current r-polars, generic functions like `as_polars_series` were added later,
   so there are several places where type conversion from R to Polars is done on the Rust side, making it difficult to understand where the type conversion is done.
   If type conversion from R to Polars is done with two generic functions, `as_polars_series` and `as_polars_expr`, the code will be much simpler and customization from the R side will be possible.
3. Error handling.
   Currently, r-polars has its own Result type on the R side, and error handling is done through it.
   The backtrace generated that is quite easy to understand, but it is not necessarily easy to use when using polars internally in other packages, such as `testthat::expect_error()`.
4. Based on `rlang`.
   Currently, r-polars has no R package dependencies. This is great,
   but that includes [a degraded copy of `list2()`](https://github.com/pola-rs/r-polars/blob/6eac27a0766d2b6ca92a72c1c7fa76eaeb58bb98/R/dotdotdot.R#L1-L20)
   instead of the convenient functions in the `rlang` package.
   `rlang` is a lightweight R package, and I feel that it is more beneficial to depend on the convenient functions of `rlang` than to stick to no dependencies.

The last point I have is, that it very unclear how savvy will play together with extendr. What happens if another rust library that exports bindings to R using extendr wants to export a polars dataframe as the result of a function call. Currently this is possible, but if r-polars uses savvy it is unclear what will happen. Savvy would probably need to enable compatibility with extendr which seems unlikely. Or the exporting library would need to switch to savvy as well.

struct RDataQuery { … }

#[extendr]
impl RDataQuery {
   …
   pub fn fetch() -> RPolarsDataframe {
    …
    } 
}

Anyway this are my thoughts on this proposal, if you think I am wrong and want to this either way I am not going to stop you. But I don't have the time to help with the transition. My priority is in maintaining the code I have, this is where my motivation to contribute to r-polars is coming from. But switching to savvy adds a lot of maintenance on my end as a user of this library as well. I am willing to contribute to r-polars and keep extendr up to date and work with extendr to resolve the issues r-polars is having.

I am going to leave you with this quote, I don't know who to attribute it to, but I think it applies to many situations ;)

The sum of all problems stays constant

eitsupi commented 3 months ago

@ju6ge Thank you for looking at this.

The difference here is that for savvy we have to covert from R to Rust datatypes manually.

Perhaps you have a misunderstanding about savvy. savvy can take R objects based on Rust structs as arguments. https://yutannihilation.github.io/savvy/guide/struct.html

extendr actually requires more complex processing to raise the appropriate error. Please compare the following two.

Here, in the next branch implementation, we can take PlRExpr directly as an argument, but in the main branch, we need to try to convert it using a macro after receiving Robj. This is due to a limitation in the extendr implementation and cannot be removed as long as extendr is used.

Yes there are limitations to extendr but I the people from extendr are interested in the pain points of r-polars and are willing to improve their library to our needs. I think it is a lot easier to work with them.

I think I've been contributing to the extendr organization for the past few years, but I don't think so. I think at least it's harder than rewriting r-polars based on savvy.

The last point I have is, that it very unclear how savvy will play together with extendr. What happens if another rust library that exports bindings to R using extendr wants to export a polars dataframe as the result of a function call. Currently this is possible, but if r-polars uses savvy it is unclear what will happen. Savvy would probably need to enable compatibility with extendr which seems unlikely. Or the exporting library would need to switch to savvy as well.

Please use the Arrow C stream interface. https://arrow.apache.org/docs/format/CStreamInterface.html

As you are probably aware, many of the issues are major rewrites that have nothing to do with the choice between extendr and savvy. On the other hand, even if we continue to use extendr, the effort will hardly change, so I think it is time to switch to savvy (the current code can hardly be reused, I don't understand the benefits of continuing to use extendr at all)

CGMossa commented 3 months ago

Let me also put my support for the continuing collaboration between r-polars and extendr.

As a side note: @ju6ge Discord is not the main source of discussion, decisions, and debates for extendr. Our GitHub issues, PRs, and discussions are the main sources for communications. I personally like to have an informal environment to discuss things liberally, but there is nothing we have going on right now, that is not summarised (and sometimes concisely) on GitHub. I think open-source is great, but we strive for open collaboration as well.

I must admit, the next 6 weeks, I'm very occupied with wrapping up my PhD thesis, but after that, I will definitely produce anything necessary on our end, to continue the collaboration.

There is a lot of history together with R-polars. We grew up together, to use a soft-phrase. I admit, I was pretty junior in the beginning, but now, I'm much more able to accommodate wishes that strain the current status quo.

For instance, the requested feature of having #[extendr] work on multiple impl-blocks is something I've experimented with, and debated, and suggested designs several times. We will definitely get there, there are some concessions that needs to be made, but I am very optimistic.

Lastly, I'd just like to declare, that I welcome any ping from you @ju6ge (or anyone from r-polars), and that I have now adopted the policy to always reply, regardless if I have a robust, solid, and final solution to the request given.

ju6ge commented 3 months ago

@eitsupi

extendr https://github.com/pola-rs/r-polars/blob/086d6df8a925949d9d61634d44af0f68ff667735/src/rust/src/lazy/whenthen.rs#L27-L33

There is no reason to do this so convoluted with extendr. You can just do it like this:

#[extendr]
impl RPolarsWhen {
   pub fn new(condition: RPolarsExpr) -> RResult<RPolarsWhen> {
      Ok(RPolarsWhen { inner: dsl::when(condition.inner) })
   }
}

The only reason to use Robj directly is when excepting different kinds of R datatypes in the same function call and dynamically converting to something that polars understands. If using Robj in an function is required the same problems will apply to savvy.

What I am talking about with the Sexpr types is that in savvy I have seen no example of the following:

#[extendr]
impl Person {
  pub fn name(&self) -> String { … } 
}

Instead

#[savvy]
impl Person {
   pub fn name(&self) -> savvy::Sexp { … }
}

So for savy all conversions happen in the rust part of the code, thereby removing type safety from the binding interface. Which in my opinion is very undesirable!

CGMossa commented 3 months ago

Good point with returning Result. I must admit, I've seen other FFI packages that does this, and it seems very doable.

Found it: I suggest using this issue for asking which things you want us to prioritise (at October and until the end of the year). See roadmap issue here https://github.com/extendr/extendr/issues/783

I've got my own little list, that I go through, but it would be nice to get a prioritised list by actual users.

Awesome! I'm very happy with this input.

eitsupi commented 3 months ago

The only reason to use Robj directly is when excepting different kinds of R datatypes in the same function call and dynamically converting to something that polars understands.

I believe this was related to the fact that extendr cannot handle Result. See extendr/extendr#278.

So for savy all conversions happen in the rust part of the code, thereby removing type safety from the binding interface. Which in my opinion is very undesirable!

I'm sorry, but I don't understand what your point of view is.

ju6ge commented 3 months ago

I believe this was related to the fact that extendr cannot handle Result. See extendr/extendr#278.

r-polars is already using a feature flag that allows returning Result in function interface: https://github.com/pola-rs/r-polars/blob/086d6df8a925949d9d61634d44af0f68ff667735/src/rust/Cargo.toml#L39-L42

So result types currently return an r dictionary with, two fields: ok and err if err is NULL than no error happend and ok contains the typed rust value. Otherwise err contains the error message. r-polars also already has unwrapping code to unwrap the extendr_result class and throwing an R error if the result contains an Error. https://github.com/pola-rs/r-polars/blob/086d6df8a925949d9d61634d44af0f68ff667735/R/error_conversion.R#L34-L44

The unwrapping of Results is annoying, it would be much nicer if extendr could just do this automatically and trigger an error on the r side of things, instead of requiring the library authors to handle this manually.

I don't see any reason why passing this result into a function again would make any sense, so using Robj in case a result is returned from a functions seems like a bad idea.

I'm sorry, but I don't understand what your point of view is.

The Problem here is that when a functions returns Sexpr in rust it looses all type information behind the return value. (Of the value in Sexpr). This means that chaining rust functions on the rust side becomes a lot more complicated, because passing Sexpr between functions would be very cumbersome. And it requires manual conversion within the library code. It is much nicer to use only rust types on the rust side of things and letting extendr to the heavy work for converting to and from r. That beeing said, it is really a pitty that extendr does not detect bad transformations anymore. But that is something @CGMossa is willing to implement again, just better that the previous versions, that also did so via String comparision just on the extendr side of things.

eitsupi commented 3 months ago

I don't see any reason why passing this result into a function again would make any sense, so using Robj in case a result is returned from a functions seems like a bad idea.

I don't know much about that feature as it is only used in r-polars, but I believe that the reason all functions now take Robj as an argument is because otherwise they would not be able to return a proper Result. See #233 or something.

This means that chaining rust functions on the rust side becomes a lot more complicated, because passing Sexpr between functions would be very cumbersome.

What does that have to do with r-polars? r-polars is a wrapper for polars, so basically there should be no situation where r-polars functions are chained together, right?

ju6ge commented 3 months ago

I don't know much about that feature as it is only used in r-polars, but I believe that the reason all functions now take Robj as an argument is because otherwise they would not be able to return a proper Result. See #233 or something.

This does not make any sense. Nothing in that merge request touched the functions signatures on the rust side of things. At least from the diff Robj was used as function input before this. So Robj was there before. I think it mostly comes down to r-polars using extendr from very early on and no ever changing the implementation.

What does that have to do with r-polars? r-polars is a wrapper for polars, so basically there should be no situation where r-polars functions are chained together, right?

I am just saying that savvy increases the complexity of the function interface for native rust types. I don't see why this would be beneficial and that it will make calling functions within the rust side of r-polars more complicated. You are corrects that this might not be that relevant since r-polars mostly just wraps polars. But there might be edge cases where this will be relevant.

eitsupi commented 3 months ago

I am just saying that savvy increases the complexity of the function interface for native rust types.

If you think so, I suggest you open the issue on https://github.com/yutannihilation/savvy. That should have a savvy philosophy. https://yutannihilation.github.io/savvy/guide/key_ideas.html

In any case, the likelihood that I will be persuaded by you to use extendr seems infinitesimally small. However, I don't think anyone will stop you from rewriting r-polars using extendr and it would be great if you could work with the extendr community to accomplish that. Needless to say I have been too busy lately to complete the savvy based rewrite, so if you could do a better rewrite than I did it would be greatly appreciated.

eitsupi commented 3 months ago

This does not make any sense. Nothing in that merge request touched the functions signatures on the rust side of things. At least from the diff Robj was used as function input before this. So Robj was there before. I think it mostly comes down to r-polars using extendr from very early on and no ever changing the implementation.

Actually only @sorhawell knows why this is being used, but since @sorhawell is no longer active we can only read @sorhawell's comments. https://github.com/pola-rs/r-polars/pull/417#discussion_r1355803824

ju6ge commented 3 months ago

In any case, the likelihood that I will be persuaded by you to use extendr seems infinitesimally small. However, I don't think anyone will stop you from rewriting r-polars using extendr and it would be great if you could work with the extendr community to accomplish that. Needless to say I have been too busy lately to complete the savvy based rewrite, so if you could do a better rewrite than I did it would be greatly appreciated.

Well I you are not the only person with a stake in r-polars, so I don't need to persuade you of anything. If you wish to go forward with an rewrite with savvy you can do so. I have provided my opinion on this here so that everybody can make an decision based on a diverse set of perspectives. Also you are the one doing the rewrite, I could just fork and keep it going from here if I really needed to. So the burden of proof that savvy actually is better for this use case than extendr lies on you. I would rather update the codebase to use the newest version of extendr with all of its features. I have made my case.

eitsupi commented 3 months ago

Can the latest version of extendr adopt the same directory structure as py-polars (which is now polars-python)? No, right? If so, it would be less maintenance work to adopt savvy.

Maybe in a few years, we will be able to do the same thing with extendr, but we can think about it then, can't we?

It is difficult to maintain the current code base; problems like https://github.com/pola-rs/r-polars/pull/1183#issuecomment-2295253803 are frequent.

ju6ge commented 3 months ago

an the latest version of extendr adopt the same directory structure as py-polars (which is now polars-python)? No, right?

I don't see why directory structure would matter here. Python is different from R in how libraries work. The important part is that all classes from polars can be exported. That will stay the same no matter how you generate the bindings. The cruicial part is the conversion from r to rust. This is the part that is convoluted here. And R dynamic syntax is what makes this hard. Replacing the way the bindings are generated does nothing here. You still end up with a ffi interface that needs to be called with the correct data. Maybe savvy can do this better, but I don't think so.

eitsupi commented 3 months ago

I don't see why directory structure would matter here.

This is because r-polars is a clone of python-polars, and it is far easier to track updates to python-polars if they have the same file structure. This is my conclusion after months of updating r-polars. I just want to adopt savvy instead of extendr to make my work easier. Please check #1126.

GuillaumePressiat commented 2 months ago

Hi,

I regularly follow this r-polars repo because of my works with both R and Python. Coming from R I have found py-polars to be exactly what I was waiting for since years in Python (dataframe library with a dplyr philosophy). I've also began to code in Rust thanks to pola.rs.

From an external point of view I've found r-polars not so easy, mainly due to the pl$... syntax. It's an OOP question, S3 vs S4 vs R6... Speaking about complete rewrite, have you think about a way to bypass the $ in your proposal? I've seen tidypolars and I'm not talking about that. Complete rewrite with a dplyr/pola.rs coding experience? I mean have you considered another constructor than $? @ will almost do the same albeit with a better readibility? I've think about it and it's not an easy question. Maybe this reflexion is somewhere here in your repo...

and of course thanks for your work!

etiennebacher commented 2 months ago

Hi there!

From an external point of view I've found r-polars not so easy, mainly due to the pl$... syntax.

I agree that's probably deterring some people from using the "raw" polars syntax directly in R, but I think it's also useful to be able to quickly compare Python and R code. Basically, except for some differences between languages (e.g. using c() instead of []), being able to only replace $ by . (or vice-versa) is very convenient.

@ will almost do the same albeit with a better readibility?

Not convinced about that. @ is used mostly by S4-based packages. It seems most of them are in Bioconductor and plenty of people (including myself) never deal with that. $ is used in all packages built on R6, including very popular ones like shiny (I think).

Using S3 instead of $ or @ to chain operations would create massive namespace conflicts with the base library and other packages since polars basically reimplements a bunch of basic functions, and not all of those functions are S3 generics.

I've seen tidypolars and I'm not talking about that

Why not? The point is precisely to use dplyr and other tidyverse packages syntax but with polars under the hood. Isn't this what you're looking for?

I think the current "polars ecosystem" in R makes some sense: polars for an almost identical experience as in python, tidypolars to use the tidyverse syntax, and polarssql for SQL syntax. That said, let us know if something is unclear to you and curious to see your opinion on that :)

GuillaumePressiat commented 2 months ago

Many thanks for you answer.

There are different point of views :)

My question was not really about technical aspects but more about user experience. Whereas with pandas Python code was not neat, with pola.rs Python code is now easy to read.

For me polars's expressivness leads to focus on it (all verbs are in it, filter, group_by, with_columns, etc.). I agree that it's good to keep its syntax. My opinion is that for readability purpose in R, I don't like too much $ in R code. This symbol is vertical and breaks lines visually. So my question with @ which is more like a ligation between terms. In dplyr there are not $ and it was sort of a revolution in R. A good one with neat code. For code writers and also code readers (first time we see it in doc we don't write it). In summary, I am wondering if r-polars can by itself bring another revolution to R but without $ $. We see it and we say: it's different and I like it.

(
  as_polars_df(iris)
      $with_columns((pl
                $col('Sepal.Length') * 2)
                $alias('Sepal.Length.Bigger'))
)

A dumb proposal may be with @:

(
  as_polars_df(iris)
      @with_columns(
                (pl@col('Sepal.Length') * 2)
                 @alias('Sepal.Length.Bigger'))
)

It also could be nice not to have to write pl$col and just col inside with_columns by sort of inheritance? Find a way to skip col R function in this context.

About tidypolars my thinking is: it seems really a big work to keep tidypolars wired with r-polars evolutions forwarded by pola.rs ones... I mean evolutions are almost unpredictable in such a vast ecosystem. Maybe I'm not too fan of under the hood :) But Rust pola.rs's seems more and more stable now, as data.table / dtplyr it's surely a good lead.

Again there are different point of views! In summary, by no mean R api could be the same as Python api.

Thanks for reading.

etiennebacher commented 2 months ago

My opinion is that for readability purpose in R, I don't like too much $ in R code. This symbol is vertical and breaks lines visually

I get your point, but we're not going to break all the polars code out there for what is in the end a matter of personal preferences. Some people prefer chaining functions with $ as in R6-based packages.

It also could be nice not to have to write pl$col and just col inside with_columns by sort of inheritance? Find a way to skip col R function in this context.

It would be nice, but that introduces complexity since we'd need to understand when we are in a Polars context and when we're not. This might be trivial when we can detect that we're inside a with_columns() call, but in some cases we create expressions outside a call and then execute them inside so preventing conflicts with the base function col() is not as easy as it seems. Why not in the future, but for now there are many other priorities in r-polars.

About tidypolars my thinking is: it seems really a big work to keep tidypolars wired with r-polars evolutions forwarded by pola.rs ones...

Implementing it was some work, yes, but now most dplyr and tidyr functions are translated and I don't think they'll add many more in the future. Of course, there are still tons of functions from other packages that could be translated, but I'm not afraid of breaking changes in pola.rs affecting tidypolars too much. Basically every tidypolars release contains a few tweaks to accommodate pola.rs changes but nothing major (and I keep the polars and tidypolars releases quite synchronized).

Anyway, thanks for the feedback and don't hesitate if you have other questions or bug reports

eitsupi commented 1 week ago

@GuillaumePressiat I think we will use the S7 class in the future for various advantages, but in S7, @ is reserved for properties and not for storing methods.

GuillaumePressiat commented 6 days ago

Interesting and somewhat a related topic indeed. Thanks for this @eitsupi!

pola-rs / r-polars