pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/
Other
405 stars 35 forks source link

There are multiple functions that do the same thing (for arrow) #732

Closed eitsupi closed 3 weeks ago

eitsupi commented 4 months ago

Some context can be found in #64

While looking into the Arrow implementation, I noticed that this package has several functions related to the Arrow C interface.

The current publicly available function for creating DataFrames from arrow Tables is polars:::arrow_to_rdf (ported from Python Polars) written in R inside polars::as_polars_df (formerly polars::pl$from_arrow), but here is a function written in Rust with almost the same functionality.

https://github.com/pola-rs/r-polars/blob/1a6521c5a9622d432b8676bf45782515b2059863/R/extendr-wrappers.R#L41

This function does not appear to be used anywhere.

Presumably this is a function created for experimental purposes and used in the benchmarks stored below.

https://github.com/pola-rs/r-polars/blob/1a6521c5a9622d432b8676bf45782515b2059863/inst/misc/examples/from_arrow.R

I ran this script with the current development version (on #730) and found that pl$from_arrow and rb_list_to_df are about the same speed, so perhaps rb_list_to_df can be removed.

> source("/workspaces/r-polars/inst/misc/examples/from_arrow.R", encoding = "UTF-8")

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

total n chunks: 475# A tibble: 4 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 to_arrow           655.57ms    1.44s     0.518  915.77KB     1.45    10    28     19.31s <NULL>
2 from_arrow            827ms    1.74s     0.547    3.15MB     1.75    10    32     18.27s <NULL>
3 from_arrow_no_rec… 654.13ms 746.89ms     1.31   959.62KB     4.33    10    33      7.62s <NULL>
4 from_arrow_all_se…    1.08s    1.13s     0.827 1011.25KB     3.97    10    48     12.09s <NULL>
# ℹ 3 more variables: memory <list>, time <list>, gc <list>
# A tibble: 4 × 13
  expression   min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 to_arrow   140ms 282.19ms     3.64       39KB    0.364    10     1      2.75s <NULL> <Rprofmem>
2 from_arrow 130ms 165.84ms     6.03     86.8KB    3.02     10     5      1.66s <NULL> <Rprofmem>
3 from_arro… 915ms    1.13s     0.869   875.5KB    4.17     10    48      11.5s <NULL> <Rprofmem>
4 to_arrow2  130ms  176.8ms     5.43     38.8KB    0.543    10     1      1.84s <NULL> <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>
There were 50 or more warnings (use warnings() to see the first 50)

It makes sense that arrow_to_rdf is copied from Python Polars, which is actively improving its speed, and is as fast or faster than rb_list_to_df.

There are also

https://github.com/pola-rs/r-polars/blob/1a6521c5a9622d432b8676bf45782515b2059863/R/extendr-wrappers.R#L47-L51

eitsupi commented 4 months ago

Some context is found in #326

etiennebacher commented 4 months ago

perhaps rb_list_to_df can be removed.

If it doesn't have any obvious perf benefits it can be removed

sorhawell commented 4 months ago

Seems to me also polars:::rb_list_to_df is old and can be removed together with any subsequent dead code, as warned by the rust compiler. Superseded by arrow_to_rdf()

however arrow_stream_to_df, export_df_to_arrow_stream and new_arrow_stream are called from extendr-polars repository they are used to zero copy share polars DataFrames between to different polars compilation units (R/rust crates with polars).

sorhawell commented 4 months ago

I would keep arrow_stream_to_series even though not being used on R side, as it is actually the general case of arrow_stream_to_df

both calls arrow_stream_to_series_internal

eitsupi commented 4 months ago

Thanks, but I suspect that using a set of functions from the nanoarrow package would be better than using my own solution of passing the pointer as str.

See also https://github.com/JosiahParry/arrow-extendr

eitsupi commented 4 months ago

Oh, sorry, I didn't realize that nanoarrow supports passing pointers as str. I'd like to do some refactoring after #730 is merged.

eitsupi commented 4 months ago

@sorhawell This causes segmentation fault.

stream_out <- nanoarrow::as_nanoarrow_array_stream(mtcars)
polars:::arrow_stream_to_series(nanoarrow::nanoarrow_pointer_addr_chr(stream_out))

In general, I vote to remove this unless you write tests for this, as it is not possible to maintain a function that has not been tested. Also, as far as I can see, the only advantage of writing this on the Rust side is that nanoarrow is not needed, and I feel that using nanoarrow is advantageous because it is easier to maintain.

eitsupi commented 4 months ago

Thinking about it again, I opened the issue pola-rs/polars#14208 because there is no need to implement it here if there is the Arrow C interface in the DataFrame of Rust Polars itself.

sorhawell commented 3 months ago

@sorhawell This causes segmentation fault.

stream_out <- nanoarrow::as_nanoarrow_array_stream(mtcars)
polars:::arrow_stream_to_series(nanoarrow::nanoarrow_pointer_addr_chr(stream_out))

In general, I vote to remove this unless you write tests for this, as it is not possible to maintain a function that has not been tested. Also, as far as I can see, the only advantage of writing this on the Rust side is that nanoarrow is not needed, and I feel that using nanoarrow is advantageous because it is easier to maintain.

Yes you can very well achieve an UB crash such as segfault.

This is because:

The arrow C stream and C data interface are very unsafe. There are exact rules to follow or bad things happen.

The protocols is an orchestrated ping-pong between "producer" and "consumer" where each step on its own is unsafe.

arrow_stream_to_series is a private function which is one step in the exchange and is by it self not safe. I think it is possible to segfault with private functions from nanoarrow also.

Only the full implementation of all steps can be safe. The simple integration test is placed in extendr-polars which is annoying if not run in CI. Maybe there could be written some unit test where r-polars is both the producer and consumer.

Btw I know it is very desirable to move as much as possible to rust-polars but it is tricky to achieve as far as I know.

nanoarrow is very convenient and move a burden away from r-polars maintainers. So hurray for nanoarrow. There are some arrow types that rust-polars implement differently. I would test for if various types are supported.

Since I'm not active currently, I fully understand ditching any code that will not be further developed in a near future and which is not tested in the CI.