Closed eitsupi closed 3 weeks ago
Some context is found in #326
perhaps rb_list_to_df can be removed.
If it doesn't have any obvious perf benefits it can be removed
Seems to me also polars:::rb_list_to_df
is old and can be removed together with any subsequent dead code, as warned by the rust compiler. Superseded by arrow_to_rdf()
however
arrow_stream_to_df
, export_df_to_arrow_stream
and new_arrow_stream
are called from extendr-polars repository they are used to zero copy share polars DataFrames between to different polars compilation units (R/rust crates with polars).
I would keep arrow_stream_to_series
even though not being used on R side, as it is actually the general case of arrow_stream_to_df
both calls arrow_stream_to_series_internal
Thanks, but I suspect that using a set of functions from the nanoarrow
package would be better than using my own solution of passing the pointer as str.
Oh, sorry, I didn't realize that nanoarrow supports passing pointers as str. I'd like to do some refactoring after #730 is merged.
@sorhawell This causes segmentation fault.
stream_out <- nanoarrow::as_nanoarrow_array_stream(mtcars)
polars:::arrow_stream_to_series(nanoarrow::nanoarrow_pointer_addr_chr(stream_out))
In general, I vote to remove this unless you write tests for this, as it is not possible to maintain a function that has not been tested. Also, as far as I can see, the only advantage of writing this on the Rust side is that nanoarrow is not needed, and I feel that using nanoarrow is advantageous because it is easier to maintain.
Thinking about it again, I opened the issue pola-rs/polars#14208 because there is no need to implement it here if there is the Arrow C interface in the DataFrame of Rust Polars itself.
@sorhawell This causes segmentation fault.
stream_out <- nanoarrow::as_nanoarrow_array_stream(mtcars) polars:::arrow_stream_to_series(nanoarrow::nanoarrow_pointer_addr_chr(stream_out))
In general, I vote to remove this unless you write tests for this, as it is not possible to maintain a function that has not been tested. Also, as far as I can see, the only advantage of writing this on the Rust side is that nanoarrow is not needed, and I feel that using nanoarrow is advantageous because it is easier to maintain.
Yes you can very well achieve an UB crash such as segfault.
This is because:
The arrow C stream and C data interface are very unsafe. There are exact rules to follow or bad things happen.
The protocols is an orchestrated ping-pong between "producer" and "consumer" where each step on its own is unsafe.
arrow_stream_to_series is a private function which is one step in the exchange and is by it self not safe. I think it is possible to segfault with private functions from nanoarrow also.
Only the full implementation of all steps can be safe. The simple integration test is placed in extendr-polars which is annoying if not run in CI. Maybe there could be written some unit test where r-polars is both the producer and consumer.
Btw I know it is very desirable to move as much as possible to rust-polars but it is tricky to achieve as far as I know.
nanoarrow is very convenient and move a burden away from r-polars maintainers. So hurray for nanoarrow. There are some arrow types that rust-polars implement differently. I would test for if various types are supported.
Since I'm not active currently, I fully understand ditching any code that will not be further developed in a near future and which is not tested in the CI.
Some context can be found in #64
While looking into the Arrow implementation, I noticed that this package has several functions related to the Arrow C interface.
The current publicly available function for creating DataFrames from arrow Tables is
polars:::arrow_to_rdf
(ported from Python Polars) written in R insidepolars::as_polars_df
(formerlypolars::pl$from_arrow
), but here is a function written in Rust with almost the same functionality.https://github.com/pola-rs/r-polars/blob/1a6521c5a9622d432b8676bf45782515b2059863/R/extendr-wrappers.R#L41
This function does not appear to be used anywhere.
Presumably this is a function created for experimental purposes and used in the benchmarks stored below.
https://github.com/pola-rs/r-polars/blob/1a6521c5a9622d432b8676bf45782515b2059863/inst/misc/examples/from_arrow.R
I ran this script with the current development version (on #730) and found that
pl$from_arrow
andrb_list_to_df
are about the same speed, so perhapsrb_list_to_df
can be removed.It makes sense that
arrow_to_rdf
is copied from Python Polars, which is actively improving its speed, and is as fast or faster thanrb_list_to_df
.There are also
https://github.com/pola-rs/r-polars/blob/1a6521c5a9622d432b8676bf45782515b2059863/R/extendr-wrappers.R#L47-L51