Thanks for the report. Unfortunately, I have this issue when I try to install rpy2: https://github.com/rpy2/rpy2/issues/1044. Downgrading pip and rpy2 didn't solve it, so I can't reproduce this bug.
Hi, rpy2 is mostly not supported on Windows, unless you use WSL.
I should have access to a machine with Ubuntu in the next few days, I'll see then (unless @eitsupi deals with this first)
Thanks for the report. This is curious because pyarrow seems to read the Arrow file with no problem.
>>> import polars as pl
>>> import pyarrow.feather as feather
>>>
>>> pl.DataFrame({'a': ['wx', 'yz', 'wx']}, schema = {'a': pl.Categorical}).write_ipc("test.arrow")
>>> feather.read_table("test.arrow")
pyarrow.Table
a: dictionary<values=large_string, indices=uint32, ordered=0>
----
a: [ -- dictionary:
["wx","yz"] -- indices:
[0,1,0]]
> arrow::read_ipc_file("test.arrow")
Error: Cannot convert Dictionary Array of type `dictionary<values=large_string, indices=uint32, ordered=0>` to R
> polars::pl$scan_ipc("test.arrow")$collect()
shape: (3, 1)
┌─────┐
│ a │
│ --- │
│ cat │
╞═════╡
│ wx │
│ yz │
│ wx │
└─────┘
I don't know what's in rpy2arrow.pyarrow_table_to_r_table, but is it possible that it goes through the R arrow package?
Maybe related to apache/arrow#39603
It seems to come from here: https://github.com/apache/arrow/blob/05b8f366e17ee6f21df4746bb6a65be399dfb68d/r/R/arrowExports.R#L311-L313
> arrow::read_ipc_file("test.arrow", as_data_frame = FALSE) |> polars::as_polars_df()
*** caught segfault ***
address 0x28, cause 'memory not mapped'
Traceback:
1: (function (array, array_ptr, schema_ptr) { invisible(.Call(`_arrow_ExportArray`, array, array_ptr, schema_ptr))})(<environment>, <pointer: 0x7fe103882000>, <pointer: 0x7fe103882050>)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Hmmm, the conversion between R arrow and pyarrow seems to be fine, so it may be a problem with this package.
> at <- arrow::read_ipc_file("test.arrow", as_data_frame = FALSE)
> reticulate::r_to_py(at)
pyarrow.Table
a: dictionary<values=large_string, indices=uint32, ordered=0>
----
a: [ -- dictionary:
["wx","yz"] -- indices:
[0,1,0]]
> reticulate::r_to_py(at) |> reticulate::py_to_r()
Table
3 rows x 1 columns
$a <dictionary<values=large_string, indices=uint32>>
The conversion from a chunk to a Series seems to be working well.
> polars::.pr$Series$from_arrow("foo", at$a$chunks[[1]])
$ok
polars Series: shape: (3,)
Series: 'foo' [cat]
[
"wx"
"yz"
"wx"
]
$err
NULL
attr(,"class")
[1] "extendr_result"
Also works.
> rbr <- arrow::read_ipc_file("test.arrow", as_data_frame = FALSE) |> arrow::as_record_batch_reader()
> polars::.pr$DataFrame$from_arrow_record_batches(rbr$batches())
$ok
shape: (3, 1)
┌─────┐
│ a │
│ --- │
│ cat │
╞═════╡
│ wx │
│ yz │
│ wx │
└─────┘
$err
NULL
attr(,"class")
[1] "extendr_result"
Crashes.
> at <- arrow::read_ipc_file("test.arrow", as_data_frame = FALSE)
> polars:::arrow_to_rdf(at)
*** caught segfault ***
address 0x28, cause 'memory not mapped'
Traceback:
1: (function (array, array_ptr, schema_ptr) { invisible(.Call(`_arrow_ExportArray`, array, array_ptr, schema_ptr))})(<environment>, <pointer: 0x7f9cba482000>, <pointer: 0x7f9cba482050>)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
This is probably what is causing the problem.
> at$a |> polars:::is_arrow_dictonary()
[1] TRUE
> at$a |> polars:::arrow_to_rseries_result("foo", values = _, rechunk = TRUE)
*** caught segfault ***
address 0x28, cause 'memory not mapped'
Traceback:
1: (function (array, array_ptr, schema_ptr) { invisible(.Call(`_arrow_ExportArray`, array, array_ptr, schema_ptr))})(<environment>, <pointer: 0x7fc9da882000>, <pointer: 0x7fc9da882050>)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Related to #497
Hi @lgautier, this bug has been fixed in the main branch and will be included in the next release.
Since the next release will include the deprecation of pl$from_arrow (#728), we would appreciate it if you would use as_polars_df instead of pl$from_arrow.
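For reference, a rough sketch of the suggested migration (assuming the fixed version of the polars R package and the test.arrow file written in the Python example earlier in this thread; this is not quoted verbatim from the maintainers):

# Sketch only: requires a polars build that includes the fix from main.
library(polars)

at <- arrow::read_ipc_file("test.arrow", as_data_frame = FALSE)

# Deprecated in the upcoming release (#728):
# df <- pl$from_arrow(at)

# Recommended replacement:
df <- as_polars_df(at)
print(df)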
Thanks @eitsupi. That was quick!
Three questions:
- Is there a binary build I can already try?
Unfortunately, we haven't built any binaries on the main branch (I'll add CI next time), so there are no binaries to try right away.
- Will the next release be 0.13 rather than 0.12.3 (since deprecation would break the API)?
The next release will be 0.13.0.
There are currently few breaking changes (see the NEWS.md file), but there is a possibility that Rust Polars will be updated before release, which may introduce more breaking changes.
- Do you have a rough date estimate for that next release?
Probably a few days to a week?
@etiennebacher Any thoughts on the next release? To be honest, I don't have the energy to update Rust Polars right now, so I think it's okay to release 0.13.0 right away and hold off on updating Rust Polars until 0.13.x or 0.14.0.
I think the next rust-polars release will not happen for at least 2-3 weeks (but it's a bit uncertain, of course).
On our side, I think we could make a new release in 1-2 weeks. I'd like to tackle the envvars handling so that we can release 0.13.0 with good docs for envvars and options, but the next week is going to be quite busy for me. In any case, we have enough stuff to release, so the next rust-polars update could go in 0.14.0.
@lgautier The new version binaries can be installed from R-universe for now.
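For example, something along these lines should work (the repository URL follows the usual R-universe layout and is my assumption, not quoted from the thread):

# Sketch: install the development build of polars from R-universe.
install.packages("polars", repos = "https://rpolars.r-universe.dev")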
Thanks a lot for the update. I meant to add last week that a release in several weeks works, and that in the meantime binary builds for main (even if only at a nightly frequency) would help rpy2-arrow prepare for it.
The issue appears rather specific to the combination of:
- a categorical column in the DataFrame
- going through the path Python polars -> Python Arrow Table -> R Arrow Table -> R polars
The smallest example I have figured out to demonstrate the issue uses rpy2-arrow (https://github.com/rpy2/rpy2-arrow). The following goes through 3 different combinations of paths and column types, until a fourth one that fails with a segfault.
This is the backtrace when running it through a C debugger: