pola-rs / r-polars

Polars R binding
https://pola-rs.github.io/r-polars/

Segfault when a `DataFrame` goes through Python polars -> Python Arrow Table -> R Arrow Table -> R polars and the column type is Category #725

Closed lgautier closed 9 months ago

lgautier commented 9 months ago

The issue appears rather specific to the combination of:

The smallest example I have found to demonstrate the issue uses rpy2-arrow (https://github.com/rpy2/rpy2-arrow). It goes through three combinations of conversion paths and column types that work, followed by a fourth that fails with a segfault.

import rpy2.robjects as ro
import rpy2_arrow.arrow as rpy2arrow
import polars

_pl_from_arrow = ro.r('polars::pl$from_arrow')

def rarrow_to_rpolars(r_arrow_table: ro.Environment) -> ro.Environment:
    """ Take an Arrow Table as wrapped by the R package "arrow",
    and return an R RPolarsDataFrame for that table."""
    with ro.default_converter.context():
        print("Calling R's polars::pl$from_arrow()")
        res = _pl_from_arrow(r_arrow_table)
        print('Done.')
        return res

def pypolars_to_rpolars(dataf: polars.DataFrame) -> ro.Environment:
    """ Take a Python polars.DataFrame, and return an R
    RPolarsDataFrame.
    """
    py_arrow_table = dataf.to_arrow()
    r_arrow_table = rpy2arrow.pyarrow_table_to_r_table(py_arrow_table)
    return rarrow_to_rpolars(r_arrow_table)

# Create an R Arrow table with a string column.
print('R Arrow -> R polars (strings).')
r_arrow_dataf = ro.r("""
library(arrow)
tbl <- arrow::arrow_table(
  data.frame(a = I(c("wx", "yz", "wx")))
)
tbl
""")
# Conversion to an R RPolarsDataFrame works.
res = rarrow_to_rpolars(r_arrow_dataf)

# Create a Python polars.DataFrame with the same table content
# (strings). The conversion to an RPolarsDataFrame works.
print('Python polars -> R polars (string).')
py_polars_dataf = polars.DataFrame({'a': ['wx', 'yz', 'wx']})
r_polars_dataf = pypolars_to_rpolars(py_polars_dataf)

# Create an R Arrow table with a column of R type factor, which
# will become an Arrow DictionaryType[str, int].
print('R Arrow -> R polars (Category).')
r_arrow_dataf = ro.r("""
library(arrow)
tbl <- arrow::arrow_table(
  data.frame(a = factor(c("wx", "yz", "wx")))
)
tbl
""")
# The conversion to an R RPolarsDataFrame works.
res = rarrow_to_rpolars(r_arrow_dataf)

# Create a Python polars.DataFrame with the same table content
# (Categorical/DictionaryType[str, int]).
print('Python polars -> R polars (Category).')
py_polars_dataf = polars.DataFrame({'a': ['wx', 'yz', 'wx']},
                                   schema = {'a': polars.Categorical})
# The conversion to an RPolarsDataFrame segfaults.
r_polars_dataf = pypolars_to_rpolars(py_polars_dataf)
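
Because the fourth step kills the whole interpreter, it can be convenient to run each conversion path in a child process so that the remaining paths still get exercised. The following is a minimal sketch of that idea and is not part of rpy2-arrow or polars: the helper name `run_isolated` is hypothetical, and it assumes a POSIX system, where a negative return code is the number of the signal that terminated the child.

```python
import signal
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 60.0) -> str:
    """Run `snippet` in a fresh Python process and classify the outcome.

    Returns 'ok', 'segfault', or 'error'. Hypothetical helper, POSIX only.
    """
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok"
    # On POSIX, subprocess reports death-by-signal as a negative
    # return code equal to minus the signal number.
    if proc.returncode == -signal.SIGSEGV:
        return "segfault"
    return "error"
```

Each of the four paths above could then be passed as a source string to `run_isolated`, so that the segfaulting path is reported as `'segfault'` instead of terminating the session.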

This is the backtrace when running it through a C debugger:

Calling R's polars::pl$from_arrow()

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffd1ccfe9a in arrow::ExportType(arrow::DataType const&, ArrowSchema*) () from /usr/local/packages/R/4.3/lib/R/library/arrow/libs/arrow.so
(gdb) back
#0  0x00007fffd1ccfe9a in arrow::ExportType(arrow::DataType const&, ArrowSchema*) ()
   from /usr/local/packages/R/4.3/lib/R/library/arrow/libs/arrow.so
#1  0x00007fffd1cd014b in arrow::ExportArray(arrow::Array const&, ArrowArray*, ArrowSchema*) ()
   from /usr/local/packages/R/4.3/lib/R/library/arrow/libs/arrow.so
#2  0x00007fffd168fe60 in ExportArray (array=std::shared_ptr<arrow::Array> (use count 2, weak count 0) = {...}, 
    array_ptr=..., schema_ptr=..., schema_ptr@entry=...) at bridge.cpp:131
#3  0x00007fffd1652d9b in _arrow_ExportArray (array_sexp=<optimized out>, array_ptr_sexp=0x555560ee8e98, 
    schema_ptr_sexp=<optimized out>) at arrowExports.cpp:668
#4  0x00007ffff6d0120e in R_doDotCall (fun=fun@entry=0x7fffd1652d00 <_arrow_ExportArray(SEXP, SEXP, SEXP)>, 
    nargs=<optimized out>, cargs=cargs@entry=0x7ffffffa3f40, call=call@entry=0x55555b24b878) at dotcode.c:874
#5  0x00007ffff6d440a3 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:8002
#6  0x00007ffff6d58b30 in Rf_eval (e=0x55555b24b990, rho=rho@entry=0x555560ee89c8) at eval.c:1013
#7  0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x555560ee8c30, newrho=newrho@entry=0x555560ee89c8, 
    sysparent=<optimized out>, rho=rho@entry=0x555555d41e68, arglist=arglist@entry=0x555560ee8b88, 
    op=op@entry=0x55555b24bd80) at eval.c:2187
#8  0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x555560ee8c30, op=op@entry=0x55555b24bd80, 
    arglist=arglist@entry=0x555560ee8b88, rho=rho@entry=0x555555d41e68, suppliedvars=<optimized out>) at eval.c:2113
#9  0x00007ffff6d58c5c in Rf_eval (e=0x555560ee8c30, rho=0x555555d41e68) at eval.c:1140
#10 0x00007ffff6ce70c2 in protectedEval (d=d@entry=0x7ffffffa4ae0) at context.c:851
#11 0x00007ffff6ce8a1a in R_ToplevelExec (fun=fun@entry=0x7ffff6ce70a0 <protectedEval>, data=data@entry=0x7ffffffa4ae0)
    at context.c:799
#12 0x00007ffff6ce8a8d in R_tryEval (e=<optimized out>, env=<optimized out>, ErrorOccurred=0x7ffffffa4b18) at context.c:865
#13 0x00007fffbc8b1bb5 in extendr_api::robj::operators::Operators::call::{{closure}} ()
   from /usr/local/packages/R/4.3/lib/R/library/polars/libs/polars.so
#14 0x00007fffbc8b1aac in extendr_api::robj::operators::Operators::call ()
   from /usr/local/packages/R/4.3/lib/R/library/polars/libs/polars.so
#15 0x00007fffbc9d647a in r_polars::arrow_interop::to_rust::arrow_array_to_rust ()
   from /usr/local/packages/R/4.3/lib/R/library/polars/libs/polars.so
#16 0x00007fffbcb596c1 in wrap__RPolarsSeries__from_arrow ()
   from /usr/local/packages/R/4.3/lib/R/library/polars/libs/polars.so
#17 0x00007ffff6d0122a in R_doDotCall (fun=fun@entry=0x7fffbcb59520 <wrap__RPolarsSeries__from_arrow>, 
    nargs=<optimized out>, cargs=cargs@entry=0x7ffffffa7210, call=call@entry=0x55555b6ac030) at dotcode.c:871
#18 0x00007ffff6d440a3 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:8002
#19 0x00007ffff6d58b30 in Rf_eval (e=0x55555b6ac068, rho=rho@entry=0x555560ee8ed0) at eval.c:1013
#20 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x5555601bbe88, newrho=newrho@entry=0x555560ee8ed0, 
    sysparent=<optimized out>, rho=rho@entry=0x555560d42e70, arglist=arglist@entry=0x555560ee8fe8, 
    op=op@entry=0x55555b6ac110) at eval.c:2187
#21 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x5555601bbe88, op=op@entry=0x55555b6ac110, 
    arglist=arglist@entry=0x555560ee8fe8, rho=rho@entry=0x555560d42e70, suppliedvars=<optimized out>) at eval.c:2113
#22 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
--Type <RET> for more, q to quit, c to continue without paging--c
#23 0x00007ffff6d58b30 in Rf_eval (e=0x5555601b2530, rho=rho@entry=0x555560d42e70) at eval.c:1013
#24 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a2699e0, newrho=newrho@entry=0x555560d42e70, sysparent=<optimized out>, rho=rho@entry=0x555560b268a8, arglist=arglist@entry=0x555560d3f200, op=op@entry=0x5555601b2920) at eval.c:2187
#25 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a2699e0, op=op@entry=0x5555601b2920, arglist=arglist@entry=0x555560d3f200, rho=rho@entry=0x555560b268a8, suppliedvars=<optimized out>) at eval.c:2113
#26 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#27 0x00007ffff6d58b30 in Rf_eval (e=0x55555a26d820, rho=0x555560b268a8) at eval.c:1013
#28 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560d3f708) at eval.c:833
#29 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560d3f5f0, symbol=0x555556d65bd0, value=0x555560d3f708) at eval.c:5467
#30 getvar (symbol=0x555556d65bd0, rho=0x555560d3f5f0, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#31 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#32 0x00007ffff6d58b30 in Rf_eval (e=0x55555a249240, rho=0x555560d3f5f0) at eval.c:1013
#33 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560d3f548) at eval.c:833
#34 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560d3f4a0, symbol=0x555555d53f30, value=0x555560d3f548) at eval.c:5467
#35 getvar (symbol=0x555555d53f30, rho=0x555560d3f4a0, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#36 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#37 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24ef28, rho=0x555560d3f4a0) at eval.c:1013
#38 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560d3f468) at eval.c:833
#39 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560d3f388, symbol=0x555555d53f30, value=0x555560d3f468) at eval.c:5467
#40 getvar (symbol=0x555555d53f30, rho=0x555560d3f388, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#41 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#42 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24e550, rho=0x555560d3f388) at eval.c:1013
#43 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560d3f318) at eval.c:833
#44 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560d3f270, symbol=0x555555d53f30, value=0x555560d3f318) at eval.c:5467
#45 getvar (symbol=0x555555d53f30, rho=0x555560d3f270, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#46 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#47 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24de50, rho=rho@entry=0x555560d3f270) at eval.c:1013
#48 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a24e748, newrho=newrho@entry=0x555560d3f270, sysparent=<optimized out>, rho=rho@entry=0x555560d3f388, arglist=arglist@entry=0x555560d3f2e0, op=op@entry=0x55555a24e1d0) at eval.c:2187
#49 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a24e748, op=op@entry=0x55555a24e1d0, arglist=arglist@entry=0x555560d3f2e0, rho=rho@entry=0x555560d3f388, suppliedvars=<optimized out>) at eval.c:2113
#50 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#51 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24e8d0, rho=rho@entry=0x555560d3f388) at eval.c:1013
#52 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a24f0e8, newrho=newrho@entry=0x555560d3f388, sysparent=<optimized out>, rho=rho@entry=0x555560d3f4a0, arglist=arglist@entry=0x555560d3f430, op=op@entry=0x55555a24ec88) at eval.c:2187
#53 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a24f0e8, op=op@entry=0x55555a24ec88, arglist=arglist@entry=0x555560d3f430, rho=rho@entry=0x555560d3f4a0, suppliedvars=<optimized out>) at eval.c:2113
#54 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#55 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24f190, rho=rho@entry=0x555560d3f4a0) at eval.c:1013
#56 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a249780, newrho=newrho@entry=0x555560d3f4a0, sysparent=<optimized out>, rho=rho@entry=0x555560d3f5f0, arglist=arglist@entry=0x555560d3f510, op=op@entry=0x55555a248c20) at eval.c:2187
#57 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a249780, op=op@entry=0x55555a248c20, arglist=arglist@entry=0x555560d3f510, rho=rho@entry=0x555560d3f5f0, suppliedvars=<optimized out>) at eval.c:2113
#58 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#59 0x00007ffff6d58b30 in Rf_eval (e=0x55555a249898, rho=rho@entry=0x555560d3f5f0) at eval.c:1013
#60 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a269a50, newrho=newrho@entry=0x555560d3f5f0, sysparent=<optimized out>, rho=rho@entry=0x555560b268a8, arglist=arglist@entry=0x555560d3f6d0, op=op@entry=0x55555a249cf8) at eval.c:2187
#61 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a269a50, op=op@entry=0x55555a249cf8, arglist=arglist@entry=0x555560d3f6d0, rho=rho@entry=0x555560b268a8, suppliedvars=<optimized out>) at eval.c:2113
#62 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#63 0x00007ffff6d58b30 in Rf_eval (e=0x55555a259820, rho=rho@entry=0x555560b268a8) at eval.c:1013
#64 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x555559ff0db8, newrho=newrho@entry=0x555560b268a8, sysparent=<optimized out>, rho=rho@entry=0x555560b26db0, arglist=arglist@entry=0x555560b26b10, op=op@entry=0x55555a259c48) at eval.c:2187
#65 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x555559ff0db8, op=op@entry=0x55555a259c48, arglist=arglist@entry=0x555560b26b10, rho=rho@entry=0x555560b26db0, suppliedvars=<optimized out>) at eval.c:2113
#66 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#67 0x00007ffff6d58b30 in Rf_eval (e=0x555559ff14f0, rho=rho@entry=0x555560b26db0) at eval.c:1013
#68 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x555559ff1560, newrho=newrho@entry=0x555560b26db0, sysparent=<optimized out>, rho=rho@entry=0x555560b22cb0, arglist=arglist@entry=0x555555d0a930, op=op@entry=0x555560b228c0) at eval.c:2187
#69 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x555559ff1560, op=op@entry=0x555560b228c0, arglist=arglist@entry=0x555555d0a930, rho=rho@entry=0x555560b22cb0, suppliedvars=<optimized out>) at eval.c:2113
#70 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#71 0x00007ffff6d58b30 in Rf_eval (e=0x555559ff5d28, rho=0x555560b22cb0) at eval.c:1013
#72 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b22380) at eval.c:833
#73 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b222a0, symbol=0x555555d5a128, value=0x555560b22380) at eval.c:5467
#74 getvar (symbol=0x555555d5a128, rho=0x555560b222a0, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#75 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#76 0x00007ffff6d58b30 in Rf_eval (e=0x55555a2586c8, rho=0x555560b222a0) at eval.c:1013
#77 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b27018) at eval.c:833
#78 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b26f70, symbol=0x555555d53f30, value=0x555560b27018) at eval.c:5467
#79 getvar (symbol=0x555555d53f30, rho=0x555560b26f70, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#80 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#81 0x00007ffff6d58b30 in Rf_eval (e=0x55555a25a2d8, rho=0x555560b26f70) at eval.c:1013
#82 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b26f38) at eval.c:833
#83 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b26de8, symbol=0x555555fa9fd0, value=0x555560b26f38) at eval.c:5467
#84 getvar (symbol=0x555555fa9fd0, rho=0x555560b26de8, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#85 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#86 0x00007ffff6d58b30 in Rf_eval (e=0x555555fa9f28, rho=rho@entry=0x555560b26de8) at eval.c:1013
#87 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a25a428, newrho=newrho@entry=0x555560b26de8, sysparent=<optimized out>, rho=rho@entry=0x555560b26f70, arglist=arglist@entry=0x555560b26f00, op=op@entry=0x555555faa040) at eval.c:2187
#88 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a25a428, op=op@entry=0x555555faa040, arglist=arglist@entry=0x555560b26f00, rho=rho@entry=0x555560b26f70, suppliedvars=<optimized out>) at eval.c:2113
#89 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#90 0x00007ffff6d58b30 in Rf_eval (e=0x55555a25a4d0, rho=rho@entry=0x555560b26f70) at eval.c:1013
#91 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a258cb0, newrho=newrho@entry=0x555560b26f70, sysparent=<optimized out>, rho=rho@entry=0x555560b222a0, arglist=arglist@entry=0x555560b26fe0, op=op@entry=0x55555a25a850) at eval.c:2187
#92 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a258cb0, op=op@entry=0x55555a25a850, arglist=arglist@entry=0x555560b26fe0, rho=rho@entry=0x555560b222a0, suppliedvars=<optimized out>) at eval.c:2113
#93 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#94 0x00007ffff6d58b30 in Rf_eval (e=0x55555a258700, rho=0x555560b222a0) at eval.c:1013
#95 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b22230) at eval.c:833
#96 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b22070, symbol=0x555555d5a128, value=0x555560b22230) at eval.c:5467
#97 getvar (symbol=0x555555d5a128, rho=0x555560b22070, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#98 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#99 0x00007ffff6d58b30 in Rf_eval (e=0x5555560b5f30, rho=0x555560b22070) at eval.c:1013
#100 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b21c48) at eval.c:833
#101 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b219a8, symbol=0x555555d5a128, value=0x555560b21c48) at eval.c:5467
#102 getvar (symbol=0x555555d5a128, rho=0x555560b219a8, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#103 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#104 0x00007ffff6d58b30 in Rf_eval (e=0x5555560afb90, rho=0x555560b219a8) at eval.c:1013
#105 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b21900) at eval.c:833
#106 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b21660, symbol=0x555555d5a128, value=0x555560b21900) at eval.c:5467
#107 getvar (symbol=0x555555d5a128, rho=0x555560b21660, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#108 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#109 0x00007ffff6d58b30 in Rf_eval (e=0x5555560b6278, rho=0x555560b21660) at eval.c:1013
#110 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b215b8) at eval.c:833
#111 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b272b8, symbol=0x555555d5a128, value=0x555560b215b8) at eval.c:5467
#112 getvar (symbol=0x555555d5a128, rho=0x555560b272b8, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#113 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#114 0x00007ffff6d58b30 in Rf_eval (e=0x5555560b65f8, rho=rho@entry=0x555560b272b8) at eval.c:1013
#115 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x5555560af1b8, newrho=newrho@entry=0x555560b272b8, sysparent=<optimized out>, rho=rho@entry=0x555560b21660, arglist=arglist@entry=0x555560b21580, op=op@entry=0x555560b21628) at eval.c:2187
#116 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x5555560af1b8, op=op@entry=0x555560b21628, arglist=arglist@entry=0x555560b21580, rho=rho@entry=0x555560b21660, suppliedvars=<optimized out>) at eval.c:2113
#117 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#118 0x00007ffff6d58b30 in Rf_eval (e=0x5555560af8b8, rho=rho@entry=0x555560b21660) at eval.c:1013
#119 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x5555560b0108, newrho=newrho@entry=0x555560b21660, sysparent=<optimized out>, rho=rho@entry=0x555560b219a8, arglist=arglist@entry=0x555560b218c8, op=op@entry=0x555560b21fc8) at eval.c:2187
#120 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x5555560b0108, op=op@entry=0x555560b21fc8, arglist=arglist@entry=0x555560b218c8, rho=rho@entry=0x555560b219a8, suppliedvars=<optimized out>) at eval.c:2113
#121 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#122 0x00007ffff6d58b30 in Rf_eval (e=0x5555560ac868, rho=rho@entry=0x555560b219a8) at eval.c:1013
#123 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x5555560aca60, newrho=newrho@entry=0x555560b219a8, sysparent=<optimized out>, rho=rho@entry=0x555560b22070, arglist=arglist@entry=0x555560b21c10, op=op@entry=0x555560b22038) at eval.c:2187
#124 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x5555560aca60, op=op@entry=0x555560b22038, arglist=arglist@entry=0x555560b21c10, rho=rho@entry=0x555560b22070, suppliedvars=<optimized out>) at eval.c:2113
#125 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#126 0x00007ffff6d58b30 in Rf_eval (e=0x55555609d6f0, rho=rho@entry=0x555560b22070) at eval.c:1013
#127 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a24d3d0, newrho=newrho@entry=0x555560b22070, sysparent=<optimized out>, rho=rho@entry=0x555560b222a0, arglist=arglist@entry=0x555560b221f8, op=op@entry=0x55555609d840) at eval.c:2187
#128 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a24d3d0, op=op@entry=0x55555609d840, arglist=arglist@entry=0x555560b221f8, rho=rho@entry=0x555560b222a0, suppliedvars=<optimized out>) at eval.c:2113
#129 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#130 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24d478, rho=rho@entry=0x555560b222a0) at eval.c:1013
#131 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x555559ff15d0, newrho=newrho@entry=0x555560b222a0, sysparent=<optimized out>, rho=rho@entry=0x555560b22cb0, arglist=arglist@entry=0x555560b22348, op=op@entry=0x55555a24d830) at eval.c:2187
#132 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x555559ff15d0, op=op@entry=0x55555a24d830, arglist=arglist@entry=0x555560b22348, rho=rho@entry=0x555560b22cb0, suppliedvars=<optimized out>) at eval.c:2113
#133 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#134 0x00007ffff6d58b30 in Rf_eval (e=0x555559ff5d60, rho=0x555560b22cb0) at eval.c:1013
#135 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b22850) at eval.c:833
#136 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b22700, symbol=0x555556d65bd0, value=0x555560b22850) at eval.c:5467
#137 getvar (symbol=0x555556d65bd0, rho=0x555560b22700, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#138 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#139 0x00007ffff6d58b30 in Rf_eval (e=0x55555a249240, rho=0x555560b22700) at eval.c:1013
#140 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b22690) at eval.c:833
#141 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b225e8, symbol=0x555555d53f30, value=0x555560b22690) at eval.c:5467
#142 getvar (symbol=0x555555d53f30, rho=0x555560b225e8, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#143 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#144 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24ef28, rho=0x555560b225e8) at eval.c:1013
#145 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b225b0) at eval.c:833
#146 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b224d0, symbol=0x555555d53f30, value=0x555560b225b0) at eval.c:5467
#147 getvar (symbol=0x555555d53f30, rho=0x555560b224d0, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#148 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#149 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24e550, rho=0x555560b224d0) at eval.c:1013
#150 0x00007ffff6d594e4 in forcePromise (e=e@entry=0x555560b22460) at eval.c:833
#151 0x00007ffff6d597c8 in FORCE_PROMISE (keepmiss=FALSE, rho=0x555560b223b8, symbol=0x555555d53f30, value=0x555560b22460) at eval.c:5467
#152 getvar (symbol=0x555555d53f30, rho=0x555560b223b8, dd=<optimized out>, keepmiss=FALSE, vcache=<optimized out>, sidx=<optimized out>) at eval.c:5508
#153 0x00007ffff6d44b72 in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7198
#154 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24de50, rho=rho@entry=0x555560b223b8) at eval.c:1013
#155 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a24e748, newrho=newrho@entry=0x555560b223b8, sysparent=<optimized out>, rho=rho@entry=0x555560b224d0, arglist=arglist@entry=0x555560b22428, op=op@entry=0x55555a24e1d0) at eval.c:2187
#156 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a24e748, op=op@entry=0x55555a24e1d0, arglist=arglist@entry=0x555560b22428, rho=rho@entry=0x555560b224d0, suppliedvars=<optimized out>) at eval.c:2113
#157 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#158 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24e8d0, rho=rho@entry=0x555560b224d0) at eval.c:1013
#159 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a24f0e8, newrho=newrho@entry=0x555560b224d0, sysparent=<optimized out>, rho=rho@entry=0x555560b225e8, arglist=arglist@entry=0x555560b22578, op=op@entry=0x55555a24ec88) at eval.c:2187
#160 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a24f0e8, op=op@entry=0x55555a24ec88, arglist=arglist@entry=0x555560b22578, rho=rho@entry=0x555560b225e8, suppliedvars=<optimized out>) at eval.c:2113
#161 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#162 0x00007ffff6d58b30 in Rf_eval (e=0x55555a24f190, rho=rho@entry=0x555560b225e8) at eval.c:1013
#163 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x55555a249780, newrho=newrho@entry=0x555560b225e8, sysparent=<optimized out>, rho=rho@entry=0x555560b22700, arglist=arglist@entry=0x555560b22658, op=op@entry=0x55555a248c20) at eval.c:2187
#164 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x55555a249780, op=op@entry=0x55555a248c20, arglist=arglist@entry=0x555560b22658, rho=rho@entry=0x555560b22700, suppliedvars=<optimized out>) at eval.c:2113
#165 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#166 0x00007ffff6d58b30 in Rf_eval (e=0x55555a249898, rho=rho@entry=0x555560b22700) at eval.c:1013
#167 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x555559ff1640, newrho=newrho@entry=0x555560b22700, sysparent=<optimized out>, rho=rho@entry=0x555560b22cb0, arglist=arglist@entry=0x555560b22818, op=op@entry=0x55555a249cf8) at eval.c:2187
#168 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x555559ff1640, op=op@entry=0x55555a249cf8, arglist=arglist@entry=0x555560b22818, rho=rho@entry=0x555560b22cb0, suppliedvars=<optimized out>) at eval.c:2113
#169 0x00007ffff6d49b2d in bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:7414
#170 0x00007ffff6d58b30 in Rf_eval (e=0x555559feee60, rho=rho@entry=0x555560b22cb0) at eval.c:1013
#171 0x00007ffff6d5abc6 in R_execClosure (call=call@entry=0x555560b22ea8, newrho=newrho@entry=0x555560b22cb0, sysparent=<optimized out>, rho=rho@entry=0x555555d41e68, arglist=arglist@entry=0x555560b22e00, op=op@entry=0x555559fef020) at eval.c:2187
#172 0x00007ffff6d5b9f5 in Rf_applyClosure (call=call@entry=0x555560b22ea8, op=op@entry=0x555559fef020, arglist=arglist@entry=0x555560b22e00, rho=rho@entry=0x555555d41e68, suppliedvars=<optimized out>) at eval.c:2113
#173 0x00007ffff6d58c5c in Rf_eval (e=0x555560b22ea8, rho=0x555555d41e68) at eval.c:1140
#174 0x00007ffff6ce70c2 in protectedEval (d=d@entry=0x7fffffffcaa0) at context.c:851
#175 0x00007ffff6ce8a1a in R_ToplevelExec (fun=fun@entry=0x7ffff6ce70a0 <protectedEval>, data=data@entry=0x7fffffffcaa0) at context.c:799
#176 0x00007ffff6ce8a8d in R_tryEval (e=<optimized out>, env=<optimized out>, ErrorOccurred=0x555560322390) at context.c:865
#177 0x00007ffff7bce5bf in _cffi_f_R_tryEval (self=<optimized out>, args=<optimized out>) at build/temp.linux-x86_64-cpython-310/_rinterface_cffi_api.c:3402

etiennebacher commented 9 months ago

Thanks for the report. Unfortunately I have this issue when I try to install rpy2: https://github.com/rpy2/rpy2/issues/1044

Downgrading pip and rpy2 didn't solve it so I can't reproduce this bug

lgautier commented 9 months ago

> Thanks for the report. Unfortunately I have this issue when I try to install rpy2: rpy2/rpy2#1044
>
> Downgrading pip and rpy2 didn't solve it so I can't reproduce this bug

Hi, rpy2 is mostly not supported on Windows, unless you use WSL.

etiennebacher commented 9 months ago

I should have access to a machine with Ubuntu in the next few days, I'll see then (unless @eitsupi deals with this first)

eitsupi commented 9 months ago

Thanks for the report. This is curious because it seems to read through an Arrow file with no problem.

>>> import polars as pl
>>> import pyarrow.feather as feather
>>> 
>>> pl.DataFrame({'a': ['wx', 'yz', 'wx']}, schema = {'a': pl.Categorical}).write_ipc("test.arrow")
>>> feather.read_table("test.arrow")
pyarrow.Table
a: dictionary<values=large_string, indices=uint32, ordered=0>
----
a: [  -- dictionary:
["wx","yz"]  -- indices:
[0,1,0]]

> arrow::read_ipc_file("test.arrow")
Error: Cannot convert Dictionary Array of type `dictionary<values=large_string, indices=uint32, ordered=0>` to R

> polars::pl$scan_ipc("test.arrow")$collect()
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ cat │
╞═════╡
│ wx  │
│ yz  │
│ wx  │
└─────┘
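
For readers unfamiliar with the `dictionary<values=..., indices=...>` type in the pyarrow output above: a dictionary-encoded column stores each distinct value once, plus one integer index per row. The following plain-Python sketch only illustrates that layout. It is not how Arrow or polars implement it, the function name is hypothetical, and nulls are ignored.

```python
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Split a column into (distinct values, one integer index per row)."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            # First time this value is seen: give it the next free slot.
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


dictionary, indices = dictionary_encode(["wx", "yz", "wx"])
print(dictionary, indices)  # ['wx', 'yz'] [0, 1, 0]
```

The result matches the `["wx","yz"]` dictionary and `[0,1,0]` indices printed by pyarrow for the Categorical column.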

I don't know what's in rpy2arrow.pyarrow_table_to_r_table, but is it possible that it goes through the R arrow package? Maybe related to apache/arrow#39603

The crash seems to come from here: https://github.com/apache/arrow/blob/05b8f366e17ee6f21df4746bb6a65be399dfb68d/r/R/arrowExports.R#L311-L313

> arrow::read_ipc_file("test.arrow", as_data_frame = FALSE) |> polars::as_polars_df()

 *** caught segfault ***
address 0x28, cause 'memory not mapped'

Traceback:
 1: (function (array, array_ptr, schema_ptr) {    invisible(.Call(`_arrow_ExportArray`, array, array_ptr, schema_ptr))})(<environment>, <pointer: 0x7fe103882000>, <pointer: 0x7fe103882050>)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

eitsupi commented 9 months ago

Hmmm, the conversion between R arrow and pyarrow seems to be fine, so it may be a problem in this package.

> at <- arrow::read_ipc_file("test.arrow", as_data_frame = FALSE)

> reticulate::r_to_py(at)
pyarrow.Table
a: dictionary<values=large_string, indices=uint32, ordered=0>
----
a: [  -- dictionary:
["wx","yz"]  -- indices:
[0,1,0]]

> reticulate::r_to_py(at) |> reticulate::py_to_r()
Table
3 rows x 1 columns
$a <dictionary<values=large_string, indices=uint32>>

The conversion from chunk to Series seems to be working well.

> polars::.pr$Series$from_arrow("foo", at$a$chunks[[1]])
$ok
polars Series: shape: (3,)
Series: 'foo' [cat]
[
        "wx"
        "yz"
        "wx"
]

$err
NULL

attr(,"class")
[1] "extendr_result"

Also works.

> rbr <- arrow::read_ipc_file("test.arrow", as_data_frame = FALSE) |> arrow::as_record_batch_reader()

> polars::.pr$DataFrame$from_arrow_record_batches(rbr$batches())
$ok
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ cat │
╞═════╡
│ wx  │
│ yz  │
│ wx  │
└─────┘

$err
NULL

attr(,"class")
[1] "extendr_result"

Crashes.

> at <- arrow::read_ipc_file("test.arrow", as_data_frame = FALSE)

> polars:::arrow_to_rdf(at)

 *** caught segfault ***
address 0x28, cause 'memory not mapped'

Traceback:
 1: (function (array, array_ptr, schema_ptr) {    invisible(.Call(`_arrow_ExportArray`, array, array_ptr, schema_ptr))})(<environment>, <pointer: 0x7f9cba482000>, <pointer: 0x7f9cba482050>)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

This is probably what is causing the problem.

https://github.com/pola-rs/r-polars/blob/fd7e03c94d2636fce23207d2a2e525d6fe12765d/R/construction.R#L21-L24

> at$a |> polars:::is_arrow_dictonary()
[1] TRUE
> at$a |> polars:::arrow_to_rseries_result("foo", values = _, rechunk = TRUE)

 *** caught segfault ***
address 0x28, cause 'memory not mapped'

Traceback:
 1: (function (array, array_ptr, schema_ptr) {    invisible(.Call(`_arrow_ExportArray`, array, array_ptr, schema_ptr))})(<environment>, <pointer: 0x7fc9da882000>, <pointer: 0x7fc9da882050>)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

https://github.com/pola-rs/r-polars/blob/fd7e03c94d2636fce23207d2a2e525d6fe12765d/R/construction.R#L137-L167

eitsupi commented 9 months ago

Related to #497

eitsupi commented 9 months ago

Hi @lgautier, this bug has been fixed in the main branch and will be included in the next release. Since the next release will include the deprecation of pl$from_arrow (#728), we would appreciate it if you would use as_polars_df instead of pl$from_arrow.

lgautier commented 9 months ago

Thanks @eitsupi. That was quick!

Three questions:

eitsupi commented 9 months ago

  • Is there a binary build I can already try?

Unfortunately, we haven't built any binaries on the main branch (I'll add CI next time), so there are no binaries to try right away.

  • Will the next release be 0.13 rather than 0.12.3 (since deprecation would break the API)?

The next release will be 0.13.0. There are currently few breaking changes (See the NEWS.md file), but there is a possibility that Rust Polars will be updated before release, which may introduce more breaking changes.

  • Do you have a rough date estimate for that next release?

Probably a few days to a week?

@etiennebacher Any thoughts on the next release? To be honest, I don't have the energy to update Rust Polars right now, so I think it's okay to release 0.13.0 right away and hold off on updating Rust Polars until 0.13.x or 0.14.0.

etiennebacher commented 9 months ago

I think the next rust-polars release will not happen for at least 2-3 weeks (but it's a bit uncertain of course).

On our side, I think we could make a new release in 1-2 weeks. I'd like to tackle the envvars handling so that we can release 0.13.0 with good docs for envvars and options, but the next week is gonna be quite busy for me. In any case, we have enough stuff to release so the next rust-polars update could go in 0.14.0

eitsupi commented 9 months ago

@lgautier The new version binaries can be installed from R-universe for now.

lgautier commented 9 months ago

Thanks a lot for the update. I meant to add last week that a release in several weeks works, and that in the meantime binary builds for main (even if only nightly) would help rpy2-arrow prepare for it.
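
While waiting for the release, downstream code could guard the Categorical path on the installed R polars version. This is only a sketch: it assumes the fix first ships in 0.13.0, as stated earlier in this thread, and the names `FIXED_IN`, `parse_version` and `has_categorical_fix` are hypothetical. How the version string is obtained from R (for instance through `packageVersion("polars")`) is left out.

```python
from typing import Tuple

# Assumption taken from this thread: the segfault fix first ships in 0.13.0.
FIXED_IN: Tuple[int, int, int] = (0, 13, 0)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse the first three numeric components of an R package version."""
    parts = version.replace("-", ".").split(".")
    numbers = [int(part) for part in parts[:3]]
    # Pad short versions such as "0.13" with zeros.
    while len(numbers) < 3:
        numbers.append(0)
    return (numbers[0], numbers[1], numbers[2])


def has_categorical_fix(version: str) -> bool:
    """Return True if `version` is at least the assumed fixed release."""
    return parse_version(version) >= FIXED_IN
```

Note that a development build numbered like `0.12.2.9000` would be reported as unfixed even if it was built from main after the fix, so the check errs on the conservative side.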