pola-rs / r-polars

Polars R binding
https://pola-rs.github.io/r-polars/
Other
473 stars 35 forks source link

Improve handling of Polars Int64 to R #465

Closed eitsupi closed 9 months ago

eitsupi commented 12 months ago

Discussed in #455

Since R doesn't have the Int64 type and the bit64 package is not a hard dependency, it would be great if they could improve the process of converting Int64 to the R type.

sorhawell commented 12 months ago

for many use cases I think converting i64 to f64 is a good trade-off as it full precision until 2^53, and larger numbers are rarely needed. If the i64/u64 is sampled from a uniform distribution, then the vast majority of numbers between 2^53 and 2^64.

We could have an options like

pl$set_options(
    bigint_conversion = c("bit64", "real", "string"),
    overflow = c("error", "warning", "silent")
)
eitsupi commented 11 months ago

Agree. Perhaps we can use DuckDB's strategy here.

etiennebacher commented 11 months ago

Agreed with using set_options().

I'm not sure where the conversion should happen. The i64 columns only become a problem when we convert them to R so we should only need to handle them in to_r() and to_data_frame(), right?

eitsupi commented 11 months ago

we should only need to handle them in to_r() and to_data_frame(), right?

I think so.

sorhawell commented 11 months ago

we can already control behavior today here

But I would like to refactor it to a proper option.

eitsupi commented 11 months ago

nanoarrow also converts int64 to numeric() by default and supports bit64::integer64() optional.

nanoarrow::infer_nanoarrow_ptype(nanoarrow::na_int64())
#> numeric(0)

na <- nanoarrow::as_nanoarrow_array(10, schema = nanoarrow::na_int64())
na
#> <nanoarrow_array int64[1]>
#>  $ length    : int 1
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 2
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   ..$ :<nanoarrow_buffer data<int64>[1][8 b]> `10`
#>  $ dictionary: NULL
#>  $ children  : list()

nanoarrow::convert_array(na) |> class()
#> [1] "numeric"

nanoarrow::convert_array(na, bit64::integer64()) |> class()
#> [1] "integer64"

Created on 2023-11-13 with reprex v2.0.2

etiennebacher commented 11 months ago

But I would like to refactor it to a proper option.

@sorhawell FYI if you're too busy, I can take a stab at this, just let me know

sorhawell commented 11 months ago

My kid got a bit sick now, so will not get to do anything after all tonight :) Be my guest to give it a go <3 I spam you some thoughts about how it could be implemented.

But any incremental improvement is appreciated.

In generel we need to decide how to pass option state into rust side. This current pattern controlling via bigint/int64 args will bloat the function signatures.

Let's call that "type 1 - set option via arg"

Some option states like global_string_cache must be threadsafe and has its own dedicated rust implementation with mutex, atomics and what not. Let's call that "type 2 - set/get option via rust dedicated (thread safe?) state"

This option bigint I think does not need to be thread safe as implemented Series to R vector is mostly single threaded. I think one day we could rewrite multithreaded. Only transferring of ownership to R needs to be single threaded. Anyways.

We can set the non thread safe states as environment variables in R (Sys.setenv) and read these from rust. Py-polars does this also alot and we should for most part just follow py-polars. Let's call this "type 3 - set option via envvar" Type 3 can also be set via environment variables from bash before starting R and using polars. Py-polars require prefixing all environment variables outside python with POLARS_ and maps the environment variable name. Type 3 should also be settable via pl$set_options the state is just stored in an environment variable and is empty string "" if undefined. Issue on using more environment variables #95.

The last variant is "type 4 set option on R side" we currently have an R-environment (hashmap) were these are stored.

I think in future pl$set_options will need to handle type 2,3,4 and perhaps 1.