Closed eitsupi closed 9 months ago
for many use cases I think converting i64 to f64 is a good trade-off as it full precision until 2^53, and larger numbers are rarely needed. If the i64/u64 is sampled from a uniform distribution, then the vast majority of numbers between 2^53 and 2^64.
We could have an options like
pl$set_options(
bigint_conversion = c("bit64", "real", "string"),
overflow = c("error", "warning", "silent")
)
Agree. Perhaps we can use DuckDB's strategy here.
Agreed with using set_options()
.
I'm not sure where the conversion should happen. The i64
columns only become a problem when we convert them to R so we should only need to handle them in to_r()
and to_data_frame()
, right?
we should only need to handle them in
to_r()
andto_data_frame()
, right?
I think so.
we can already control behavior today here
But I would like to refactor it to a proper option.
nanoarrow
also converts int64 to numeric()
by default and supports bit64::integer64()
optional.
nanoarrow::infer_nanoarrow_ptype(nanoarrow::na_int64())
#> numeric(0)
na <- nanoarrow::as_nanoarrow_array(10, schema = nanoarrow::na_int64())
na
#> <nanoarrow_array int64[1]>
#> $ length : int 1
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 2
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> ..$ :<nanoarrow_buffer data<int64>[1][8 b]> `10`
#> $ dictionary: NULL
#> $ children : list()
nanoarrow::convert_array(na) |> class()
#> [1] "numeric"
nanoarrow::convert_array(na, bit64::integer64()) |> class()
#> [1] "integer64"
Created on 2023-11-13 with reprex v2.0.2
But I would like to refactor it to a proper option.
@sorhawell FYI if you're too busy, I can take a stab at this, just let me know
My kid got a bit sick now, so will not get to do anything after all tonight :) Be my guest to give it a go <3 I spam you some thoughts about how it could be implemented.
But any incremental improvement is appreciated.
In generel we need to decide how to pass option state into rust side. This current pattern controlling via bigint/int64 args will bloat the function signatures.
Let's call that "type 1 - set option via arg"
Some option states like global_string_cache must be threadsafe and has its own dedicated rust implementation with mutex, atomics and what not. Let's call that "type 2 - set/get option via rust dedicated (thread safe?) state"
This option bigint I think does not need to be thread safe as implemented Series to R vector is mostly single threaded. I think one day we could rewrite multithreaded. Only transferring of ownership to R needs to be single threaded. Anyways.
We can set the non thread safe states as environment variables in R (Sys.setenv) and read these from rust. Py-polars does this also alot and we should for most part just follow py-polars. Let's call this "type 3 - set option via envvar" Type 3 can also be set via environment variables from bash before starting R and using polars. Py-polars require prefixing all environment variables outside python with POLARS_ and maps the environment variable name. Type 3 should also be settable via pl$set_options the state is just stored in an environment variable and is empty string "" if undefined. Issue on using more environment variables #95.
The last variant is "type 4 set option on R side" we currently have an R-environment (hashmap) were these are stored.
I think in future pl$set_options will need to handle type 2,3,4 and perhaps 1.
Discussed in #455
Since R doesn't have the Int64 type and the
bit64
package is not a hard dependency, it would be great if they could improve the process of converting Int64 to the R type.