pola-rs / r-polars

Polars R binding
https://pola-rs.github.io/r-polars/

Support the `clock` package's types #591

Open eitsupi opened 10 months ago

eitsupi commented 10 months ago

It may also make sense to support conversions from the classes provided by the clock package, which has a time type in ns and would be more appropriate for time zone handling.

Originally posted by @eitsupi in https://github.com/pola-rs/r-polars/issues/578#issuecomment-1847203094

sorhawell commented 10 months ago

Did not know of the clock package, but it seems nice.

It seems the clock representations are R numerics (possibly pseudo-integers) split into an upper and a lower part, giving an internal precision of roughly 2^(52+52), which is well over, e.g., the 2^64 of u64 nanoseconds. The downside is lower performance. The upside is that anyone can fairly easily tinker with the internals.

Every time precision (day, second, nanosecond, ...) has the same class; the only difference is a single attribute called precision, which is an enum-like R integer. The subcomponents are glued together into a single vector interface via the vctrs_rcrd class from vctrs.

I have not found the conversion arithmetic yet, but this should be a fairly straightforward conversion using the extendr-api.

It seems year 3712 is not supported: the datetime overflows back to 1958 with no warning. This is surprising for two reasons. Overflowing silently is unexpected for a package that is not designed for speed, and I imagined that if the lower part is allocated to nanoseconds, then the upper part could describe seconds since the origin, which should give some ~140 million years of range.

Both "s" and "ns" also have issues with "0001-01-01 01:01:01.000000001", whereas "ms" and "us" work fine. This has to be a bug, I think.

char_times = c(
  "0001-01-01 01:01:01.000000001",
  "2212-01-01 12:34:57.123456789",
  "3712-01-01 12:34:56.123456789"
)
fmt = "%Y-%m-%d %H:%M:%OS"
clock_times = list(
  ns = clock::naive_time_parse(char_times, format = fmt, precision = "nanosecond"),
  us = clock::naive_time_parse(char_times, format = fmt, precision = "microsecond"),
  ms = clock::naive_time_parse(char_times, format = fmt, precision = "millisecond"),
  s = clock::naive_time_parse(char_times, format = fmt, precision = "nanosecond"),
  d = clock::naive_time_parse(char_times, format = fmt, precision = "day")
)

clock_times
#> $ns
#> <naive_time<nanosecond>[3]>
#> [1] "1754-08-30T23:44:42.128654849" "2212-01-01T12:34:57.123456789"
#> [3] "1958-05-04T13:51:14.994801941"
#> 
#> $us
#> <naive_time<microsecond>[3]>
#> [1] "0001-01-01T01:01:01.000000" "2212-01-01T12:34:57.123456"
#> [3] "3712-01-01T12:34:56.123456"
#> 
#> $ms
#> <naive_time<millisecond>[3]>
#> [1] "0001-01-01T01:01:01.000" "2212-01-01T12:34:57.123"
#> [3] "3712-01-01T12:34:56.123"
#> 
#> $s
#> <naive_time<nanosecond>[3]>
#> [1] "1754-08-30T23:44:42.128654849" "2212-01-01T12:34:57.123456789"
#> [3] "1958-05-04T13:51:14.994801941"
#> 
#> $d
#> <naive_time<day>[3]>
#> [1] "0001-01-01" "2212-01-02" "3712-01-02"
lapply(clock_times,\(x) unclass(x) |> str()) |> invisible()
#> List of 2
#>  $ lower: num [1:3] 5.65e+08 3.93e+09 2.06e+09
#>  $ upper: num [1:3] 2.71e+09 2.67e+09 1.72e+09
#>  - attr(*, "precision")= int 10
#>  - attr(*, "clock")= int 1
#> List of 2
#>  $ lower: num [1:3] 2.13e+09 2.15e+09 2.16e+09
#>  $ upper: num [1:3] 3.67e+09 3.11e+09 3.96e+09
#>  - attr(*, "precision")= int 9
#>  - attr(*, "clock")= int 1
#> List of 2
#>  $ lower: num [1:3] 2.15e+09 2.15e+09 2.15e+09
#>  $ upper: num [1:3] 3.99e+09 3.17e+08 9.32e+08
#>  - attr(*, "precision")= int 8
#>  - attr(*, "clock")= int 1
#> List of 2
#>  $ lower: num [1:3] 5.65e+08 3.93e+09 2.06e+09
#>  $ upper: num [1:3] 2.71e+09 2.67e+09 1.72e+09
#>  - attr(*, "precision")= int 10
#>  - attr(*, "clock")= int 1
#> List of 2
#>  $ lower: num [1:3] 2.15e+09 2.15e+09 2.15e+09
#>  $ upper: num [1:3] 4.29e+09 8.84e+04 6.36e+05
#>  - attr(*, "precision")= int 4
#>  - attr(*, "clock")= int 1
lapply(clock_times, class)
#> $ns
#> [1] "clock_naive_time" "clock_time_point" "clock_rcrd"       "vctrs_rcrd"      
#> [5] "vctrs_vctr"      
#> 
#> $us
#> [1] "clock_naive_time" "clock_time_point" "clock_rcrd"       "vctrs_rcrd"      
#> [5] "vctrs_vctr"      
#> 
#> $ms
#> [1] "clock_naive_time" "clock_time_point" "clock_rcrd"       "vctrs_rcrd"      
#> [5] "vctrs_vctr"      
#> 
#> $s
#> [1] "clock_naive_time" "clock_time_point" "clock_rcrd"       "vctrs_rcrd"      
#> [5] "vctrs_vctr"      
#> 
#> $d
#> [1] "clock_naive_time" "clock_time_point" "clock_rcrd"       "vctrs_rcrd"      
#> [5] "vctrs_vctr"

Created on 2023-12-12 with reprex v2.0.2
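
As a rough aid to reading the `lower`/`upper` fields printed above, the sketch below guesses at the layout purely from those `str()` values; it is not taken from clock's documentation or source. The assumption is that `lower` carries the high 32 bits of a signed 64-bit count, biased by 2^31, and `upper` carries the low 32 bits. Both function names are hypothetical helpers, and only base R is used.

```r
# ASSUMPTION (inferred from the str() output, not from clock itself):
#   count = (lower - 2^31) * 2^32 + upper
# where count is the number of precision units since 1970-01-01.

# Hypothetical helper: split a count into the two assumed fields.
encode_fields <- function(count) {
  high <- floor(count / 2^32)            # signed high 32 bits
  list(lower = high + 2^31,              # biased so the field is non-negative
       upper = count - high * 2^32)      # unsigned low 32 bits
}

# Hypothetical helper: rebuild the count from the two assumed fields.
decode_fields <- function(lower, upper) {
  (lower - 2^31) * 2^32 + upper
}

# Day precision: counts are days since 1970-01-01.
days <- as.numeric(as.Date(c("0001-01-01", "2212-01-02", "3712-01-02")))
fields <- encode_fields(days)
str(fields)

# Round trip back to dates.
decoded <- decode_fields(fields$lower, fields$upper)
as.Date(decoded, origin = "1970-01-01")
```

With these inputs the `upper` values come out near 4.29e9, 8.84e4 and 6.36e5, in line with the day-precision output above. R doubles are exact only up to 2^53, so this arithmetic is fine for days or seconds but would lose low-order digits for nanosecond counts; a Rust-side conversion would presumably work on the 64-bit integer directly.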

eitsupi commented 7 months ago

Corrected example (in the example above, "second" was mistyped as "nanosecond"):

char_times = c(
  "0001-01-01 01:01:01.000000001",
  "2212-01-01 12:34:57.123456789",
  "3712-01-01 12:34:56.123456789"
)
fmt = "%Y-%m-%d %H:%M:%OS"
clock_times = list(
  ns = clock::naive_time_parse(char_times, format = fmt, precision = "nanosecond"),
  us = clock::naive_time_parse(char_times, format = fmt, precision = "microsecond"),
  ms = clock::naive_time_parse(char_times, format = fmt, precision = "millisecond"),
  s = clock::naive_time_parse(char_times, format = fmt, precision = "second"),
  d = clock::naive_time_parse(char_times, format = fmt, precision = "day")
)

clock_times
#> $ns
#> <naive_time<nanosecond>[3]>
#> [1] "1754-08-30T23:44:42.128654849" "2212-01-01T12:34:57.123456789"
#> [3] "1958-05-04T13:51:14.994801941"
#>
#> $us
#> <naive_time<microsecond>[3]>
#> [1] "0001-01-01T01:01:01.000000" "2212-01-01T12:34:57.123456"
#> [3] "3712-01-01T12:34:56.123456"
#>
#> $ms
#> <naive_time<millisecond>[3]>
#> [1] "0001-01-01T01:01:01.000" "2212-01-01T12:34:57.123"
#> [3] "3712-01-01T12:34:56.123"
#>
#> $s
#> <naive_time<second>[3]>
#> [1] "0001-01-01T01:01:01" "2212-01-01T12:34:57" "3712-01-01T12:34:56"
#>
#> $d
#> <naive_time<day>[3]>
#> [1] "0001-01-01" "2212-01-02" "3712-01-02"

lapply(clock_times,\(x) unclass(x) |> str()) |> invisible()
#> List of 2
#>  $ lower: num [1:3] 5.65e+08 3.93e+09 2.06e+09
#>  $ upper: num [1:3] 2.71e+09 2.67e+09 1.72e+09
#>  - attr(*, "precision")= int 10
#>  - attr(*, "clock")= int 1
#> List of 2
#>  $ lower: num [1:3] 2.13e+09 2.15e+09 2.16e+09
#>  $ upper: num [1:3] 3.67e+09 3.11e+09 3.96e+09
#>  - attr(*, "precision")= int 9
#>  - attr(*, "clock")= int 1
#> List of 2
#>  $ lower: num [1:3] 2.15e+09 2.15e+09 2.15e+09
#>  $ upper: num [1:3] 3.99e+09 3.17e+08 9.32e+08
#>  - attr(*, "precision")= int 8
#>  - attr(*, "clock")= int 1
#> List of 2
#>  $ lower: num [1:3] 2.15e+09 2.15e+09 2.15e+09
#>  $ upper: num [1:3] 2.29e+09 3.34e+09 3.43e+09
#>  - attr(*, "precision")= int 7
#>  - attr(*, "clock")= int 1
#> List of 2
#>  $ lower: num [1:3] 2.15e+09 2.15e+09 2.15e+09
#>  $ upper: num [1:3] 4.29e+09 8.84e+04 6.36e+05
#>  - attr(*, "precision")= int 4
#>  - attr(*, "clock")= int 1

Created on 2024-02-27 with reprex v2.0.2
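
One possible reading of the nanosecond results, stated as an assumption rather than something confirmed in this thread: if a nanosecond time point is a signed 64-bit count since 1970-01-01, it can only reach about 292 years either side of 1970, and anything outside wraps around in multiples of about 584.6 years. The base R sketch below does that arithmetic on approximate calendar years; `wrap_year` is a hypothetical helper.

```r
# ASSUMPTION: nanosecond time points are signed 64-bit counts since 1970-01-01.
ns_per_year <- 1e9 * 86400 * 365.2425      # mean Gregorian year in nanoseconds
half_range_years <- 2^63 / ns_per_year     # reach on either side of 1970
wrap_years <- 2^64 / ns_per_year           # length of one full wraparound

# Approximate representable window at nanosecond precision.
ns_window <- 1970 + c(-1, 1) * half_range_years

# Hypothetical helper: fold an out-of-range year back into the window,
# the way two's-complement overflow would.
wrap_year <- function(year) {
  offset <- year - 1970
  wrapped <- ((offset + half_range_years) %% wrap_years) - half_range_years
  1970 + wrapped
}

round(ns_window, 1)
round(wrap_year(c(1, 2212, 3712)), 1)
```

Under that assumption, year 1 wraps to roughly 1754.7 and year 3712 to roughly 1958.3, which lines up with the `1754-08-30` and `1958-05-04` values in the `$ns` output, while 2212 sits inside the window and is unchanged. The coarser precisions have a far wider window, which would explain why they parse the same strings correctly.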

etiennebacher commented 7 months ago

@eitsupi can this be closed too?

eitsupi commented 7 months ago

We need to be able to convert Polars Datetime types to clock types, just as we provide multiple ways to convert Int64. Currently no such API is provided, so we need to wait for an update on the clock side (r-lib/clock#365).

eitsupi commented 1 month ago

e5898b4518cc72ec20890aad0c92e64c41c5d92a supports exporting datetime as clock naive time/zoned time.

eitsupi commented 1 month ago

fa157e680724e8a32fbb84f4fa12c52fd92917b0 supports importing clock_time_point as datetime on the Rust side. It seems to be about twice as fast as the current implementation, which is implemented only on the R side.

library(clock)

time_clock <- seq_len(10^5) |>
  as.POSIXct(tz = "UTC") |>
  as_zoned_time()

bench::mark(
  main = {
    polars::as_polars_series(time_clock)
  },
  neo = {
    neopolars::as_polars_series(time_clock)
  },
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 main        123.8ms  125.4ms      7.97   12.89MB     7.97
#> 2 neo          59.8ms   60.4ms     15.8     5.08MB     2.26

Created on 2024-09-01 with reprex v2.1.1