Closed meztez closed 12 months ago
Looks like we could just call naive_time_parse()
directly? (And a nice side effect would be that we don't need to set the format
string)
clock:::as.POSIXct.clock_zoned_time
would still drop the seconds if I understand correctly. My comprehension is that since POSIXct is a seconds data type? decimals get dropped.
This line here: https://github.com/r-lib/clock/blob/adc01b61670b18463cc3087f1e58acf59ddc3915/vignettes/faq.Rmd#L236 https://github.com/r-lib/clock/blob/adc01b61670b18463cc3087f1e58acf59ddc3915/R/posixt.R#L143
Oh hmm, hopefully @davisvaughan can suggest something for us.
Yea clock aggressively asserts that POSIXct is "seconds" precision due to:
But I realize sometimes you just want the "good enough" subsecond POSIXct, so you can use something like this. Ironically it is sometimes faster than date_time_parse()
because we get to take a shortcut through sys-time rather than naive-time since we know we are going to UTC.
bq_datetime_parse <- function(x) {
# `format` matches clock already.
# Bigquery DATETIME is documented as microsecond precision.
# https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#datetime_type
x <- clock::year_month_day_parse(x, precision = "microsecond")
# Manually retain microseconds for the POSIXct
microseconds <- clock::get_microsecond(x)
microseconds <- microseconds / 1000000
# Convert to POSIXct at second precision since that is what clock asserts
# the precision of a POSIXct is. Can use sys-time since we are going straight
# to UTC.
x <- clock::calendar_narrow(x, "second")
x <- clock::as_sys_time(x)
x <- as.POSIXct(x, tz = "UTC")
# Manually add microseconds back on (lossy, floating point issues!)
x <- x + microseconds
x
}
options(digits.secs = 22)
x <- c(
"2000-01-02T03:04:05.67",
"2000-01-02T03:04:05",
"2000-01-02T03:04:05.123456"
)
# note how this isn't exactly right! floating point errors!
bq_datetime_parse(x)
#> [1] "2000-01-02 03:04:05.669999 UTC" "2000-01-02 03:04:05.000000 UTC"
#> [3] "2000-01-02 03:04:05.123456 UTC"
# but not any worse than base R
as.POSIXct(x, tz = "UTC", format = "%Y-%m-%dT%H:%M:%OS")
#> [1] "2000-01-02 03:04:05.669999 UTC" "2000-01-02 03:04:05.000000 UTC"
#> [3] "2000-01-02 03:04:05.123456 UTC"
# probably around the same speed as the previous way
set.seed(123)
x <- .POSIXct(sample(10000, 1e6, replace = TRUE), tz = "UTC")
x <- clock::date_format(x, format = "%Y-%m-%dT%H:%M:%S")
bench::mark(
clock::date_time_parse(x, format = "%Y-%m-%dT%H:%M:%S", zone = "UTC"),
bq_datetime_parse(x),
iterations = 50
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 "clock::date_time_parse(x, format =… 391ms 397ms 2.50 68.8MB 5.09
#> 2 "bq_datetime_parse(x)" 391ms 399ms 2.46 99.2MB 6.59
Created on 2023-12-04 with reprex v2.0.2
https://github.com/r-dbi/bigrquery/blob/adfef86c6e9ef0bb32e1995fabc07f9741e03cf3/R/bq-download.R#L217
because of "seconds" precision in
clock::date_time_parse
?https://github.com/r-lib/clock/blob/adc01b61670b18463cc3087f1e58acf59ddc3915/R/posixt.R#L1430
I do not have a solution yet, but results differ from what arrow would return using
bigrquerystorage
.https://github.com/r-lib/clock/blob/adc01b61670b18463cc3087f1e58acf59ddc3915/vignettes/faq.Rmd#L216