r-dbi / bigrquery

An interface to Google's BigQuery from R.
https://bigrquery.r-dbi.org

Seconds decimal dropped in new parsing #591

Closed. meztez closed this issue 12 months ago.

meztez commented 1 year ago

https://github.com/r-dbi/bigrquery/blob/adfef86c6e9ef0bb32e1995fabc07f9741e03cf3/R/bq-download.R#L217

ISOdatetime(2000, 1, 2, 3, 4, 5.67, tz = "UTC") - clock::date_time_parse("2000-01-02T03:04:05.67", format = "%Y-%m-%dT%H:%M:%S", zone = "UTC")

Is this because of the "seconds" precision used in clock::date_time_parse(), which drops the fractional seconds?

https://github.com/r-lib/clock/blob/adc01b61670b18463cc3087f1e58acf59ddc3915/R/posixt.R#L1430

I do not have a solution yet, but results differ from what arrow would return using bigrquerystorage.

https://github.com/r-lib/clock/blob/adc01b61670b18463cc3087f1e58acf59ddc3915/vignettes/faq.Rmd#L216
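
To spell out the comparison above, roughly what each side returns (an annotated sketch based on the behaviour described in this issue, not output copied from a session):

# keeps the fractional .67 seconds
ISOdatetime(2000, 1, 2, 3, 4, 5.67, tz = "UTC")

# parses at "second" precision, so the .67 is dropped and the difference
# above comes out to roughly 0.67 seconds
clock::date_time_parse("2000-01-02T03:04:05.67", format = "%Y-%m-%dT%H:%M:%S", zone = "UTC")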

hadley commented 1 year ago

Looks like we could just call naive_time_parse() directly? (And a nice side effect would be that we don't need to set the format string)
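
Something along these lines, presumably (just a sketch of the idea, not a tested patch; naive_time_parse() defaults to the "%Y-%m-%dT%H:%M:%S" format, which is why no format string would be needed):

x <- clock::naive_time_parse("2000-01-02T03:04:05.67")
as.POSIXct(clock::as_zoned_time(x, "UTC"))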

meztez commented 1 year ago

clock:::as.POSIXct.clock_zoned_time would still drop the seconds if I understand correctly. My understanding is that since POSIXct is treated as a seconds-precision data type, the decimals get dropped.

These lines here:
https://github.com/r-lib/clock/blob/adc01b61670b18463cc3087f1e58acf59ddc3915/vignettes/faq.Rmd#L236
https://github.com/r-lib/clock/blob/adc01b61670b18463cc3087f1e58acf59ddc3915/R/posixt.R#L143

hadley commented 1 year ago

Oh hmm, hopefully @davisvaughan can suggest something for us.

DavisVaughan commented 12 months ago

Yeah, clock aggressively asserts that POSIXct is "seconds" precision, for the reasons laid out in the clock FAQ linked above.

But I realize sometimes you just want the "good enough" sub-second POSIXct, so you can use something like this. Ironically, it is sometimes faster than date_time_parse() because we get to take a shortcut through sys-time rather than naive-time, since we know we are going to UTC.

bq_datetime_parse <- function(x) {
  # BigQuery's DATETIME strings already match clock's default `format`,
  # so no format string needs to be supplied.
  # BigQuery DATETIME is documented as microsecond precision.
  # https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#datetime_type
  x <- clock::year_month_day_parse(x, precision = "microsecond")

  # Manually retain microseconds for the POSIXct
  microseconds <- clock::get_microsecond(x)
  microseconds <- microseconds / 1000000

  # Convert to POSIXct at second precision since that is what clock asserts
  # the precision of a POSIXct is. Can use sys-time since we are going straight
  # to UTC.
  x <- clock::calendar_narrow(x, "second")
  x <- clock::as_sys_time(x)
  x <- as.POSIXct(x, tz = "UTC")

  # Manually add microseconds back on (lossy, floating point issues!)
  x <- x + microseconds

  x
}

options(digits.secs = 22)

x <- c(
  "2000-01-02T03:04:05.67",
  "2000-01-02T03:04:05",
  "2000-01-02T03:04:05.123456"
)

# note how this isn't exactly right! floating point errors!
bq_datetime_parse(x)
#> [1] "2000-01-02 03:04:05.669999 UTC" "2000-01-02 03:04:05.000000 UTC"
#> [3] "2000-01-02 03:04:05.123456 UTC"

# but not any worse than base R
as.POSIXct(x, tz = "UTC", format = "%Y-%m-%dT%H:%M:%OS")
#> [1] "2000-01-02 03:04:05.669999 UTC" "2000-01-02 03:04:05.000000 UTC"
#> [3] "2000-01-02 03:04:05.123456 UTC"

# probably around the same speed as the previous way
set.seed(123)
x <- .POSIXct(sample(10000, 1e6, replace = TRUE), tz = "UTC")
x <- clock::date_format(x, format = "%Y-%m-%dT%H:%M:%S")

bench::mark(
  clock::date_time_parse(x, format = "%Y-%m-%dT%H:%M:%S", zone = "UTC"),
  bq_datetime_parse(x),
  iterations = 50
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "clock::date_time_parse(x, format =… 391ms  397ms      2.50    68.8MB     5.09
#> 2 "bq_datetime_parse(x)"               391ms  399ms      2.46    99.2MB     6.59

Created on 2023-12-04 with reprex v2.0.2