r-lib / clock

A Date-Time Library for R
https://clock.r-lib.org

zoned_time convert to naive_time with timezone vector #301

Closed eitsupi closed 2 years ago

eitsupi commented 2 years ago

Related to tidyverse/lubridate#1063

Thank you for developing this wonderful package. It seems useful for complex date-time processing, but is it possible to convert timestamps, each with its own time zone, to the local time in that zone, with no time zone attached to the result?

What I would like to do is the following, but so far I have not been able to vectorize it.

df <- readr::read_csv(I("
id,timestamp,timezone
1,2019-01-01T00:00:00Z,UTC
2,2019-01-01T00:00:00Z,Asia/Tokyo
3,2019-01-01T20:00:00Z,UTC
4,2019-01-01T20:00:00Z,Asia/Tokyo
"), show_col_types = FALSE)

df |>
  dplyr::mutate(
    timestamp = clock::as_zoned_time(timestamp)
  ) |>
  dplyr::rowwise() |>
  dplyr::summarise(
    id = id,
    local_timestamp = clock::zoned_time_set_zone(timestamp, timezone) |>
      clock::as_naive_time()
  ) |>
  dplyr::right_join(df, by = "id")
#> # A tibble: 4 × 4
#>      id local_timestamp     timestamp           timezone
#>   <dbl> <tp<naive><second>> <dttm>              <chr>
#> 1     1 2019-01-01T00:00:00 2019-01-01 00:00:00 UTC
#> 2     2 2019-01-01T09:00:00 2019-01-01 00:00:00 Asia/Tokyo
#> 3     3 2019-01-01T20:00:00 2019-01-01 20:00:00 UTC
#> 4     4 2019-01-02T05:00:00 2019-01-01 20:00:00 Asia/Tokyo

Created on 2022-09-13 with reprex v2.0.2

DavisVaughan commented 2 years ago

It is possible; you are looking for sys_time_info(). That and naive_time_info() are the only functions in clock where it makes sense to have a vectorized zone argument.

Since your original timestamp values are in UTC, you can convert them straight to sys-time. Then you can use sys_time_info() on that by also providing your vector of time zones. That gives you a data frame with a lot of information back, but really what you care about is the offset from UTC. Adding that offset to the sys-time gives you the "local time", and it is good practice to then convert that to a naive-time (because it is no longer UTC).

It should be very fast because it is vectorized.

See also https://stackoverflow.com/questions/73241828/how-to-convert-a-column-of-utc-timestamps-into-several-different-timezones/73282728#73282728

library(dplyr)
library(clock)

df <- readr::read_csv(I("
id,timestamp,timezone
1,2019-01-01T00:00:00Z,UTC
2,2019-01-01T00:00:00Z,Asia/Tokyo
3,2019-01-01T20:00:00Z,UTC
4,2019-01-01T20:00:00Z,Asia/Tokyo
"), show_col_types = FALSE)

df <- df %>%
  mutate(sys_time = as_sys_time(timestamp), .keep = "unused")

df
#> # A tibble: 4 × 3
#>      id timezone   sys_time           
#>   <dbl> <chr>      <clck_sy_>         
#> 1     1 UTC        2019-01-01T00:00:00
#> 2     2 Asia/Tokyo 2019-01-01T00:00:00
#> 3     3 UTC        2019-01-01T20:00:00
#> 4     4 Asia/Tokyo 2019-01-01T20:00:00

# All the info you get from `sys_time_info()`.
# You need `offset`.
sys_time_info(df$sys_time, df$timezone)
#>                   begin                  end offset   dst abbreviation
#> 1 -32767-01-01T00:00:00 32767-12-31T00:00:00      0 FALSE          UTC
#> 2   1951-09-08T15:00:00 32767-12-31T00:00:00  32400 FALSE          JST
#> 3 -32767-01-01T00:00:00 32767-12-31T00:00:00      0 FALSE          UTC
#> 4   1951-09-08T15:00:00 32767-12-31T00:00:00  32400 FALSE          JST

df %>%
  mutate(
    offset = sys_time_info(sys_time, timezone)$offset,
    naive_time = as_naive_time(sys_time + offset)
  )
#> # A tibble: 4 × 5
#>      id timezone   sys_time                   offset naive_time         
#>   <dbl> <chr>      <clck_sy_>          <dur<second>> <clck_nv_>         
#> 1     1 UTC        2019-01-01T00:00:00             0 2019-01-01T00:00:00
#> 2     2 Asia/Tokyo 2019-01-01T00:00:00         32400 2019-01-01T09:00:00
#> 3     3 UTC        2019-01-01T20:00:00             0 2019-01-01T20:00:00
#> 4     4 Asia/Tokyo 2019-01-01T20:00:00         32400 2019-01-02T05:00:00

Created on 2022-09-13 with reprex v2.0.2
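The comment above also mentions naive_time_info() as the other function with a vectorized zone argument. As a rough sketch of the reverse direction (local wall-clock times plus a zone per element, back to UTC), something like the following should work. The input values are made up for illustration, and the sketch assumes every local time is unambiguous in its zone; times that fall in a daylight saving gap or overlap would need extra handling.

```r
library(clock)

# Illustrative inputs: local wall-clock times and the zone each one belongs to
naive <- naive_time_parse(c("2019-01-01T00:00:00", "2019-01-01T09:00:00"))
zones <- c("UTC", "Asia/Tokyo")

# `zone` is recycled against `x`, so each element is looked up in its own zone
info <- naive_time_info(naive, zones)

# Assumption of this sketch: only "unique" local times are supported.
# "nonexistent" and "ambiguous" times (DST gaps and overlaps) are rejected.
stopifnot(all(info$type == "unique"))

# For unique times, `first` holds the applicable sys_time_info() data frame
offset <- info$first$offset

# Subtracting the offset from the local time gives the UTC instant
sys <- as_sys_time(naive) - offset
format(sys)
```

Both elements map back to the same UTC instant, because 09:00 in Asia/Tokyo is 9 hours (32400 seconds) ahead of UTC.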

eitsupi commented 2 years ago

Thanks for the quick and detailed response. This is great! Also, thank you for linking to Stack Overflow. I searched and saw some older answers, but did not come across that one.

I am excited about the features of this package, but its many concepts and large number of functions (I was overwhelmed by the length of the reference page...) make it seem difficult for a novice to write such a process. Is providing a function like this a non-goal of the package? (For example, is there any prospect of lubridate adopting clock as a backend in the future and implementing it there?)

df <- readr::read_csv(I("
id,timestamp,timezone
1,2019-01-01T00:00:00Z,UTC
2,2019-01-01T00:00:00Z,Asia/Tokyo
3,2019-01-01T20:00:00Z,UTC
4,2019-01-01T20:00:00Z,Asia/Tokyo
"), show_col_types = FALSE)

.at_time_zone <- function(x, tz) {
  x <- clock::as_sys_time(x)
  offset <- clock::sys_time_info(x, tz)$offset
  clock::as_naive_time(x + offset) |>
    as.POSIXct()
}

df |>
  dplyr::mutate(
    local_timestamp = .at_time_zone(timestamp, timezone)
  )
#> # A tibble: 4 × 4
#>      id timestamp           timezone   local_timestamp
#>   <dbl> <dttm>              <chr>      <dttm>
#> 1     1 2019-01-01 00:00:00 UTC        2019-01-01 00:00:00
#> 2     2 2019-01-01 00:00:00 Asia/Tokyo 2019-01-01 09:00:00
#> 3     3 2019-01-01 20:00:00 UTC        2019-01-01 20:00:00
#> 4     4 2019-01-01 20:00:00 Asia/Tokyo 2019-01-02 05:00:00

Created on 2022-09-13 with reprex v2.0.2

DavisVaughan commented 2 years ago

The problem is that your local_timestamp column has a time zone on it that is guaranteed to be wrong.

Assuming that the time zone on that local_timestamp column is UTC, that is wrong for the 2nd row, because that row shows the local time in Asia/Tokyo, not the local time in UTC. That's why I used naive-time as my output type; it is a date-time type with a yet-to-be-specified time zone.

Because there is no way a vector can have multiple time zones, you can't provide a helper for this that returns a POSIXct, so there is no way to push this up into lubridate.

This is a fairly specialized operation, so I'm not worried about it requiring clock.
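To make the point above concrete, here is a minimal sketch of a variant of the helper that stops at naive-time instead of converting to POSIXct. The name at_time_zone() is hypothetical (it is not part of clock or lubridate), and the input values are illustrative.

```r
library(clock)

# Hypothetical helper: same steps as `.at_time_zone()` earlier in the thread,
# but the result stays a naive-time, so no time zone is attached to it
at_time_zone <- function(x, tz) {
  x <- as_sys_time(x)
  offset <- sys_time_info(x, tz)$offset
  as_naive_time(x + offset)
}

x <- as.POSIXct(c("2019-01-01 00:00:00", "2019-01-01 20:00:00"), tz = "UTC")

# A POSIXct vector carries a single `tzone` attribute for all of its elements
attr(x, "tzone")

local <- at_time_zone(x, c("UTC", "Asia/Tokyo"))

# If the values must be stored as plain text, format() keeps the local
# wall-clock reading without implying any zone
format(local)
```

The result is a naive-time vector whose second element reads 2019-01-02T05:00:00. Pairing it with the timezone column keeps the zone information explicit rather than stored, incorrectly, on the vector itself.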

eitsupi commented 2 years ago

> Assuming that the time zone on that local_timestamp column is UTC, that is wrong for the 2nd row because that is showing the local time in Asia/Tokyo, not the local time in UTC. That's why I used naive-time as my output type, it is a date-time type with a yet-to-be-specified time zone.

Yes, it is definitely a compromise that the column holds time zone information that shouldn't be there.