r-lib / clock

A Date-Time Library for R
https://clock.r-lib.org

parsing year-quarter formats #300

Open TimTaylor opened 1 year ago

TimTaylor commented 1 year ago

lubridate's custom parser lets us parse quarters with the %q format. Do you think this functionality will be added to clock? Currently I pre-process in base R and then hand the result to clock, but the string manipulation overhead becomes noticeable for larger vectors compared to the C-level parser used by lubridate. Example of the functionality below:

options(lubridate.verbose = TRUE)
dat <- "2021q2"
lubridate::yq(dat)
#>  1 parsed with %Yq%q
#> [1] "2021-04-01"
lubridate::fast_strptime(dat, "%Yq%q")
#> [1] "2021-04-01 UTC"

Created on 2022-08-11 by the reprex package (v2.0.1)
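
For reference, the base R pre-processing mentioned above might look roughly like the sketch below. It assumes every element has exactly the shape of a four-digit year, one separator character, and a one-digit quarter; anything else would need validating first. The variable names are illustrative only.

```r
library(clock)

dat <- c("2021q2", "2021q3")

# Assumption: fixed-width input, so positions 1-4 hold the year
# and position 6 holds the quarter.
year <- as.integer(substr(dat, 1L, 4L))
quarter <- as.integer(substr(dat, 6L, 6L))

# Build a day-precision year_quarter_day (day 1 of each quarter),
# then convert to Date.
res <- as_date(year_quarter_day(year, quarter, 1L))
res
#> [1] "2021-04-01" "2021-07-01"
```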

DavisVaughan commented 1 year ago

Yeah, I'd like to add clock::year_quarter_day_parse() (like year_month_day_parse()) that would allow you to handle this. You could then convert the result to Date/POSIXct with as_date() or as_date_time().
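
To illustrate the analogy, here is how the existing year_month_day_parse() handles a month-precision string. This is only a sketch of the parse-then-convert pattern; the eventual interface of a quarter parser is not specified in this thread and may differ.

```r
library(clock)

# Parse "year-month" strings straight into a month-precision calendar type.
ym <- year_month_day_parse("2021-04", format = "%Y-%m", precision = "month")

# Increase the precision to day, then convert to Date.
d <- as_date(set_day(ym, 1))
d
#> [1] "2021-04-01"
```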

DavisVaughan commented 1 year ago

I imagine this is probably the fastest way in the meantime:

library(clock)
library(stringr)

dat <- c("2021q2", "2021q3")
dat <- str_split_fixed(dat, "q", 2)
dat
#>      [,1]   [,2]
#> [1,] "2021" "2" 
#> [2,] "2021" "3"

year <- as.integer(dat[, 1, drop = TRUE])
quarter <- as.integer(dat[, 2, drop = TRUE])

yq <- year_quarter_day(year, quarter)
yq
#> <year_quarter_day<January><quarter>[2]>
#> [1] "2021-Q2" "2021-Q3"

# Then if you need Date
as_date(set_day(yq, 1))
#> [1] "2021-04-01" "2021-07-01"

That method took 1.5 seconds with 2 million strings.

TimTaylor commented 1 year ago

Cool - year_quarter_day_parse() would be great to have. For completeness, and comparison with yq(), here's the closest I think we can currently get without the additional parser (main difference from above being stringi over stringr):

library(stringi)
library(lubridate, include.only = "yq")
library(clock)
library(microbenchmark)

n <- 2000000L
yrs <- rep_len(1022:2022, n)
qtrs <- rep_len(1:4, n)
input <- sprintf("%dq%d", yrs, qtrs)

clocky <- function(x) {
    x <- stri_split_fixed(x, "q", n = 2L, simplify = TRUE)
    storage.mode(x) <- "integer"
    x <- year_quarter_day(x[,1L], x[,2L], 1L)
    as_date(x)
}

microbenchmark(yq(input), clocky(input), check = "identical")
#> Unit: milliseconds
#>           expr      min       lq     mean   median       uq      max neval
#>      yq(input) 162.3977 183.7156 208.0677 205.3677 224.6783 319.2763   100
#>  clocky(input) 701.2078 760.9052 849.4119 870.0091 935.9795 997.0503   100
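
One further idea, untested against the benchmark above: when the input contains many repeated strings, parsing only the distinct values and expanding the result with match() might reduce the string-manipulation cost. The sketch below uses hypothetical helper names (parse_yq, parse_unique) and a base R split that assumes a four-digit year followed by "q" and a one-digit quarter. Whether it actually beats clocky() depends on how many distinct values there are, so it would need measuring.

```r
library(clock)

# Hypothetical parser: assumes input shaped like "2021q2".
parse_yq <- function(x) {
  year <- as.integer(substr(x, 1L, 4L))
  quarter <- as.integer(substr(x, 6L, 6L))
  as_date(year_quarter_day(year, quarter, 1L))
}

# Hypothetical wrapper: parse each distinct string once,
# then map the results back onto the original positions.
parse_unique <- function(x, parser) {
  ux <- unique(x)
  parser(ux)[match(x, ux)]
}

input <- c("2021q2", "2021q3", "2021q2")
res <- parse_unique(input, parse_yq)
res
#> [1] "2021-04-01" "2021-07-01" "2021-04-01"
```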