Pulling over components of timerip

I've been working on exposing the guts of as.POSIXlt() in a more piecewise manner, so you have direct access to the year / month / etc components without the memory overhead of the other pieces.

It's located here: https://github.com/DavisVaughan/timerip

The reference page is an easy place to start https://davisvaughan.github.io/timerip/reference/index.html

@hadley mentioned this would probably fit better here in timechange. I figured I'd open this issue as a starting place for talking about how that could happen / what the names of these should be if they are included in timechange instead

Possibly just rip_year() -> time_year()

Notes:

There is even an exposed C api with functions of the exact same name, so other packages can call these directly from their own C code.
It requires the inclusion and modification of a few files from R itself. datetime.c, datetime.h, localtime.c, tzfile.h and finally getTZinfo.c which extracts just getTZinfo() from a much larger file.
rip_info() is basically as.POSIXlt(), but it gives the information back in the most useful form (1-based, and year would return 2019, not 2019-1900=119)
Otherwise, no dependencies

Thanks for sharing. Really nice to see a high quality and clean code like that.

There are a few considerations that make me cautious regarding the merger of he two packages:

timechange is originally reserved for time manipulation and I never thought of adding accessors, In retrospect it makes sense though - where is time_update there should be time_get. As the api is relatively low level I don't want to add more specific time_year etc.
- It should be checked if plain cctz accessors would be comparable in speed to the timerip accessors, if they are, it wouldn't make sense to merge things.
- timechange is C++ with relatively heavy dependencies (Rcpp and external CCTZ) which require C++11. Timerip on the other hand is plain C and idiomatic R's C level api. The merger of these two worlds doesn't quite feel right conceptually and from the maintenance prospective. It's a loss of functionality on your side for sure.
- Finally, as practice with lubridate suggests, often multiple compoenets are required at the same time. So we end up doing as.POSIXlt once and reusing it for all components. So it's an open question how much speed-up one achieve by moving to C level accessors in a package like lubridate for example.

Ok, I have put it to the test. There is new function time_get in the master.

I think something is not quite right in the current master of timerip as I don't get the advertised speed-up over the as.POSIXlt:


library(microbenchmark)

x <- as.POSIXct("2019-01-01", tz = "EST")
x <- rep(x, 1e5)

microbenchmark(y = timechange::time_get(x, "year"),
               w = timechange::time_get(x, "wday"),
               s = timechange::time_get(x, "second"),
               yhs = timechange::time_get(x, c("year", "hour", "second")),
               all = timechange::time_get(x),
               ry = timerip::rip_year(x),
               rw = timerip::rip_wday(x),
               rs = timerip::rip_second(x),
               rall = timerip::rip_info(x),
               py = as.POSIXlt(x)$year + 1900L,
               times = 10)
#> Unit: milliseconds
#>  expr       min        lq     mean    median       uq      max neval
#>     y  9.505887  9.648305 10.30370  9.924143 10.17600 13.79554    10
#>     w 10.416269 10.587290 10.79499 10.753359 11.06534 11.35429    10
#>     s  9.742561  9.907821 10.20154 10.158917 10.52584 10.71564    10
#>   yhs  9.979397 10.287468 10.66156 10.517093 10.65762 12.56139    10
#>   all 10.083030 10.978213 11.10903 11.016203 11.18269 12.59788    10
#>    ry 26.959547 27.226135 27.59642 27.440485 28.00330 28.48657    10
#>    rw 27.089332 27.368163 27.82962 27.625056 27.98834 29.13959    10
#>    rs 27.339093 27.420826 28.24613 27.734330 28.42593 32.21353    10
#>  rall 27.390400 27.981544 28.45941 28.240499 29.18063 29.90935    10
#>    py 22.775456 23.734271 23.94684 23.842144 24.45706 25.21952    10

^{Created on 2019-12-03 by the reprex package (v0.3.0)}

The real test is with random timestamps far from the origin, on which timerip is extremely slow:


library(microbenchmark)

x <- .POSIXct(runif(1e5, -17987443200, 32503680000)) # random times between 1400 and 3000
tt <- timechange::time_get(x, c("year"))

microbenchmark(y = timechange::time_get(x, "year"),
               w = timechange::time_get(x, "wday"),
               s = timechange::time_get(x, "second"),
               yhs = timechange::time_get(x, c("year", "hour", "second")),
               all = timechange::time_get(x),
               ry = timerip::rip_year(x),
               rw = timerip::rip_wday(x),
               rs = timerip::rip_second(x),
               rall = timerip::rip_info(x),
               py = as.POSIXlt(x)$year + 1900L,
               times = 10)
#> Unit: milliseconds
#>  expr        min         lq       mean     median         uq        max neval
#>     y   22.16492   22.77200   22.98730   22.85186   23.30161   23.74395    10
#>     w   23.29731   23.37836   24.17356   23.90144   24.43394   27.51863    10
#>     s   22.68502   22.83982   23.08590   23.12552   23.22887   23.54638    10
#>   yhs   22.79128   22.90198   23.55789   23.63533   23.99564   24.29647    10
#>   all   23.00293   23.73563   24.04051   24.15607   24.31013   24.89092    10
#>    ry 1340.99094 1346.55909 1353.36086 1353.24087 1358.39953 1366.57455    10
#>    rw 1343.09022 1355.21451 1359.50746 1359.72767 1364.75275 1374.48283    10
#>    rs 1348.51004 1349.60933 1354.53464 1354.85124 1356.29491 1364.76602    10
#>  rall 1341.37043 1352.12772 1355.41210 1355.89797 1358.77877 1367.20003    10
#>    py   52.38996   52.89831   53.40425   53.23054   53.92430   54.61009    10

^{Created on 2019-12-03 by the reprex package (v0.3.0)}

What OS do you run on?

Ubuntu Linux. R.3.6.1

As an FYI, on my Mac I get this. R 3.6.1 as well.

There are 2 paths that can be taken in the R core code, controlled by a number of #defines that are set at R build time, and then are unset after R is built. One path is faster than the other, I think. I had to manually set them using my best judgement, but I think that I may have set one incorrectly for linux.

library(microbenchmark)

x <- as.POSIXct("2019-01-01", tz = "EST")
x <- rep(x, 1e5)

microbenchmark(y = timechange::time_get(x, "year"),
               w = timechange::time_get(x, "wday"),
               s = timechange::time_get(x, "second"),
               yhs = timechange::time_get(x, c("year", "hour", "second")),
               all = timechange::time_get(x),
               ry = timerip::rip_year(x),
               rw = timerip::rip_wday(x),
               rs = timerip::rip_second(x),
               rall = timerip::rip_info(x),
               py = as.POSIXlt(x)$year + 1900L,
               times = 10)
#> Unit: milliseconds
#>  expr       min        lq      mean    median        uq       max neval
#>     y  8.239911  8.315272  8.647688  8.454186  8.554320 10.771905    10
#>     w  9.078737  9.589134 11.662359  9.788919 13.077321 19.313010    10
#>     s  8.447826  8.638082  8.798481  8.741196  8.862062  9.623898    10
#>   yhs  8.505015  8.777103  9.242635  9.008297  9.267691 11.352840    10
#>   all 11.209366 11.308421 12.235255 11.879028 12.296999 16.553384    10
#>    ry  2.667725  2.747954  2.806370  2.791454  2.852032  2.934191    10
#>    rw  2.632267  2.734911  2.889427  2.848997  2.894748  3.405966    10
#>    rs  2.707203  2.776108  3.235224  2.902864  3.000395  6.437684    10
#>  rall  3.048488  3.171785  3.575233  3.302906  3.976357  4.542910    10
#>    py  6.908238  7.068018  8.745726  7.982962  8.416155 17.967891    10

library(microbenchmark)

x <- .POSIXct(runif(1e5, -17987443200, 32503680000)) # random times between 1400 and 3000
tt <- timechange::time_get(x, c("year"))

microbenchmark(y = timechange::time_get(x, "year"),
               w = timechange::time_get(x, "wday"),
               s = timechange::time_get(x, "second"),
               yhs = timechange::time_get(x, c("year", "hour", "second")),
               all = timechange::time_get(x),
               ry = timerip::rip_year(x),
               rw = timerip::rip_wday(x),
               rs = timerip::rip_second(x),
               rall = timerip::rip_info(x),
               py = as.POSIXlt(x)$year + 1900L,
               times = 10)
#> Unit: milliseconds
#>  expr       min        lq      mean    median        uq       max neval
#>     y 17.397301 17.640277 18.791252 18.158133 19.078707 24.375170    10
#>     w 18.395116 18.502897 18.929968 18.646758 18.963397 21.191565    10
#>     s 17.397180 17.794876 18.035202 17.879380 18.361592 18.904551    10
#>   yhs 17.614212 17.838461 19.375760 18.238518 18.907996 27.585019    10
#>   all 20.115869 21.211748 21.584361 21.573282 22.018677 23.470699    10
#>    ry  7.395926  7.405355  7.720167  7.614273  7.736554  8.684259    10
#>    rw  7.432677  7.454923  7.690756  7.623355  7.774921  8.482829    10
#>    rs  7.436187  7.648532  8.158388  8.015909  8.595669  9.603361    10
#>  rall  7.672439  7.831959  8.177107  8.010528  8.452318  9.228359    10
#>    py 11.258847 11.520631 13.513530 13.156909 13.510754 17.867760    10

^{Created on 2019-12-03 by the reprex package (v0.3.0.9000)}

Ok, master should be much faster now. There were a few more defines that were important for the linux path (which mac and windows do not take).

I'm still impressed with how much faster it is on Mac, but I think this is mainly a side effect of as.POSIXlt() itself being faster on Mac (because it takes a different code path than Linux). timerip's main advantage is that it allocates less memory than as.POSIXlt() because it only returns the requested components. It still has to do everything else as.POSIXlt() does for each time point.

I think I'd rather use the clean code of timechange + cctz, rather than the hacks I have to do with the defines in timerip. I'll sacrifice the minor loss in speed on Mac / Windows for the clarity of the timechange + cctz code

These are on a linux machine using rstudio cloud running R 3.6.0

library(microbenchmark)

x <- as.POSIXct("2019-01-01", tz = "EST")
x <- rep(x, 1e5)

microbenchmark(y = timechange::time_get(x, "year"),
               w = timechange::time_get(x, "wday"),
               s = timechange::time_get(x, "second"),
               yhs = timechange::time_get(x, c("year", "hour", "second")),
               all = timechange::time_get(x),
               ry = timerip::rip_year(x),
               rw = timerip::rip_wday(x),
               rs = timerip::rip_second(x),
               rall = timerip::rip_info(x),
               py = as.POSIXlt(x)$year + 1900L,
               times = 10)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>     y 10.33122 10.40053 10.90243 10.48370 10.53850 14.93278    10
#>     w 11.25051 11.42754 12.37583 11.47764 11.71305 19.49841    10
#>     s 10.48388 10.49949 11.06680 10.72453 10.83986 14.70946    10
#>   yhs 10.64473 10.70560 10.98846 10.75872 11.35871 11.54228    10
#>   all 10.99305 11.05162 12.86075 11.66124 13.72362 19.00305    10
#>    ry 19.08983 19.24744 21.73191 19.32687 27.29638 27.61661    10
#>    rw 19.16762 19.26471 21.65013 19.29272 23.36653 27.19479    10
#>    rs 19.18620 19.29554 21.94163 20.37703 22.33075 29.77025    10
#>  rall 19.55985 19.57942 22.06606 21.02279 23.38845 27.62360    10
#>    py 23.79847 23.95242 27.47313 26.09933 31.91405 34.63303    10

library(microbenchmark)

x <- .POSIXct(runif(1e5, -17987443200, 32503680000)) # random times between 1400 and 3000
tt <- timechange::time_get(x, c("year"))
#> Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
#> had status 1

microbenchmark(y = timechange::time_get(x, "year"),
               w = timechange::time_get(x, "wday"),
               s = timechange::time_get(x, "second"),
               yhs = timechange::time_get(x, c("year", "hour", "second")),
               all = timechange::time_get(x),
               ry = timerip::rip_year(x),
               rw = timerip::rip_wday(x),
               rs = timerip::rip_second(x),
               rall = timerip::rip_info(x),
               py = as.POSIXlt(x)$year + 1900L,
               times = 10)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>     y 18.32006 18.34959 21.65287 18.75497 26.35889 26.43328    10
#>     w 19.62843 19.70349 23.81573 23.82365 27.89134 28.03321    10
#>     s 18.53282 18.57186 20.51319 18.62901 20.49316 26.72199    10
#>   yhs 18.67558 18.74528 19.66308 18.85650 19.05841 26.81872    10
#>   all 18.83336 19.04524 20.80003 19.30225 19.61294 27.12312    10
#>    ry 21.22068 21.32387 24.38462 21.38685 29.37304 33.36503    10
#>    rw 21.23682 21.27425 23.27597 21.42605 21.54439 33.35816    10
#>    rs 21.39878 21.44119 23.65289 21.61233 25.61614 33.46295    10
#>  rall 21.64612 21.90587 23.01110 22.47215 22.56772 29.87819    10
#>    py 26.02348 26.04764 27.82952 26.72675 28.59333 35.14442    10

sacrifice the minor loss in speed on Mac / Windows

From your tests it's not really minor, 2.5x compared to cctz. But mostly, I am surprised that timechange is even slower than posix on your Mac.

Do you have a benchmark on Windows by any chance?

master should be much faster now.

Confirmed, problem solved. It's a bit strange that cctz for me is 2x faster than POSIX for the random case, but not for you.


library(microbenchmark)

x <- .POSIXct(runif(1e5, -17987443200, 32503680000)) # random times between 1400 and 3000

microbenchmark(y = timechange::time_get(x, "year"),
               w = timechange::time_get(x, "wday"),
               s = timechange::time_get(x, "second"),
               yhs = timechange::time_get(x, c("year", "hour", "second")),
               yhs_C = timechange:::C_time_get(x, c("year", "hour", "second")),
               ry = timerip::rip_year(x),
               rw = timerip::rip_wday(x),
               rs = timerip::rip_second(x),
               rall = timerip::rip_info(x),
               py = as.POSIXlt(x)$year + 1900L,
               times = 10)
#> Unit: milliseconds
#>   expr      min       lq     mean   median       uq      max neval
#>      y 22.31338 22.53825 26.90498 23.11411 23.36045 59.84700    10
#>      w 23.78350 24.34514 25.09953 24.80864 25.62998 27.49699    10
#>      s 22.69449 22.88087 24.28282 23.25037 23.88304 32.11217    10
#>    yhs 23.15676 23.31659 24.09488 23.83145 24.85213 25.64840    10
#>  yhs_C 22.63915 22.96206 23.93158 23.75540 23.97445 28.04917    10
#>     ry 48.05492 48.79874 52.35727 50.66284 53.08482 68.48755    10
#>     rw 47.36542 47.56351 48.75848 47.85482 49.65152 51.81008    10
#>     rs 47.68191 47.80944 51.16628 48.76389 50.48662 71.44718    10
#>   rall 47.78194 48.71277 49.46981 49.24084 50.16917 52.66800    10
#>     py 53.49913 53.73863 55.11080 54.18312 55.55905 60.79190    10

^{Created on 2019-12-03 by the reprex package (v0.3.0)}

> Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'

> had status 1

Hmm, never seen this one before. Your local hackery?

vspinu / timechange

Pulling over components of timerip #9

> Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'

> had status 1