Closed DavisVaughan closed 4 years ago
Thanks for sharing. Really nice to see a high quality and clean code like that.
There are a few considerations that make me cautious regarding the merger of he two packages:
timechange
is originally reserved for time manipulation and I never thought of adding accessors, In retrospect it makes sense though - where is time_update
there should be time_get
. As the api is relatively low level I don't want to add more specific time_year
etc.
timerip
accessors, if they are, it wouldn't make sense to merge things.timechange
is C++ with relatively heavy dependencies (Rcpp and external CCTZ) which require C++11. Timerip on the other hand is plain C and idiomatic R's C level api. The merger of these two worlds doesn't quite feel right conceptually and from the maintenance prospective. It's a loss of functionality on your side for sure.as.POSIXlt
once and reusing it for all components. So it's an open question how much speed-up one achieve by moving to C level accessors in a package like lubridate
for example.Ok, I have put it to the test. There is new function time_get
in the master.
I think something is not quite right in the current master of timerip
as I don't get the advertised speed-up over the as.POSIXlt
:
library(microbenchmark)
x <- as.POSIXct("2019-01-01", tz = "EST")
x <- rep(x, 1e5)
microbenchmark(y = timechange::time_get(x, "year"),
w = timechange::time_get(x, "wday"),
s = timechange::time_get(x, "second"),
yhs = timechange::time_get(x, c("year", "hour", "second")),
all = timechange::time_get(x),
ry = timerip::rip_year(x),
rw = timerip::rip_wday(x),
rs = timerip::rip_second(x),
rall = timerip::rip_info(x),
py = as.POSIXlt(x)$year + 1900L,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> y 9.505887 9.648305 10.30370 9.924143 10.17600 13.79554 10
#> w 10.416269 10.587290 10.79499 10.753359 11.06534 11.35429 10
#> s 9.742561 9.907821 10.20154 10.158917 10.52584 10.71564 10
#> yhs 9.979397 10.287468 10.66156 10.517093 10.65762 12.56139 10
#> all 10.083030 10.978213 11.10903 11.016203 11.18269 12.59788 10
#> ry 26.959547 27.226135 27.59642 27.440485 28.00330 28.48657 10
#> rw 27.089332 27.368163 27.82962 27.625056 27.98834 29.13959 10
#> rs 27.339093 27.420826 28.24613 27.734330 28.42593 32.21353 10
#> rall 27.390400 27.981544 28.45941 28.240499 29.18063 29.90935 10
#> py 22.775456 23.734271 23.94684 23.842144 24.45706 25.21952 10
Created on 2019-12-03 by the reprex package (v0.3.0)
The real test is with random timestamps far from the origin, on which timerip
is extremely slow:
library(microbenchmark)
x <- .POSIXct(runif(1e5, -17987443200, 32503680000)) # random times between 1400 and 3000
tt <- timechange::time_get(x, c("year"))
microbenchmark(y = timechange::time_get(x, "year"),
w = timechange::time_get(x, "wday"),
s = timechange::time_get(x, "second"),
yhs = timechange::time_get(x, c("year", "hour", "second")),
all = timechange::time_get(x),
ry = timerip::rip_year(x),
rw = timerip::rip_wday(x),
rs = timerip::rip_second(x),
rall = timerip::rip_info(x),
py = as.POSIXlt(x)$year + 1900L,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> y 22.16492 22.77200 22.98730 22.85186 23.30161 23.74395 10
#> w 23.29731 23.37836 24.17356 23.90144 24.43394 27.51863 10
#> s 22.68502 22.83982 23.08590 23.12552 23.22887 23.54638 10
#> yhs 22.79128 22.90198 23.55789 23.63533 23.99564 24.29647 10
#> all 23.00293 23.73563 24.04051 24.15607 24.31013 24.89092 10
#> ry 1340.99094 1346.55909 1353.36086 1353.24087 1358.39953 1366.57455 10
#> rw 1343.09022 1355.21451 1359.50746 1359.72767 1364.75275 1374.48283 10
#> rs 1348.51004 1349.60933 1354.53464 1354.85124 1356.29491 1364.76602 10
#> rall 1341.37043 1352.12772 1355.41210 1355.89797 1358.77877 1367.20003 10
#> py 52.38996 52.89831 53.40425 53.23054 53.92430 54.61009 10
Created on 2019-12-03 by the reprex package (v0.3.0)
What OS do you run on?
Ubuntu Linux. R.3.6.1
As an FYI, on my Mac I get this. R 3.6.1 as well.
There are 2 paths that can be taken in the R core code, controlled by a number of #define
s that are set at R build time, and then are unset after R is built. One path is faster than the other, I think. I had to manually set them using my best judgement, but I think that I may have set one incorrectly for linux.
library(microbenchmark)
x <- as.POSIXct("2019-01-01", tz = "EST")
x <- rep(x, 1e5)
microbenchmark(y = timechange::time_get(x, "year"),
w = timechange::time_get(x, "wday"),
s = timechange::time_get(x, "second"),
yhs = timechange::time_get(x, c("year", "hour", "second")),
all = timechange::time_get(x),
ry = timerip::rip_year(x),
rw = timerip::rip_wday(x),
rs = timerip::rip_second(x),
rall = timerip::rip_info(x),
py = as.POSIXlt(x)$year + 1900L,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> y 8.239911 8.315272 8.647688 8.454186 8.554320 10.771905 10
#> w 9.078737 9.589134 11.662359 9.788919 13.077321 19.313010 10
#> s 8.447826 8.638082 8.798481 8.741196 8.862062 9.623898 10
#> yhs 8.505015 8.777103 9.242635 9.008297 9.267691 11.352840 10
#> all 11.209366 11.308421 12.235255 11.879028 12.296999 16.553384 10
#> ry 2.667725 2.747954 2.806370 2.791454 2.852032 2.934191 10
#> rw 2.632267 2.734911 2.889427 2.848997 2.894748 3.405966 10
#> rs 2.707203 2.776108 3.235224 2.902864 3.000395 6.437684 10
#> rall 3.048488 3.171785 3.575233 3.302906 3.976357 4.542910 10
#> py 6.908238 7.068018 8.745726 7.982962 8.416155 17.967891 10
library(microbenchmark)
x <- .POSIXct(runif(1e5, -17987443200, 32503680000)) # random times between 1400 and 3000
tt <- timechange::time_get(x, c("year"))
microbenchmark(y = timechange::time_get(x, "year"),
w = timechange::time_get(x, "wday"),
s = timechange::time_get(x, "second"),
yhs = timechange::time_get(x, c("year", "hour", "second")),
all = timechange::time_get(x),
ry = timerip::rip_year(x),
rw = timerip::rip_wday(x),
rs = timerip::rip_second(x),
rall = timerip::rip_info(x),
py = as.POSIXlt(x)$year + 1900L,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> y 17.397301 17.640277 18.791252 18.158133 19.078707 24.375170 10
#> w 18.395116 18.502897 18.929968 18.646758 18.963397 21.191565 10
#> s 17.397180 17.794876 18.035202 17.879380 18.361592 18.904551 10
#> yhs 17.614212 17.838461 19.375760 18.238518 18.907996 27.585019 10
#> all 20.115869 21.211748 21.584361 21.573282 22.018677 23.470699 10
#> ry 7.395926 7.405355 7.720167 7.614273 7.736554 8.684259 10
#> rw 7.432677 7.454923 7.690756 7.623355 7.774921 8.482829 10
#> rs 7.436187 7.648532 8.158388 8.015909 8.595669 9.603361 10
#> rall 7.672439 7.831959 8.177107 8.010528 8.452318 9.228359 10
#> py 11.258847 11.520631 13.513530 13.156909 13.510754 17.867760 10
Created on 2019-12-03 by the reprex package (v0.3.0.9000)
Ok, master should be much faster now. There were a few more defines that were important for the linux path (which mac and windows do not take).
I'm still impressed with how much faster it is on Mac, but I think this is mainly a side effect of as.POSIXlt()
itself being faster on Mac (because it takes a different code path than Linux). timerip's main advantage is that it allocates less memory than as.POSIXlt()
because it only returns the requested components. It still has to do everything else as.POSIXlt()
does for each time point.
I think I'd rather use the clean code of timechange + cctz, rather than the hacks I have to do with the defines in timerip. I'll sacrifice the minor loss in speed on Mac / Windows for the clarity of the timechange + cctz code
These are on a linux machine using rstudio cloud running R 3.6.0
library(microbenchmark)
x <- as.POSIXct("2019-01-01", tz = "EST")
x <- rep(x, 1e5)
microbenchmark(y = timechange::time_get(x, "year"),
w = timechange::time_get(x, "wday"),
s = timechange::time_get(x, "second"),
yhs = timechange::time_get(x, c("year", "hour", "second")),
all = timechange::time_get(x),
ry = timerip::rip_year(x),
rw = timerip::rip_wday(x),
rs = timerip::rip_second(x),
rall = timerip::rip_info(x),
py = as.POSIXlt(x)$year + 1900L,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> y 10.33122 10.40053 10.90243 10.48370 10.53850 14.93278 10
#> w 11.25051 11.42754 12.37583 11.47764 11.71305 19.49841 10
#> s 10.48388 10.49949 11.06680 10.72453 10.83986 14.70946 10
#> yhs 10.64473 10.70560 10.98846 10.75872 11.35871 11.54228 10
#> all 10.99305 11.05162 12.86075 11.66124 13.72362 19.00305 10
#> ry 19.08983 19.24744 21.73191 19.32687 27.29638 27.61661 10
#> rw 19.16762 19.26471 21.65013 19.29272 23.36653 27.19479 10
#> rs 19.18620 19.29554 21.94163 20.37703 22.33075 29.77025 10
#> rall 19.55985 19.57942 22.06606 21.02279 23.38845 27.62360 10
#> py 23.79847 23.95242 27.47313 26.09933 31.91405 34.63303 10
library(microbenchmark)
x <- .POSIXct(runif(1e5, -17987443200, 32503680000)) # random times between 1400 and 3000
tt <- timechange::time_get(x, c("year"))
#> Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
#> had status 1
microbenchmark(y = timechange::time_get(x, "year"),
w = timechange::time_get(x, "wday"),
s = timechange::time_get(x, "second"),
yhs = timechange::time_get(x, c("year", "hour", "second")),
all = timechange::time_get(x),
ry = timerip::rip_year(x),
rw = timerip::rip_wday(x),
rs = timerip::rip_second(x),
rall = timerip::rip_info(x),
py = as.POSIXlt(x)$year + 1900L,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> y 18.32006 18.34959 21.65287 18.75497 26.35889 26.43328 10
#> w 19.62843 19.70349 23.81573 23.82365 27.89134 28.03321 10
#> s 18.53282 18.57186 20.51319 18.62901 20.49316 26.72199 10
#> yhs 18.67558 18.74528 19.66308 18.85650 19.05841 26.81872 10
#> all 18.83336 19.04524 20.80003 19.30225 19.61294 27.12312 10
#> ry 21.22068 21.32387 24.38462 21.38685 29.37304 33.36503 10
#> rw 21.23682 21.27425 23.27597 21.42605 21.54439 33.35816 10
#> rs 21.39878 21.44119 23.65289 21.61233 25.61614 33.46295 10
#> rall 21.64612 21.90587 23.01110 22.47215 22.56772 29.87819 10
#> py 26.02348 26.04764 27.82952 26.72675 28.59333 35.14442 10
sacrifice the minor loss in speed on Mac / Windows
From your tests it's not really minor, 2.5x compared to cctz. But mostly, I am surprised that timechange is even slower than posix on your Mac.
Do you have a benchmark on Windows by any chance?
master should be much faster now.
Confirmed, problem solved. It's a bit strange that cctz for me is 2x faster than POSIX for the random case, but not for you.
library(microbenchmark)
x <- .POSIXct(runif(1e5, -17987443200, 32503680000)) # random times between 1400 and 3000
microbenchmark(y = timechange::time_get(x, "year"),
w = timechange::time_get(x, "wday"),
s = timechange::time_get(x, "second"),
yhs = timechange::time_get(x, c("year", "hour", "second")),
yhs_C = timechange:::C_time_get(x, c("year", "hour", "second")),
ry = timerip::rip_year(x),
rw = timerip::rip_wday(x),
rs = timerip::rip_second(x),
rall = timerip::rip_info(x),
py = as.POSIXlt(x)$year + 1900L,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> y 22.31338 22.53825 26.90498 23.11411 23.36045 59.84700 10
#> w 23.78350 24.34514 25.09953 24.80864 25.62998 27.49699 10
#> s 22.69449 22.88087 24.28282 23.25037 23.88304 32.11217 10
#> yhs 23.15676 23.31659 24.09488 23.83145 24.85213 25.64840 10
#> yhs_C 22.63915 22.96206 23.93158 23.75540 23.97445 28.04917 10
#> ry 48.05492 48.79874 52.35727 50.66284 53.08482 68.48755 10
#> rw 47.36542 47.56351 48.75848 47.85482 49.65152 51.81008 10
#> rs 47.68191 47.80944 51.16628 48.76389 50.48662 71.44718 10
#> rall 47.78194 48.71277 49.46981 49.24084 50.16917 52.66800 10
#> py 53.49913 53.73863 55.11080 54.18312 55.55905 60.79190 10
Created on 2019-12-03 by the reprex package (v0.3.0)
> Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
> had status 1
Hmm, never seen this one before. Your local hackery?
I've been working on exposing the guts of
as.POSIXlt()
in a more piecewise manner, so you have direct access to the year / month / etc components without the memory overhead of the other pieces.It's located here: https://github.com/DavisVaughan/timerip
The reference page is an easy place to start https://davisvaughan.github.io/timerip/reference/index.html
@hadley mentioned this would probably fit better here in timechange. I figured I'd open this issue as a starting place for talking about how that could happen / what the names of these should be if they are included in timechange instead
Possibly just
rip_year() -> time_year()
Notes:
datetime.c
,datetime.h
,localtime.c
,tzfile.h
and finallygetTZinfo.c
which extracts justgetTZinfo()
from a much larger file.rip_info()
is basicallyas.POSIXlt()
, but it gives the information back in the most useful form (1-based, andyear
would return2019
, not2019-1900=119
)