joshuaulrich / xts

Extensible time series class that provides uniform handling of many R time series classes by extending zoo.
http://joshuaulrich.github.io/xts/
GNU General Public License v2.0
220 stars 71 forks source link

findInterval in window_idx #282

Closed harvey131 closed 5 years ago

harvey131 commented 5 years ago

Description

window_idx uses findInterval() on a vector of POSIXt objects. findInterval is slower when passed POSIXt objects. You could try converting the arguments to numeric first, or use a simple Rcpp function.

Minimal, reproducible example

Rcpp::cppFunction('IntegerVector findInterval2(const NumericVector x, const NumericVector vec) 
{ 
   IntegerVector xout(x.size());
   for (int i = 0; i < x.size(); ++i) {
     if (!NumericVector::is_na(x[i])) xout[i] = std::upper_bound(vec.begin(), vec.end(), x[i]) - vec.begin();
   }
   return xout;
}')
# vector of integers
x = c(123,1,5,121,200,3,-13,11111)
vec = -1e6:1e6L
# vector of POSIXct
x2 <- structure(x, class = c("POSIXct", "POSIXt"), tzone = '')
vec2 <- structure(vec, class = c("POSIXct", "POSIXt"), tzone = '')
# microbenchmark
my_check <- function(values) { all(sapply(values[-1], function(x) identical(values[[1]], x))) }
microbenchmark::microbenchmark(
  # on numeric R and Rcpp are similar
  findInterval(x, vec),
  findInterval2(x, vec),
  # on POSIXt Rcpp is faster than R
  findInterval(x2, vec2),
  findInterval(as.numeric(x2), as.numeric(vec2)), # convert to numeric first
  findInterval2(x2, vec2), # Rcpp
  times=100, check = my_check)

Session Info

> version                      
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          5.1                         
year           2018                        
month          07                          
day            02                          
svn rev        74947                       
language       R                           
version.string R version 3.5.1 (2018-07-02)
nickname       Feather Spray               
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] RevoUtils_11.0.1     RevoUtilsMath_11.0.0
loaded via a namespace (and not attached):
[1] microbenchmark_1.4-4 compiler_3.5.1       tools_3.5.1          Rcpp_0.12.18        
joshuaulrich commented 5 years ago

Thanks for the suggestion! This is interesting, but I'm not sure how much of a difference it would make to the entire window_idx() function. Have you run your benchmarks after replacing the findInterval() call in window_idx() with your findInterval2() function?

joshuaulrich commented 5 years ago

From my investigation, I don't see much benefit from the Rcpp code relative to findInterval() in window_idx(). Timings are similar once you account for the is.unsorted() check in findInterval().

Also, the findInterval() call in window_idx() already passes a numeric vec (the underlying index attribute of the xts object, i.e. the UTC Unix time). So your benchmarks appear to be measuring something that doesn't currently occur.

findInterval() appears to be faster for larger x objects. IIRC, this is because findInterval() does some very fancy caching of the last location found and uses that value when looking for the next value in x.

I'm going to close this now, since it doesn't look like there's much performance gain to be gained from any changes. We can re-open if I've missed something.

Here is my code and benchmarks:

# Add sorted check from findInterval() to findInterval2()
findInterval3 <- findInterval2
body(findInterval3) <- bquote({ .(body(findInterval)[-3])
                                .(body(findInterval3)) })

mb <- microbenchmark::microbenchmark
window_idx <- xts:::window_idx
v <- -1e6:1e6L * 1.0

set.seed(21)
i <- sort(1.0 * sample(2e6, 1e4) - 1e6)
p <- .POSIXct(i)

# Similar timings for findInterval() and findInterval3() for small 'x'
mb(findInterval(p, v), findInterval2(p, v), findInterval3(p, v), times = 3)
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval
#   findInterval(p, v) 7.848964 7.923046 7.954549 7.997129 8.007342 8.017554     3
#  findInterval2(p, v) 1.853770 1.884019 1.931684 1.914268 1.970642 2.027015     3
#  findInterval3(p, v) 7.786190 7.991259 8.072712 8.196328 8.215973 8.235618     3

set.seed(21)
i <- sort(1.0 * sample(2e6, 1e6) - 1e6)
p <- .POSIXct(i)

# findInterval() much faster than findInterval3() for larger 'x'
mb(findInterval(p, v), findInterval2(p, v), findInterval3(p, v), times = 3)
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval
#   findInterval(p, v) 21.65893 21.72044 21.82681 21.78195 21.91074 22.03954     3
#  findInterval2(p, v) 74.44354 74.74377 75.23732 75.04399 75.63420 76.22441     3
#  findInterval3(p, v) 80.47909 80.82728 81.05312 81.17547 81.34014 81.50481     3
harvey131 commented 5 years ago

The difference in findInterval will not be apparent if the vec "v" argument is a vector of integers instead of POSIX as in your example. If you put aside the Rcpp implementation, and use .Internal(findInterval,...) you can see a benefit to not call is.unsorted twice.

We can create a version of window_idx with an additional argument implementation, which will allow us to select which findInterval we use, I can see a difference in window_idx. (see window_idx2 function at bottom) We can separately benchmark x[i,] and see that it is similar to window_idx(x, i), so we would expect an improvement in window_idx(x, i) to help when subsetting and the index. argument is time based.

# create xts 
n <- 1e6L
x <- .xts(runif(n, 1, n), 1:n)
# index is a vector of POSIXt that do not exist in index(x)
i = sample(index(x), 100) - 1
microbenchmark::microbenchmark(
  xts:::window_idx(x=x , index. = i),
  window_idx2(x=x , index. = i, implementation=1),
  window_idx2(x=x , index. = i, implementation=2),
  times=100, check = my_check, unit='ms')

Unit: milliseconds
                                               expr      min       lq     mean   median       uq       max neval
                xts:::window_idx(x = x, index. = i) 2.071561 2.115609 2.877276 2.259870 2.326585 10.159936   100
 window_idx2(x = x, index. = i, implementation = 1) 2.064433 2.115324 3.098633 2.254026 2.312900 11.293790   100
 window_idx2(x = x, index. = i, implementation = 2) 1.147256 1.180756 1.766360 1.228652 1.325446  9.157513   100

is.unsorted is called twice, once in window_idx and then again in findInterval. However, is.unsorted() is very slow when x is a POSIXct vector as opposed to a numeric vector.

# vector of integers
x = c(123.5,1.1,5.01,121.9,200,3,-13.1,11111)
vec = -1e6:1e6L
# vector of POSIXct
x2 <- structure(x, class = c("POSIXct", "POSIXt"), tzone = '')
vec2 <- structure(vec, class = c("POSIXct", "POSIXt"), tzone = '') + 1e-3
my_check <- function(values) { all(sapply(values[-1], function(x) identical(values[[1]], x))) }
microbenchmark::microbenchmark(
  is.unsorted(vec),
  is.unsorted(vec2),
  .Internal(is.unsorted(vec, FALSE)),
  .Internal(is.unsorted(vec2, FALSE)),
  !xts:::isOrdered(vec2),
  times=100, check = my_check, unit='ms')

Unit: milliseconds
                                expr      min        lq       mean   median        uq       max neval
                    is.unsorted(vec) 0.000286 0.0011415 0.00506447 0.006273 0.0071285  0.010835   100
                   is.unsorted(vec2) 5.572912 5.8915145 7.00569718 6.072412 6.1874510 17.140685   100
  .Internal(is.unsorted(vec, FALSE)) 0.000000 0.0000020 0.00156908 0.001997 0.0025665  0.004563   100
 .Internal(is.unsorted(vec2, FALSE)) 1.650178 1.7398430 1.80685917 1.793442 1.8540270  2.496934   100
              !xts:::isOrdered(vec2) 2.618388 3.0111180 5.10274515 3.136135 7.0064105 16.162781   100
window_idx2 = function (x, index. = NULL, start = NULL, end = NULL, implementation) 
{
  if (is.null(index.)) {
    usr_idx <- FALSE
    index. <- .index(x)
  }
  else {
    usr_idx <- TRUE
    idx <- .index(x)
    index. <- xts:::.toPOSIXct(index., indexTZ(x))
    index. <- index.[!is.na(index.)]
    if (is.unsorted(index.)) {
      index. <- sort(index.)
    }

    if (implementation == 1) {
      base_idx <- findInterval(index., idx)
    } else if (implementation == 2) {
      base_idx <- .Internal(findInterval(as.double(idx), as.double(index.), F, F, F))
    }

    base_idx <- pmax(base_idx, 1L)
    match <- idx[base_idx] == index.
    base_idx <- base_idx[match]
    index. <- index.[match]
    if (length(base_idx) < 1) 
      return(x[NULL, ])
  }
  if (!is.null(start)) {
    start <- .toPOSIXct(start, indexTZ(x))
  }
  if (!is.null(end)) {
    end <- .toPOSIXct(end, indexTZ(x))
  }
  firstlast <- xts:::index_bsearch(index., start, end)
  if (usr_idx && !is.null(firstlast)) {
    tmp <- base_idx[firstlast]
    res <- .Call("fill_window_dups_rev", tmp, .index(x), PACKAGE = "xts")
    firstlast <- rev(res)
  }
  firstlast
}
joshuaulrich commented 5 years ago

Thanks for your thoughts. I appreciate your analysis. The major hurdle is that we cannot include calls to .Internal() for CRAN packages. It is against the CRAN policy.

I'm a bit surprised that the calls to is.unsorted() are so expensive. The new ALTREP structure is supposed to keep track of whether a vector is sorted and/or has any NA values. That should make the second call to is.unsorted() a no-op. It looks like that may be true for regular vectors, but not classed vectors.

joshuaulrich commented 5 years ago

Thanks for bringing this up on R-devel (unsorted - suggestion for performance improvement and ALTREP support for POSIXct) @harvey131!