rbchan / unmarked

R package for hierarchical models in ecological research
https://rbchan.github.io/unmarked/

Suggested changes in predict() for 50x speedup #211

Closed · jniedballa closed this 1 year ago

jniedballa commented 2 years ago

Dear unmarked maintainers, predict() can be very slow, most likely because it loops over the rows of the prediction data frame one by one. I suggest a modification to predict() which, in my tests, achieved a ~50x speedup over the current method (for a single-species, single-season occupancy model). Example below.

Create data and model

Let's use the crossbill example (modified to make it a single-season model for demonstration):

library(unmarked)

data(crossbill)
y.cross <- as.matrix(crossbill[, 5:31])
umf <- unmarkedFrameOccu(y = y.cross,
                         siteCovs = crossbill[, 2:3])
(fm <- occu(~ 1 ~ ele + forest, umf))

And a prediction data frame with 10000 rows:

nrows <- 10000
df_predict <- data.frame(intercept = 1, 
                         ele = seq(0, 2200, length.out = nrows),
                         forest = seq(0, 100, length.out = nrows))

predict()

Normal predict() takes about 15 seconds:

 system.time(
 out_pred <- predict(fm, df_predict, type = "state")
 )
  doing row 1000 of 10000 
  doing row 2000 of 10000 
...
  doing row 10000 of 10000 
   user  system elapsed 
  14.91    0.00   15.28 

vectorized (slow)

All functions that are used inside the prediction loop accept vectorized input (matrices), so it is possible to avoid the loop. With many rows, however, it becomes extremely slow. Example:

nrows <- 200
system.time({
  df_predict <- as.matrix(data.frame(intercept = 1,
                                     ele = seq(0, 2200, length.out = nrows),
                                     forest = seq(0, 100, length.out = nrows)))

  lc <- linearComb(fm, df_predict, type = "state")
  lc <- backTransform(lc)

  out <- data.frame(matrix(NA, nrow(df_predict), 4,
                           dimnames = list(NULL, c("Predicted", "SE", "lower", "upper"))))
  out$Predicted <- coef(lc)
  out$SE <- SE(lc)
  ci <- confint(lc, level = 0.95)
  out$lower <- ci[, 1]
  out$upper <- ci[, 2]
})

By varying nrows, I found that there are considerable gains to be made by predicting on batches instead of individual rows, but that performance degrades again when a single batch contains too many rows.

vectorized (fast)

To take advantage of the speed gains from vectorization while avoiding the poor performance on many rows, I suggest splitting the prediction data frame into batches and looping over these batches. Example:

# batchwise predict method
level <- 0.95
batch_size <- 50   # optimum seems to be between 50 and 100

# recreate the prediction data frame with 10000 rows
nrows <- 10000
df_predict <- data.frame(intercept = 1,
                         ele = seq(0, 2200, length.out = nrows),
                         forest = seq(0, 100, length.out = nrows))

# split prediction data frame into batch-sized data frames (in a list);
# ceiling() over the row index assigns each row to a batch and lets the
# last batch be smaller than the others
df_predict_grouping <- ceiling(seq_len(nrow(df_predict)) / batch_size)
df_predict_split <- split(df_predict, df_predict_grouping)
df_predict_split <- lapply(df_predict_split, as.matrix)

system.time({
  out <- list()

  # loop over batch-sized data frames
  for (i in seq_along(df_predict_split)) {
    lc <- linearComb(fm, df_predict_split[[i]], type = "state")
    lc <- backTransform(lc)

    out_tmp <- data.frame(matrix(NA, nrow(df_predict_split[[i]]), 4,
                                 dimnames = list(NULL, c("Predicted", "SE", "lower", "upper"))))
    out_tmp$Predicted <- coef(lc)
    out_tmp$SE <- SE(lc)
    ci <- confint(lc, level = level)
    out_tmp$lower <- ci[, 1]
    out_tmp$upper <- ci[, 2]

    out[[i]] <- out_tmp
  }
  out <- do.call(rbind, out)
})
   user  system elapsed 
   0.26    0.00    0.26 

Results are identical:

all.equal(out, out_pred)
[1] TRUE

but it only took 0.26 sec instead of 15 sec. It seemed to be fastest with batch sizes between 50 and 100.
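
For illustration, a rough way to scan for a good batch size on a given machine is to wrap the loop above in a helper and time it across candidate sizes (a sketch; predict_batched() is just an illustrative name, not part of unmarked):

# illustrative wrapper around the batching loop above
predict_batched <- function(fm, newdata, batch_size, type = "state") {
  grouping <- ceiling(seq_len(nrow(newdata)) / batch_size)
  batches <- lapply(split(newdata, grouping), as.matrix)
  res <- lapply(batches, function(b) {
    lc <- backTransform(linearComb(fm, b, type = type))
    ci <- confint(lc, level = 0.95)
    data.frame(Predicted = coef(lc), SE = SE(lc),
               lower = ci[, 1], upper = ci[, 2])
  })
  do.call(rbind, res)
}

# rough timing scan over candidate batch sizes
for (bs in c(10, 50, 100, 500, 1000)) {
  t <- system.time(predict_batched(fm, df_predict, bs))["elapsed"]
  cat("batch size", bs, ":", round(t, 2), "sec\n")
}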

I imagine the method should generalize well to rasters and other model classes, but I didn't test that. The batch index is computed so that a smaller final batch is handled without complaints from split(); it's just for demonstration after all.

Code above is based on https://github.com/rbchan/unmarked/blob/master/R/unmarkedFit.R, lines 387-404.

Any way of speeding up predict() would be welcome. As much as I love unmarked, predict() being very slow has always been my main problem with the package (especially when predicting on rasters). Thank you for your work on the package. Best regards, Jürgen

rbchan commented 2 years ago

@jniedballa Thank you for the suggested improvement. @kenkellner Is this something you could make time for? I agree that predict is too slow right now, and it fails with huge rasters. It seems like the batch processing would work well, and it could be done in parallel.

kenkellner commented 2 years ago

Thanks @jniedballa, this is very helpful. Definitely agreed on the speed issues with predict. I'll look into this. It doesn't really make sense to me why it is faster to do the linearComb - backTransform - SE sequence in batches instead of all at once. There's a series of matrix multiplications happening in there - maybe there's an issue with those once the number of rows gets large enough?

kenkellner commented 2 years ago

It does seem to be the matrix multiplication in linearComb:

https://github.com/rbchan/unmarked/blob/08f47c4eef1bd9fc8011ed4e523cbffdfd46525c/R/unmarkedEstimate.R#L178

Batching this is faster than doing the multiplication all at once. It doesn't help to divide the calculation up into parts either. I tried doing the multiplication in C++ instead and it was even slower (not surprising I guess, I'm sure base R is doing matrix calculations in C)
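
One plausible explanation, sketched below with stand-in objects rather than unmarked's actual internals: if the SE step forms the full covariance matrix of all the linear combinations, X %*% V %*% t(X) is an n x n matrix even though only its diagonal is needed, so time and memory grow quadratically with the number of prediction rows. Batching keeps each intermediate small; computing only the diagonal avoids the blowup entirely:

n <- 5000
X <- cbind(1, rnorm(n), rnorm(n))   # stand-in design matrix (n rows, 3 coefficients)
V <- diag(3) * 0.01                 # stand-in 3 x 3 coefficient covariance

system.time(se_full <- sqrt(diag(X %*% V %*% t(X))))  # builds the full n x n matrix
system.time(se_diag <- sqrt(rowSums((X %*% V) * X)))  # same diagonal, no n x n object
all.equal(se_full, se_diag)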

I think it makes the most sense to implement the batching loop in linearComb and not in predict since there are many individual predict methods but most of them call linearComb at some point.

jniedballa commented 2 years ago

I'm glad you're positive about the suggestion.

You would know best how to implement it; the multitude of methods in unmarked and how they are all implemented is a bit over my head. Your idea sounds reasonable, though.

But even if you modify linearComb to loop over batches, you'd still need to change every predict method, since currently they all loop over the rows of the prediction data frame, which wouldn't be necessary anymore.

And of course there also needs to be a way of taking care of rows with NAs. In the suggested method, a single NA in the prediction data frame causes the SE of the entire batch to be NA.
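
A minimal sketch of one way around this (illustrative only, not how unmarked would necessarily implement it): predict on the complete cases and re-insert the NA rows afterwards, which also keeps NAs from contaminating a batch:

ok <- complete.cases(df_predict)             # rows without NAs
out <- data.frame(Predicted = rep(NA_real_, nrow(df_predict)),
                  SE = NA_real_, lower = NA_real_, upper = NA_real_)
if (any(ok)) {
  lc <- backTransform(linearComb(fm, as.matrix(df_predict[ok, ]), type = "state"))
  ci <- confint(lc, level = 0.95)
  out[ok, ] <- cbind(coef(lc), SE(lc), ci[, 1], ci[, 2])
}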

kenkellner commented 2 years ago

Ah, good points. This is going to be a bit more complicated than I thought. There is a huge amount of duplicated code across predict methods, so maybe it's finally time for me to re-write things in a standard way.

jniedballa commented 2 years ago

Looking at the predict methods, it might be rather straightforward for:

More difficult / non-standard for:

I don't know about these two; they call predict again internally.

My first impression is that it might be difficult to find a single standard method for all of these model types (but then I'm looking at the code for the first time).

Just speeding up predictions for the first four alone would be a huge help already. And I can imagine that these four are the most commonly used model types in unmarked anyway.

kenkellner commented 2 years ago

Yeah, maybe it's best to speed things up for a few models quickly. I do think some kind of predict overhaul is needed though.

kenkellner commented 2 years ago

Working on this here

jniedballa commented 2 years ago

Hi Ken, thank you for your work on this. A colleague tested the latest version and found the prediction speed has improved significantly, especially for data frames with <1 million rows. His results were:

20k rows: 0.5 s (was 30 s)
1 million rows: 55 s (used to take hours)
10 million rows: more than half an hour (was half a day)

Just rough numbers, but it shows that the method works in a real-world example. There seems to be a bit of an overhead with big data frames (memory-related?), but overall great to see the improved speed compared to the old predict function.

kenkellner commented 2 years ago

Thanks for testing it and glad to hear it is an improvement. When you and your colleagues are predicting on these large datasets (rasters?) are you typically actually using the calculated SE and CIs? I wonder if it's worth having an option to skip those calculations which I assume would speed things up even more.

jniedballa commented 2 years ago

The tests were done on data frames, not rasters. Would it make a difference? I thought predict() converts rasters to data frames first. But I agree it would be good to test nevertheless; we'll do that.

And my colleague and I agree that an option to skip SEs/CIs in exchange for even faster predictions would be valuable. When predict was too slow in the past, I often took the raster values and model coefficients, applied the model formula manually and put the predicted values back into a raster. It was of course super fast and didn't require any loop at all. So yes, that would be great to have as a convenient option directly in predict().
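
For concreteness, a rough sketch of that manual, SE-free approach for the occupancy model fm above (assuming the covariate columns are ordered to match the coefficients; plogis() is the inverse-logit link):

X <- as.matrix(df_predict[, c("intercept", "ele", "forest")])
beta <- coef(fm, type = "state")          # state (occupancy) coefficients
psi_hat <- plogis(as.vector(X %*% beta))  # predicted occupancy, no SE/CI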

The help of stats::predict also says: "Many methods have a logical argument se.fit saying if standard errors are to [be] returned."

kenkellner commented 2 years ago

Yes it should work the same either way, I was just speculating that the reason you had a million-row data frame was because you were doing some kind of spatial/raster work. And in my experience the SEs aren't as useful in that context.

I'll work on adding an se argument. I think some of the predict methods I added recently already have one.

jniedballa commented 2 years ago

Yes, it was a raster data set, but converted to a data frame before passing it to predict (for easier testing of different numbers of rows). We'll do it again using the original raster as input and let you know how it compares. I agree that SEs are often not very useful in spatial predictions, and I usually use CIs instead. Would you also make calculation of CIs optional?

jniedballa commented 2 years ago

Performance is virtually identical for the data frame and the raster method. Using the crossbill example from above, I get (for both data frame and raster):

10000 rows / cells: 0.3 sec
100000 rows / cells: 3.8 sec
1000000 rows / cells: 90 sec
2000000 rows / cells: 280 sec

As in my colleague's real-life example, it seems to get progressively slower with more cells (of course it is still lightning fast compared to before). And it's great there is no penalty for spatial work.

Here is some quick and dirty code used for testing:

library(raster)

nrows <- 1000000

# pepper in some NAs
number_nas <- 10000
which_nas <- sample(1:nrows, number_nas)

df_predict <- data.frame(intercept = 1, 
                         ele = seq(0, 2200, length.out = nrows),
                         forest = seq(0, 100, length.out = nrows))

df_predict$forest[which_nas] <- NA

r_predict1 <- raster(matrix(df_predict$ele, nrow = 1000, ncol = 1000))
r_predict2 <- raster(matrix(rev(df_predict$forest), nrow = 1000, ncol = 1000))

stack_predict <- stack(list(ele = r_predict1, 
                            forest = r_predict2))

system.time( 
  prediction_from_df <- predict(fm, df_predict, type = "state")
)

system.time(
  prediction_from_stack <- predict(fm, stack_predict, type = "state")
)

Memory doesn't seem to be an issue (around 900 MB for RStudio when testing with 1 million cells). So big rasters or data frames can probably be processed in parallel (which would require tiling / splitting them), with the results merged after processing. Not sure if it makes sense to include that in predict(), or to leave it to the user to implement if they need the extra performance.
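
A user-side version might look like the sketch below (mclapply forks, so on Windows parallel::parLapply with a cluster would be needed instead): split the prediction data frame into one chunk per core, predict in parallel, and rbind the results:

library(parallel)

n_cores <- 4
chunk_id <- cut(seq_len(nrow(df_predict)), n_cores, labels = FALSE)
res <- mclapply(split(df_predict, chunk_id),
                function(ch) predict(fm, ch, type = "state"),
                mc.cores = n_cores)
prediction_parallel <- do.call(rbind, res)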

Also, different issue, if the user forgets the "type" argument in predict(), the error message is a bit cryptic and could be improved:

> predict(fm, df_predict)
Error in switch(type, state = { : EXPR must be a length 1 vector
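
A friendlier failure mode could validate the argument up front, along these lines (check_type() is a hypothetical helper, not unmarked's actual code):

check_type <- function(type, allowed = c("state", "det")) {
  if (missing(type))
    stop("argument 'type' is required; choose one of: ",
         paste(allowed, collapse = ", "), call. = FALSE)
  match.arg(type, allowed)   # clear error if type is not in 'allowed'
}
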
kenkellner commented 2 years ago

Thanks for testing that, glad it works as expected. I agree that running things in parallel is a possibility; I will look into implementing that in a separate pull request, along with the issue with type (some model types handle this already, I believe).