Closed jniedballa closed 1 year ago
@jniedballa Thank you for the suggested improvement. @kenkellner Is this something you could make time for? I agree that predict is too slow right now, and it fails with huge rasters. It seems like the batch processing would work well, and it could be done in parallel.
Thanks @jniedballa, this is very helpful. Definitely agreed on the speed issues with predict
. I'll look into this. It doesn't really make sense to me why it is faster to do the linearComb
- backTransform
- SE
sequence in batches instead of all at once. There's a series of matrix multiplications happening in in there - maybe there's an issue with those once the number of rows gets large enough?
It does seem to be the matrix multiplication in linearComb
:
Batching this is faster than doing the multiplication all at once. It doesn't help to divide the calculation up into parts either. I tried doing the multiplication in C++ instead and it was even slower (not surprising I guess, I'm sure base R is doing matrix calculations in C)
I think it makes the most sense to implement the batching loop in linearComb
and not in predict
since there are many individual predict
methods but most of them call linearComb
at some point.
I'm glad you're positive about the suggestion.
You would know best how to implement it, the multitude of methods in unmarked and how they are all implemented is a bit over my head. Your idea sounds reasonable though.
But even if you modify linearComb to loop over batches, you'd still need to change every predict method since currently they all loop over the rows of the prediction data frame which wouldn't be necessary anymore.
And of course there also needs to be a way of taking care of rows with NAs. In the suggested method a single NA in the prediction data frame causes the SE of the entire batch to be NA.
Ah, good points. This is going to be a bit more complicated than I thought. There is a huge amount of duplicated code across predict methods, so maybe it's finally time for me to re-write things in a standard way.
Looking at the predict methods, it might be rather straightforward for:
More difficult / non-standard for:
I don't know about these two, they call predict again internally.
My first impression is that it might be difficult to find a single standard method for all of these model types (but then I'm looking at the code for the first time).
Just speeding up predictions for the first four alone would be a huge help already. And I can imagine that these four are the most commonly used models of unmarked anyways.
Yeah, maybe it's best to speed things up for a few models quickly. I do think some kind of predict overhaul is needed though.
Working on this here
Hi Ken, thank you for your work on this. A colleague tested the latest version and found the prediction speed has improved significantly, especially for data frames with <1 million rows. His results were:
20k rows: 0.5s (was 30s) 1 mio rows: 55s (used to take hours) 10 mio rows: more than half an hour (was half a day)
Just rough numbers, but it shows that the method works in a real-world example. There seems to be a bit of an overhead with big data frames (memory-related?), but overall great to see the improved speed compared to the old predict function.
Thanks for testing it and glad to hear it is an improvement. When you and your colleagues are predicting on these large datasets (rasters?) are you typically actually using the calculated SE and CIs? I wonder if it's worth having an option to skip those calculations which I assume would speed things up even more.
The tests were done on data frames, not rasters. Would it make a difference? I thought predict() converts rasters to data frames first. But I agree it would be good to test nevertheless, we'll do that.
And my colleague and I agree that would be a valuable option to not return SE/CIs and have even faster predictions. When predict was too slow in the past I often took the raster values and model coefficients, ran the model formula manually and put the predicted values back into a raster. It was of course super fast since and didn't require any loop at all. So yes, that would be great to have as a convenient option directly in predict().
The help of stats::predict also says: "Many methods have a logical argument se.fit saying if standard errors are to returned."
Yes it should work the same either way, I was just speculating that the reason you had a million-row data frame was because you were doing some kind of spatial/raster work. And in my experience the SEs aren't as useful in that context.
I'll working on adding an se argument. I think some of the predict methods I added recently already have one.
Yes, it was a raster data set, but converted to data frame before passing it to predict (for easier testing of different number of rows). We'll do it again using the original raster as input and let you know how it compares. I agree that SEs are often not very useful in spatial predictions, and usually use CIs instead. Would you also make calculation of CIs optional?
Performance is virtually identical for the data frame and the raster method. Using the crossbill example from above, I get (for both data frame and raster):
10000 rows / cells: 0.3 sec 100000 rows / cells: 3.8 sec 1000000 rows / cells: 90 sec 2000000 rows / cells: 280 sec
As in my colleague's real-life example, it seems to progressively get slower with more cells (of course it is still lightning fast compared to before). And it's great there is no penalty for spatial work.
Here is some quick and dirty code used for testing:
nrows <- 1000000
# pepper in some NAs
number_nas <- 10000
which_nas <- sample(1:nrows, number_nas)
df_predict <- data.frame(intercept = 1,
ele = seq(0, 2200, length.out = nrows),
forest = seq(0, 100, length.out = nrows))
df_predict$forest[which_nas] <- NA
r_predict1 <- raster(matrix(df_predict$ele, nrow = 1000, ncol = 1000))
r_predict2 <- raster(matrix(rev(df_predict$forest), nrow = 1000, ncol = 1000))
stack_predict <- stack(list(ele = r_predict1,
forest = r_predict2))
system.time(
prediction_from_df <- predict(fm, df_predict, type = "state")
)
system.time(
prediction_from_stack <- predict(fm, stack_predict, type = "state")
)
Memory doesn't seem to be a issue (around 900Mb for RStudio when testing with 1 million cells). So big rasters or data frames can probably be processed in parallel (would require tiling / splitting them), and would then need to be merged after processing. Not sure if it makes sense to include that in predict(), or leave it to the user to implement if they need the extra performance.
Also, different issue, if the user forgets the "type" argument in predict(), the error message is a bit cryptic and could be improved:
> predict(fm, df_predict)
Error in switch(type, state = { : EXPR must be a length 1 vector
Thanks for testing that, glad it works as expected. I agree that running things in parallel is a possibility, I will look into implementing that in a separate pull request, along with the issue with type
(some model types handle this already I believe).
Dear unmarked maintainers, predict() can be very slow, which is most likely due to it looping over the rows of the prediction data frame one by one. I suggest a modification to predict which in my tests achieved a ~50x speedup compared to the current method (in a single-species single-season occupancy model). Example below.
Create data and model
Let's use the crossbill example (modified to make it a single-season model for demonstration:
And a prediction data frame with 10000 rows:
predict()
Normal predict() takes about 15 seconds:
vectorized (slow)
All functions that are used inside the prediction loop accept vectorized input (matrices), so it is possible to avoid the loop. With many rows, however, it becomes extremely slow. Example:
by modifying nrows, I get:
So there are considerable gains to be made by predicting on batches instead of individual rows, but too many rows make it slow.
vectorized (fast)
To take advantage of the speed gains with vectorization while avoiding low performance on many rows I suggest splitting the prediction data frame into batches and looping over these batches. Example:
Results are identical:
but it only took 0.26 sec instead of 15 sec. It seemed to be fastest with batch sizes between 50 and 100.
I imagine the method should generalize well to rasters and other model classes, but I didn't test that. split() currently gives a (benign) warning if the last batch is smaller than the previous ones. It's just for demonstration after all.
Code above is based on https://github.com/rbchan/unmarked/blob/master/R/unmarkedFit.R, lines 387-404.
Any way of speeding up predict() would be welcome. As much as I love unmarked, predict() being very slow has always been my main problem with the package (especially when predicting on rasters). Thank you for your work on the package. Best regards, Jürgen