pbs-assess / sdmTMB

:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
https://pbs-assess.github.io/sdmTMB/

get_index potential computation issues for very large prediction grids #334

Open Lewis-Barnett-NOAA opened 6 months ago

Lewis-Barnett-NOAA commented 6 months ago

Unfortunately, this is another one that is hard to offer an MWE for due to the model complexity, but I've found that get_index() produces NAs for the index and its SD when the prediction grid/prediction object includes too many points. In the example I was working with, splitting the same prediction grid into 2 subareas let the subarea indices compute as expected. If I can pin the issue down more definitively I'll update this issue.

Possible solutions:

1. Leave it up to the user to coarsen their prediction grid as needed, and perhaps make a note about this somewhere in the docs.
2. Include built-in functionality to coarsen the grid, to an extent controlled by a user-specified argument.
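For option 2, a minimal base-R sketch of what coarsening might look like (hypothetical helper `coarsen_grid`; assumes projected coordinate columns named X and Y, as in the package examples):

```r
# Snap X/Y to a coarser resolution and keep one point per coarse cell.
coarsen_grid <- function(grid, res = 4) {
  grid$X <- round(grid$X / res) * res
  grid$Y <- round(grid$Y / res) * res
  unique(grid[, c("X", "Y")])
}
```

Note that for area-weighted index calculations the cell area would also need to scale (roughly with res^2), so a real implementation would have to carry that through.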

sebpardo commented 5 months ago

I often encounter memory allocation errors due to very large prediction grids and cutting the grid into chunks works well for me; a function that automates this could be a third solution to this issue:

library(dplyr)
library(purrr)

# Split the prediction grid into chunks of at most `max_chunk_length` rows
max_chunk_length <- 500000
grid_chunks <- prediction_grid %>%
  group_split(group_id = row_number() %/% max_chunk_length)

# Predict on each chunk and back-transform to the response scale
predict_list <- map(
  grid_chunks,
  ~ fit$family$linkinv(predict(fit, newdata = .x)$est)
)
unlist(predict_list)

seananderson commented 5 months ago

Nice idea, @sebpardo. For grids that are still small enough to predict on for a single year at a time, the simplest would be to iterate over each year (or set of years). E.g. https://github.com/pbs-assess/sdmTMB/discussions/314#discussioncomment-8749844
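A minimal sketch of that year-by-year idea, assuming a fitted sdmTMB model `fit` with `time = "year"` and a full multi-year prediction grid `nd` (both hypothetical names here):

```r
library(sdmTMB)

# Predict and integrate one year at a time, then row-bind the indices.
index_by_year <- lapply(sort(unique(nd$year)), function(y) {
  nd_y <- nd[nd$year == y, , drop = FALSE]
  p <- predict(fit, newdata = nd_y, return_tmb_object = TRUE)
  get_index(p, bias_correct = TRUE)
})
ind <- do.call(rbind, index_by_year)
```

Each call only holds one year's prediction object in memory at a time, which is the point of the slicing.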

I'm in the midst of overhauling how extra time slices are handled and as part of that, get_index() should also become a bit more memory efficient with this kind of slicing.

I'm not sure how to catch this, other than maybe we can check for NAs in the index data frame and issue a message suggesting this may be the issue and how to work around it.
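That check could be as simple as something like this (hypothetical helper; `est`/`se` are the columns get_index() returns):

```r
# Hypothetical post-hoc check: message if the returned index contains NAs,
# which can indicate the prediction grid was too large.
check_index <- function(ind) {
  if (anyNA(ind$est) || anyNA(ind$se)) {
    message(
      "NAs detected in the index; the prediction grid may be too large. ",
      "Consider predicting/integrating in chunks (e.g., by year)."
    )
  }
  invisible(ind)
}
```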

We could also do some benchmarking and perhaps it makes sense to do the bias correction by year by default (somewhere I have a script where I was already doing this).

And be sure you're on the latest version because I made this overall more memory efficient recently.

Lewis-Barnett-NOAA commented 5 months ago

Good ideas, thanks fellas. The chunking of the prediction grid has worked better for me in practice than the aggregating, but seems slower. I'll keep experimenting through the latest updates.


seananderson commented 5 months ago

On second thought, the approach by @sebpardo above will work for prediction on large grids, but won't work if your goal is to do spatial integration with uncertainty, as with get_index().

I've been doing some memory profiling of various approaches:

[Figure: memory profiling of several prediction/index-calculation approaches]

It appears using lapply and do.call on a year-by-year prediction is a huge win on memory for large grids at the expense of time (mostly because of all the separate predictions, I imagine). It might be worth having an argument in get_index that integrates for a specific year or set of years so the prediction could still just be done once.

Interestingly, purrr::map_dfr is very memory hungry compared to map or lapply followed by converting the list to a data frame after.
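As a sketch of the memory-friendlier pattern (hypothetical `fit` and `grid_chunks` objects): collect results in a plain list first and bind once at the end, rather than binding incrementally with purrr::map_dfr().

```r
# Less memory-hungry: build a list, then do one rbind at the end.
pred_list <- lapply(grid_chunks, function(g) predict(fit, newdata = g))
pred_df <- do.call(rbind, pred_list)
```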

I'll post code later when I've worked this out more.

@MikkoVihtakari tagging you in this issue too.

KristaDBaker commented 3 months ago

I have been running into this issue a lot lately too, even when making the prediction grids relatively large. Has there been any movement on a more permanent fix (e.g., argument within get_index())? Thanks.

seananderson commented 3 months ago

I wrote an experimental function in a new branch 'index-split'. https://github.com/pbs-assess/sdmTMB/commit/13a55d44639f56d852e85a6916ba305a5509bfd2

Try it out and let me know if you have any issues or interface suggestions before I merge into main.

Install with:

pak::pak("pbs-assess/sdmTMB@index-split")

See the function ?get_index_split.

There can't just be an argument in get_index() because the way sdmTMB is set up, the prediction and index calculation are two steps. So, we need something that wraps those.
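For context, the two steps get_index_split() wraps look like this in the standard sdmTMB workflow (with a fitted model `fit` and prediction grid `nd`):

```r
# Step 1: predict, keeping the TMB object needed for index calculation.
p <- predict(fit, newdata = nd, return_tmb_object = TRUE)
# Step 2: integrate over the grid, with bias correction.
ind <- get_index(p, bias_correct = TRUE)
```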

Note this will be slower if you don't need it. It's a speed/memory tradeoff; see the rough figure above.

library(sdmTMB)
mesh <- make_mesh(pcod, c("X", "Y"), n_knots = 60)
m <- sdmTMB(
 data = pcod,
 formula = density ~ 0 + as.factor(year),
 time = "year", mesh = mesh, family = tweedie(link = "log")
)
nd <- replicate_df(qcs_grid, "year", unique(pcod$year))
ind <- get_index_split(m, newdata = nd, nsplit = 2, bias_correct = TRUE)
#> Calculating index in 2 chunks ■■■■■■■■■■■■■■■■                  50% | ETA:  0s
#> Calculating index in 2 chunks ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100% | ETA:  0s
ind
#>   year      est       lwr      upr  log_est        se  type
#> 1 2003 277556.6 198537.77 388025.4 12.53378 0.1709448 index
#> 2 2004 399682.6 311373.58 513036.9 12.89843 0.1273887 index
#> 3 2005 430225.5 327502.38 565168.4 12.97206 0.1391935 index
#> 4 2007 119509.7  86514.07 165089.6 11.69115 0.1648452 index
#> 5 2009 210258.6 151807.71 291215.1 12.25609 0.1661887 index
#> 6 2011 339420.7 262199.67 439384.3 12.73500 0.1317035 index
#> 7 2013 351455.2 258755.05 477365.7 12.76984 0.1562276 index
#> 8 2015 383241.4 285819.07 513870.4 12.85642 0.1496487 index
#> 9 2017 192367.9 136764.93 270576.6 12.16716 0.1740572 index

Created on 2024-06-26 with reprex v2.1.0

Lewis-Barnett-NOAA commented 3 months ago

Thanks Sean, I'll do some testing with this for the Bering groundfish biomass indices and let you know if I notice anything about how the speed/memory tradeoff works for our servers. We are also exploring some cloud computing options that could of course help solve these issues as well.
