usa-npn / cales-thermal-calendars

Estimating trends in phenology in the northeastern US
https://usa-npn.github.io/cales-thermal-calendars/spatial-trends-report.html
MIT License

Benchmark GAMs to pick a resolution #36

Closed: Aariq closed this issue 2 weeks ago

Aariq commented 1 month ago

In the stats report, I used a heavily aggregated version of the data for computational efficiency. Scaling back up to the original PRISM resolution (1km?) would likely be computationally prohibitive. I spoke with Theresa, and collaborators will likely want as high a resolution as we can get (i.e. there's no "sensible" level of aggregation for making inferences), so the next step is to do some benchmarking with different resolutions.

Things to consider:

Aariq commented 1 month ago

Gotta remember that bam() also has two ways to parallelize, although neither might be super helpful with an optimized BLAS.
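
Roughly, the two routes are something like this (just a sketch, not our code; the formula, data, and worker counts are placeholders):

```r
library(mgcv)
library(parallel)

# 1) Supply a socket cluster, which bam() uses for the big QR decomposition
#    (relevant when discrete = FALSE, I think)
cl <- makeCluster(4)
m1 <- bam(doy ~ s(x) + s(y), data = dat, method = "fREML", cluster = cl)
stopCluster(cl)

# 2) Multithreading via nthreads, most relevant with discrete = TRUE
m2 <- bam(doy ~ s(x) + s(y), data = dat, method = "fREML",
          discrete = TRUE, nthreads = 4)
```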

Aariq commented 1 month ago

I think I need a plan, so I'm adding @diazrenata to this issue so we can discuss at some point.

Aariq commented 1 month ago

I tested the discrete = TRUE option locally and on the HPC, with single and multiple threads, and none of those runs outperformed the default discrete = FALSE with method = "fREML".
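
For the record, the comparison looked roughly like this (a sketch only; the formula, data, and thread count are placeholders, not the actual benchmark code):

```r
library(mgcv)

# Default: discrete = FALSE with fREML
t_default <- system.time(
  bam(doy ~ s(y, x, k = 100), data = dat, method = "fREML")
)

# Discretized version, with multithreading
t_discrete <- system.time(
  bam(doy ~ s(y, x, k = 100), data = dat, method = "fREML",
      discrete = TRUE, nthreads = 4)
)

t_default["elapsed"]
t_discrete["elapsed"]
```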

The current strategy for benchmarking on the HPC is to fit a GAM using mgcv::bam() with the following function (in #26):

```r
fit_bam <- function(data, k_spatial, k_year = 20) {
  mgcv::bam(
    # spatial smooth + temporal smooth + space-time interaction
    doy ~ ti(y, x, bs = "cr", d = 2, k = k_spatial) +
      ti(year_scaled, bs = "cr", k = k_year) +
      ti(y, x, year_scaled, d = c(2, 1), bs = c("cr", "cr"),
         k = c(k_spatial, k_year)),
    data = data,
    method = "fREML"
  )
}
```

Here the resolution of the data is 50km, 25km, or 10km, and k_spatial is 50, 100, 200, or 400. I suspect 400 is not enough to satisfy k.check(), but I'll start here and try to plot the results in some way that might allow us to extrapolate. I can get the time each target took from the seconds column of tar_meta().
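
Something like this should pull the timings (the "bam" name filter is a hypothetical pattern for the model-fitting targets, not necessarily how they're named in the pipeline):

```r
library(targets)
library(dplyr)

# Per-target runtimes from the pipeline metadata
tar_meta() |>
  select(name, seconds) |>
  filter(grepl("bam", name)) |>
  arrange(desc(seconds))
```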

Aariq commented 1 month ago

I haven't peeked at the actual models yet, but here are some results from k.check() and the time that these models took to fit on the HPC.

[Image: k.check() results across resolutions and degrees of freedom]

This shows that at 10km or 5km resolution we need more than 800 degrees of freedom for this smooth, while coarser resolutions are satisfied with 400 or 800 degrees of freedom.

[Image: model fitting time vs. degrees of freedom by resolution]

The highest resolution model (5km) with 800 degrees of freedom takes only about 80 min to fit; however, the increase in fitting time is clearly non-linear as more df are added.

All models were fit with 5 CPUs and 25GB of RAM.
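
For reference, the checks above could be collected with something like this (a sketch; the models list is illustrative, not an object in the pipeline):

```r
# `models` is an illustrative named list of fitted bam objects,
# e.g. list("25km_400df" = m1, "10km_800df" = m2)
checks <- lapply(models, mgcv::k.check)

# Combine into one table for comparison across resolutions/df
do.call(rbind, lapply(checks, as.data.frame))
```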

Aariq commented 1 month ago

The next step is to get maps of average slopes for some of these models to see if they are qualitatively different. To start with, maybe 50km/400df, 25km/400df, 25km/800df, and 10km/800df.
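
One way to do this might be finite differences on the fitted surface (just a sketch, not the pipeline code; `m` is a fitted model from fit_bam() and `grid` a prediction grid with columns x, y, and year_scaled):

```r
eps <- 0.01

pred_lo <- predict(m, newdata = grid, type = "response")
pred_hi <- predict(m, newdata = transform(grid, year_scaled = year_scaled + eps),
                   type = "response")

# Trend in doy per unit of year_scaled at each grid point
grid$slope <- (pred_hi - pred_lo) / eps

# Average slope per cell across years, ready for mapping
slope_map <- aggregate(slope ~ x + y, data = grid, FUN = mean)
```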

Aariq commented 4 weeks ago

It looks like there's not a big difference between 400 and 800 df at 25km resolution. The real limitation, I think, is actually calculating the estimated slopes, so more testing is needed there.

Aariq commented 2 weeks ago

Even better with discrete = TRUE and samfrac = 0.1:

[Screenshot, 2024-08-19: benchmark results with discrete = TRUE and samfrac = 0.1]
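
That configuration would look roughly like this (same formula as fit_bam() above; the wrapper itself is just a sketch, not necessarily how the pipeline passes these arguments):

```r
fit_bam_fast <- function(data, k_spatial, k_year = 20) {
  mgcv::bam(
    doy ~ ti(y, x, bs = "cr", d = 2, k = k_spatial) +
      ti(year_scaled, bs = "cr", k = k_year) +
      ti(y, x, year_scaled, d = c(2, 1), bs = c("cr", "cr"),
         k = c(k_spatial, k_year)),
    data = data,
    method = "fREML",
    discrete = TRUE, # discretize covariates for faster set-up and fitting
    samfrac = 0.1    # do initial fitting steps on a 10% random subsample
  )
}
```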