Some modeling tips - Githubissues

nicholasjclark commented 1 month ago

Hi @serena-psc, I started drafting a response to your Cross-Validated post but I have just seen that it was closed. Sorry about that, it seems the moderators want a more focused question from you. But anyway, here are some things you could consider:

library(dplyr)
library(mgcv)
library(marginaleffects)
library(gratia)
library(ggplot2); theme_set(theme_bw())

# Use bam and the betar() family for more efficiency and a more appropriate
# likelihood
gam1 <- bam(Percent ~ GroupName + 
              CycleLine +
              s(JulianDate) + 
              s(JulianDate, by = GroupName) +
              s(JulianDate, by = CycleLine),
            data = dat,
            method = "fREML",
            family = betar(),
            discrete = TRUE,

            # select = TRUE will place extra penalties to help with
            # variable selection
            select = TRUE,
            nthreads = 2)
summary(gam1)

# Plot partial effects (i.e. smooths conditional on all other effects being
# set to zero) using gratia's draw() functionality
draw(gam1)

# With so many interactions, it is difficult to make sense of the model outputs by 
# simply looking at component smooths. Instead, it is far easier to use plot_predictions
# to inspect model assumptions and make inferences
plot_predictions(gam1, 
                 condition = c('JulianDate', 'CycleLine', 'GroupName'))

# Comparisons can be used for targeted inferences:
# How do stocks compare over JulianDate, on average (averaging across all
# levels of CycleLine)?
plot_comparisons(gam1, 
                 variables = "GroupName", 
                 condition = c("JulianDate")) +
  geom_hline(yintercept = 0, linetype = 'dashed')

# How do Cycles compare over JulianDate, on average (averaging across all
# stocks)?
plot_comparisons(gam1, 
                 variables = "CycleLine", 
                 condition = c("JulianDate")) +
  geom_hline(yintercept = 0, linetype = 'dashed')

# Look for autocorrelated residuals for each series
dat %>%
  mutate(resids = residuals(gam1, type = 'working')) %>%
  group_by(GroupNameCycleLine) %>%
  arrange(JulianDate) %>%
  mutate(autocor = Box.test(resids, type = 'Ljung-Box')$p.value) %>%
  summarise(autocor = mean(autocor),
            sig = mean(autocor) < 0.05)

# Still present, so need to consider modeling this; look into
# the 'rho' and 'AR.start' arguments in bam()

serena-psc commented 1 month ago

Hi Nicholas,

Thank you! I really appreciate your response and advice. I've edited my original post to hopefully make my questions more focused and also to take some of your feedback into account.

I looked into setting the rho and AR.start arguments in bam(). I used a variation of gam3from my original post:

# First mark the start of each time series as TRUE, and all other data points as FALSE
data %<>%
  arrange(GroupName, Year, CycleLine, JulianDate) %>% 
  as.data.frame()

simdat <- itsadug::start_event(data, column = "JulianDate", 
                                         event = c("GroupName", "CycleLine"),  # is this correct? Or should I be doing it by Year?
                                         label.event = "Event")

r1 <- itsadug::start_value_rho(gam3, plot = TRUE)
# r1 is 0.8802352

gam5 <- bam(Percent ~ GroupName + CycleLine + 
          s(JulianDate) + 
          s(JulianDate, by = GroupName) +
          s(JulianDate, by = CycleLine), 
        data = simdat, 
        rho = r1, 
        AR.start = simdat$start.event,
        method = "fREML", 
        family = betar(), 
        discrete = T, 
        select = T, 
        nthreads = 2)

but when I look at the uncorrected vs. 'corrected' residuals I'm still seeing autocorrelation:

# Uncorrected versus corrected residuals:
par(mfrow = c(1, 2), cex = 1.1)
itsadug::acf_resid(gam3)
itsadug::acf_resid(gam5)
enter image description here

No worries if you don't have the bandwidth to answer but I'm curious how can I better account for the autocorrelation? My understanding that rho is just for an AR1 but the results of auto.arima suggest that the process is more complicated? And the r1 value from itsadug::start_value_rho(gam3, plot = TRUE) doesn't match the output of forecast::auto.arima(residuals(gam3))$coef.

forecast::auto.arima(residuals(gam3))$coef produces:

Either way, thanks again and best wishes!

nicholasjclark commented 1 month ago

Hi @serena-psc, I suppose that all depends on what the main questions are for this analysis. If you want to capture all of the temporal variation, then simply setting k to be larger will help to do that. But this will make it more challenging to understand broader average effects of your grouping factors (Cycleline and GroupName). Just from looking at a few series in your data, there seem to be some repeated seasonal effects in some of them so I'd say there are periodicities that aren't being captured in your current model. It may be worth looking into those as well

serena-psc / ExampleData

Some modeling tips #1