l_densCheck's default dFun should use the model's scale not the residual's standard deviation

fabian-s commented 6 years ago

Not sure how often this comes up if you don't set the scale manually but it seems dangerous to me:

library(mgcv)
library(mgcViz)
n <- 1e3
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame("x1" = x1, 
                  "x2" = x2, 
                  "y" = rnorm(n, sin(x1) + x2^2, sd = 1))
# fit with much too small scale:
b_scaled <- gam(y ~ s(x1) + s(x2), data=dat, scale = .1)
qq.gam(b_scaled)


check1D(getViz(b_scaled), x = "x1") + l_densCheck() + scale_fill_gradient2()
#> Scale for 'fill' is already present. Adding another scale for 'fill',
#> which will replace the existing scale.

# looks ok (unstructured wiggles), but we *know* the model variance is much too small:

# Comparing with Gaussian density with model's scale, *not* empirical sd of residuals:
fasiolo_dist <- function(.ed, .gr, .y) {
  td <- dnorm(.gr, 0, sd = .1)
  sign(.ed - td) * (abs(sqrt(.ed) - sqrt(td)))^1/3
}

check1D(getViz(b_scaled), x = "x1") + l_densCheck(dFun = fasiolo_dist) + 
  scale_fill_gradient2()
#> Scale for 'fill' is already present. Adding another scale for 'fill',
#> which will replace the existing scale.

# looks as it should: shows underdispersion of residuals

Created on 2018-09-11 by the reprex package (v0.2.0).

mfasiolo commented 6 years ago

Good point, I think that one possible solution would be to normalize the residuals using the residuals simulated from the model, as done in l_gridCheck1D.

By doing that the normalization will work also for general cases (e.g. gaulss), when it is not possible to normalize using, for instance, fittedGam$sig2 (for some families the variance function is not provided by mgcv at all).

mfasiolo commented 6 years ago

Ok, the easiest solution to avoid the potential misuse of this function is using the residuals types "tnormal" or "tunif", under which the reference distribution are standard normal or U(0, 1). If residuals "tnormal" or "tunif" are unavailable (because the CDF of the response distribution is unavailable), then an alternative solution is providing an user-defined distance. The function now warns about this potential problem, which is explained in the updated documentation (?l_densCheck). The relevant commit is 36257ff334a04515e4975b10d1ec283a2775ac68

mfasiolo / mgcViz

l_densCheck's default dFun should use the model's scale not the residual's standard deviation #45