Slightly off topic: estimating smooths from gam()

This is slightly off topic, but it came up in a meeting today and I was wondering if you had a good solution.

I'm working with a student here on a model we came up with and are fitting with mgcv::gam(). One thing we were interested in was looking at coverage probabilities for confidence intervals created based on gam's estimate of the variance/covariance matrix. We were getting poor performance, so we went down to the simplest case we could come up with: a linear regression with a smooth effect of a scalar covariate. Coverage was still quite poor.

Here's what we did: we simulated data from the model $E[Y] = 20 + 3x$, but fit it with gam(y ~ s(x)). The problem, we realized, is that the model is over-parameterized: since it includes an intercept, the model has no way of knowing whether part of the "intercept" belongs to "f(x)". So estimation would be impossible without some constraints on f. Do you know how these constraints are determined?
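To make the setup concrete, here is a minimal sketch of that simulation. The sample size, the uniform design for x, the noise SD, and the seed are my own assumptions, not what we actually ran. It relies on my understanding that mgcv by default centers each smooth with a sum-to-zero constraint over the observed covariate values, and it just checks where the "intercept" ends up in the fit.

```r
library(mgcv)

set.seed(1)
n <- 200                            # sample size (assumed)
x <- runif(n)                       # scalar covariate on [0, 1] (assumed design)
y <- 20 + 3 * x + rnorm(n, sd = 1)  # E[Y] = 20 + 3x, noise SD assumed

fit <- gam(y ~ s(x))

# Estimated smooth term at the observed x values, excluding the intercept
f_hat <- predict(fit, type = "terms")[, "s(x)"]

# Average of the smooth over the observed x values
smooth_mean <- mean(f_hat)

# Fitted intercept next to the sample mean of y
intercept <- coef(fit)[["(Intercept)"]]

print(c(smooth_mean = smooth_mean, intercept = intercept, mean_y = mean(y)))
```

If the centering assumption holds, the smooth averages to (numerically) zero over the data, and the fitted intercept tracks mean(y) rather than 20.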

For the above scenario, we could fix it by just removing the intercept, or by considering the intercept to be part of f(x) and just looking at $\hat y$. Doing this does result in confidence intervals with 95% coverage, btw. But what if the model were $\alpha + f(x) + g(z)$? Removing $\alpha$ doesn't help you untangle f vs. g. So in general, what are $\hat f$ and $\hat g$ trying to estimate? And in particular, how can we evaluate the standard error estimates that gam() provides? Any ideas?
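Here is a rough sketch of the kind of coverage check I mean, for the single-smooth case. The settings (n, number of replicates, noise SD, uniform x) are assumptions for illustration. The centered target `f_ctr` is only one candidate answer to "what is $\hat f$ estimating", assuming sum-to-zero centering over the observed x values. `covers` is a small helper defined here, not part of mgcv.

```r
library(mgcv)

set.seed(42)
n     <- 200                 # sample size (assumed)
n_rep <- 100                 # number of simulated data sets (assumed)
x     <- runif(n)            # fixed design, reused across replicates
mu    <- 20 + 3 * x          # true mean, E[Y] = 20 + 3x
f_raw <- 3 * x               # naive target for the smooth term
f_ctr <- 3 * (x - mean(x))   # target under sum-to-zero centering (assumption)

# TRUE where a pointwise normal-theory 95% interval contains the target
covers <- function(est, se, target) abs(est - target) <= qnorm(0.975) * se

hits <- replicate(n_rep, {
  y   <- mu + rnorm(n, sd = 1)
  fit <- gam(y ~ s(x))
  lp  <- predict(fit, se.fit = TRUE)                  # y-hat, intercept included
  tm  <- predict(fit, type = "terms", se.fit = TRUE)  # smooth term only
  c(yhat     = mean(covers(lp$fit, lp$se.fit, mu)),
    term_raw = mean(covers(tm$fit[, 1], tm$se.fit[, 1], f_raw)),
    term_ctr = mean(covers(tm$fit[, 1], tm$se.fit[, 1], f_ctr)))
})

# Coverage averaged over x values and replicates
coverage <- rowMeans(hits)
print(round(coverage, 3))
```

The `yhat` row is the "treat the intercept as part of f(x)" check. With two smooths there is no such shortcut, since each term would need its own target.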

Btw, we will run into this again in pcox, where the "intercept" is absorbed into the baseline hazard, which makes it even more troublesome.
