nignatiadis / SmoothingSplines.jl

Cubic smoothing splines in Julia.

Questions #2

Closed mauro3 closed 7 years ago

mauro3 commented 7 years ago

Sorry to abuse the issue-tracker for questions: You mention "conversion between regularization parameter λ and degrees of freedom". How would I do that? Is there a way to get the knots where the individual splines join? (or is that the wrong way to look at it entirely?)

nignatiadis commented 7 years ago

Hi! I am not sure I completely understand your question, but I will take a stab at it. I think the confusion might be because the term degrees of freedom is "flawed", or at the very least often not formally defined.

It's easiest to motivate this using standard linear regression: Y = X\beta + \varepsilon, where X is your n×p predictor matrix, \beta (p×1) is your coefficient vector, \varepsilon is the noise and Y is the observed response.

At least intuitively, the degrees of freedom in this model are p. We would like a more general definition, though, one that applies to a larger class of models.

Note that if X has full rank, then the least squares estimate is \hat{\beta} = (X^T X)^{-1}X^T Y and your fitted (predicted) Ys are:

\hat{Y} = X\hat{\beta} = X (X^T X)^{-1} X^T Y

So if we call H = X (X^T X)^{-1} X^T (the so-called hat matrix), then \hat{Y} = HY.

It turns out that one can show that Trace(H) = p. Hence we take Trace(H) as the definition of degrees of freedom (there are further generalizations, but this one is enough for smoothing splines).
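To make this concrete, here is a quick numerical check (a minimal sketch; the dimensions are arbitrary):

using LinearAlgebra, Random

Random.seed!(1)
n, p = 100, 5
X = randn(n, p)             # a full-rank n×p predictor matrix (almost surely)
H = X * inv(X' * X) * X'    # the hat matrix
tr(H)                       # ≈ 5.0, i.e. p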

To see that this is more generally applicable, consider the following: Let λ > 0 and define (I being the identity matrix):

S = X (X^T X + λI)^{-1} X^T

and let \hat{Y} = S Y

By adding this diagonal term to X^T X, we "damp down" our predictions, i.e. we regularize ("ridge"/Tikhonov regularization). Intuitively, this model should have fewer degrees of freedom than p (and in fact, the larger λ, the smaller the degrees of freedom).

Indeed, with the definition df = Trace(S), it turns out that for λ = 0, df = p, while as λ goes to infinity, df goes to 0, so at least this definition makes some sense.
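A minimal sketch of this behavior (ridge_df is just an illustrative helper, not part of any package):

using LinearAlgebra

# Effective degrees of freedom of the ridge smoother S = X (X'X + λI)⁻¹ X'
ridge_df(X, λ) = tr(X * inv(X' * X + λ * I) * X')

X = randn(100, 5)
ridge_df(X, 0.0)    # = 5.0, i.e. p
ridge_df(X, 10.0)   # strictly between 0 and p
ridge_df(X, 1e6)    # ≈ 0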

Now what is a smoothing spline? In contrast to a regression spline, here we place knots at every sample point x_i, resulting in a design matrix X of dimension n × n! Now imagine running standard linear regression of Y onto X: if X has full rank, it is invertible, so:

H = X(X^T X)^{-1} X^T = I

\hat{Y} = I Y = Y

Degrees of freedom are equal to n.
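You can check this numerically (again just a sketch, with a random square design):

using LinearAlgebra

X = randn(10, 10)                  # n = p = 10, almost surely invertible
H = X * inv(X' * X) * X'
H ≈ Matrix{Float64}(I, 10, 10)     # true: Ŷ = HY = Y, and tr(H) = 10 = n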

This is silly, since we have just interpolated the data. But now do the ridge trick: add λ times some positive semidefinite matrix V (in this case not diagonal) to X^T X, i.e. take

S_λ = X( X^T X + λV)^{-1} X^T

and fitted values \hat{Y} = S_λ Y

then as λ grows, your degrees of freedom shrink. It turns out that penalizing the curvature of your fitted function (with regularization parameter λ) corresponds to exactly this type of shrinkage, for an appropriate matrix V.

Hence given λ, you get your smoother matrix S_λ, and thus the degrees of freedom df = Trace(S_λ). That's a one-to-one mapping, so given the degrees of freedom, you can figure out which λ produced them.
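Since df(λ) = Trace(S_λ) decreases monotonically in λ, you can invert the mapping numerically, e.g. by bisection on a log scale. A sketch (smoother_df and lambda_for_df are hypothetical helpers, assuming X and the penalty matrix V are given):

using LinearAlgebra

# df(λ) = tr(S_λ) for the penalized smoother S_λ = X (X'X + λV)⁻¹ X'
smoother_df(X, V, λ) = tr(X * inv(X' * X + λ * V) * X')

# Invert the monotone map λ ↦ df by bisection on a log scale.
function lambda_for_df(X, V, target_df; lo=1e-8, hi=1e8, iters=100)
    for _ in 1:iters
        mid = sqrt(lo * hi)
        # df decreases in λ: too many degrees of freedom means λ is too small
        smoother_df(X, V, mid) > target_df ? (lo = mid) : (hi = mid)
    end
    sqrt(lo * hi)
end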

Sometimes people prefer to think of it one way, sometimes the other (and degrees of freedom, even though flawed, at least provides one way of measuring the complexity of different models).

Does this make sense? The Elements of Statistical Learning is a great book to learn about these things.

mauro3 commented 7 years ago

Thanks for your detailed answer! I guess it mostly boils down to my ignorance of statistics, which will stay that way due to time constraints... What I'm after is trying to understand the lambda parameter and how to set it. Degrees of freedom is something I, a deterministic modeler, have a grasp on (although not in the statistical sense).

So how do I calculate Trace(S_λ) from spl?

Maybe another way to ask my question would be: what is the smallest wavelength present in the smoothing spline? (I would expect the smallest wavelength to be somewhere around the smallest distance between the knots used.) For example, if I take a noisy sine curve sin(x) + 0.1*randn(...) and would like to reproduce it, trial and error gives me lambda = 0.01 .. 1.0. But that lambda depends on the frequency of the sine wave, and I need a different lambda when fitting sin(x/10) + randn(...). Yet I expect the DOFs to be the same for both frequencies. So, how do I choose lambda to keep the DOFs similar?

mauro3 commented 7 years ago

Trial and error gives me a transform lambda = 0.001 * wave_length^3. Could this be right?

randy3k commented 7 years ago

@mauro3

Did you figure out how to extract the smoothing matrix S_λ?

mauro3 commented 7 years ago

This https://github.com/mauro3/VAWTools.jl/blob/2ce38557af2fce8056af79880192209d6636dfb2/src/VAWTools.jl#L1446 is what I use; it seems to work. But it was just trial and error.

randy3k commented 7 years ago

Thanks for the link. But I didn't mean the trace; I meant the actual smoothing matrix...

randy3k commented 7 years ago

Found it, though it is not very efficient in computing R \ Q' and the inv (and it doesn't handle ties):

using LinearAlgebra, SmoothingSplines

function smoothingmatrix(spl::SmoothingSpline)
    h = diff(spl.Xdesign)   # knot spacings (note: ties in x are not handled)
    λ = spl.λ
    # Q and R from the Reinsch algorithm (internals of SmoothingSplines)
    Q = SmoothingSplines.ReinschQ(h)
    R = SmoothingSplines.ReinschR(h)
    # S_λ = (I + λ Q R⁻¹ Qᵀ)⁻¹, so that Ŷ = S_λ Y
    inv(I + λ * (Q * (R \ Q')))
end
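With that, the degrees of freedom discussed above are just the trace of this matrix. A usage sketch (the data here are made up):

using LinearAlgebra, SmoothingSplines

X = collect(range(0, 2π; length=50))
Y = sin.(X) .+ 0.1 .* randn(50)
spl = fit(SmoothingSpline, X, Y, 0.05)   # λ = 0.05
S = smoothingmatrix(spl)
tr(S)                                    # effective degrees of freedom for this λ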