hendersontrent / GAM.jl

Fit, evaluate, and visualise generalised additive models (GAMs) in native Julia

build penalty matrix for multiple splines #32

Closed: yahrMason closed this issue 3 weeks ago

yahrMason commented 1 year ago

This is looking awesome. Super well done. I am just putting this out there to keep track of the idea. On page 177 of his text, Wood describes how to build the penalty matrix for a GAM with more than one term. Described in R terms (what's comfortable for me), it basically consists of a cbind of the basis matrices, with a block-diagonal matrix of the penalized difference matrices rbound underneath. I provided a code example below. This allows a single optimization run over a vector of penalty terms. Thankfully, due to Julia's broadcasting capabilities, I think there should be minimal refactoring necessary. I think it might also help with the fit, but I am not 100% certain that is the case. I am more than happy to look into building this out, but I know development has been fast and furious so I don't want to get in the way and break things. Below is a code example assuming the trees dataset you are using in your tests.

using SparseArrays
using GLM # for lm below

# response variable
y = trees.Volume

# create vector of input variables
x = [trees.Girth, trees.Height]

# broadcast basis functions
Basis = QuantileBasis.(x,10,4)
X = BasisMatrix.(Basis, x) # Basis Matrix
D = DifferenceMatrix.(Basis) # D penalty matrix

# vector of penalties
λ = [10, 10]

# penalty design matrix
X_p = Matrix(
    vcat(
        # cbind the basis matrices
        hcat(X...), 
        # create a block diagonal matrix of penalized differences
        blockdiag((sqrt.(λ).*sparse.(D))...)
    )
)

# augmented penalty response
y_p = vcat(y, repeat([0],sum(first.(size.(D)))))

# fit model
lm(X_p, y_p)
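
For context, this augmentation is just the penalized least-squares trick written as ordinary least squares on a stacked system (restating why it works rather than adding anything new):

$$ \| y_p - X_p \beta \|^2 = \| y - X \beta \|^2 + \sum_j \lambda_j \| D_j \beta_j \|^2 $$

where X is the column-bound basis matrices and β stacks the per-smooth coefficients β_j, so a single lm call on (X_p, y_p) fits all the smooths with their penalties at once.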

Again, awesome work!

hendersontrent commented 1 year ago

Ahh! Thanks for this — my copy of Wood’s text is elsewhere at the moment so I have been building from memory. Thank you, I’ll take a look into this and comment back if I can’t solve it.

hendersontrent commented 1 year ago

Hi @yahrMason , since I don't have my copy of the text with me, can you please remind me (at your leisure) how he describes handling situations where you have both covariates requiring a smooth and covariates that don't (i.e., discrete predictors)?

yahrMason commented 1 year ago

Hi @hendersontrent, that's a good question. As I am reading through the text, he builds up to a GAM from scratch (using the familiar trees dataset) with two smooth terms. As far as I can tell he doesn't go into detail about adding in discrete terms, other than using {mgcv} to estimate models with them. My intuition suggests that column-binding them to the penalized design matrix makes sense, but I can't be certain that is the "appropriate" approach. He does go into detail later in the text about needing different smoothing selection criteria for handling binary data (as opposed to the PIRLS approach I mentioned in #33). It might warrant diving into the {mgcv} source code.

hendersontrent commented 1 year ago

Alrighty, I think I'll work on getting the case where all predictors are smooths working first, then worry about extensibility to combinations!

hendersontrent commented 6 months ago

@yahrMason reviving this now I’m trying to get my head back in it! Are you saying this approach replaces the first for loop over the list of covariates in gam_fit.jl?

hendersontrent commented 6 months ago

One question I have: while broadcasting makes perfect sense, how do we ensure that the knots and degrees the user specifies for each covariate actually receive those (potentially different) values? Your line Basis = QuantileBasis.(x,10,4) hard-codes the values, but we need to account for different arguments to QuantileBasis for each covariate.

yahrMason commented 6 months ago

hey @hendersontrent. Yes, effectively this could replace the loop. I think we can still use a loop here instead of broadcasting to build a different basis for each covariate (with varying degrees). But we would want to collect those in a list and then run the following code to build the final design matrix, since I think you want to optimize the lambdas all at once.

# penalty design matrix
X_p = Matrix(
    vcat(
        # cbind the basis matrices
        hcat(X...), 
        # create a block diagonal matrix of penalized differences
        blockdiag((sqrt.(λ).*sparse.(D))...)
    )
)
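
A rough sketch of that per-covariate loop (the knot/degree pairs here are purely illustrative):

# per-covariate (knots, degree) arguments
basis_args = [(10, 4), (12, 5)]

Basis = []
for i in 1:length(x)
    push!(Basis, QuantileBasis(x[i], basis_args[i]...))
end

X = BasisMatrix.(Basis, x)   # list of basis matrices
D = DifferenceMatrix.(Basis) # list of difference matrices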
yahrMason commented 6 months ago

In the above, X_p and y_p would represent what is returned by the PenaltyMatrix function. That could then be used in the subsequent fitting functions. Maybe the task is to generalize the PenaltyMatrix function to accept a list of Basis objects and return X_p and y_p as shown above?
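
Purely as an illustration of that signature (this vectorized method is hypothetical, not existing GAM.jl code):

# Hypothetical generalization: accept vectors of Basis objects and penalties,
# return the stacked penalized design matrix and augmented response.
function PenaltyMatrix(Basis::AbstractVector, λ::AbstractVector, x::AbstractVector, y::AbstractVector)
    X = BasisMatrix.(Basis, x)   # one basis matrix per covariate
    D = DifferenceMatrix.(Basis) # one difference matrix per covariate
    X_p = Matrix(
        vcat(
            hcat(X...),                            # cbind the basis matrices
            blockdiag((sqrt.(λ) .* sparse.(D))...) # block diagonal of penalized differences
        )
    )
    y_p = vcat(y, zeros(sum(first.(size.(D)))))    # augmented response
    return X_p, y_p
end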

hendersontrent commented 6 months ago

@yahrMason how about λ? Since in my code so far I am optimising it for each covariate independently (as per the code below), should I keep doing that optimisation independently as it currently stands (meaning PenaltyMatrix would then accept an array of Basis objects and an array of λ_opt -- one for each covariate)?

Basis = QuantileBasis(x, k, degree)
λ_opt = OptimizeGCVLambda(Basis, x, y, optimizer)
Xp_opt, yp_opt = PenaltyMatrix(Basis, λ_opt, x, y)

hendersontrent commented 6 months ago

@yahrMason below is my current work in progress of the refactor. The X_p build fails due to dimension mismatch, but I just wanted to check:

  1. Does this procedure look right to you (I didn't go the broadcast route since I want to flexibly handle non-smooth covariates which is the if lookup[i, 4] == true conditional)?
  2. Is this the right handling of non-smooth covariates to ensure a consistent once-through procedure rather than what I previously had?
  3. Any thoughts on the dimension mismatch?
intercept = DataFrame(Intercept = ones(size(data, 1)))
data = hcat(intercept, data)
ModelFormula = split(ModelFormula, " ~ ")[1] * " ~ :Intercept + " * split(ModelFormula, " ~ ")[2]
GAMForm = ParseFormula(ModelFormula)
y_var = GAMForm.y
y = data[!, y_var]

BasisMatrices = []
Differences = []
λs = []

for i in 1:nrow(GAMForm.covariates)
    if lookup[i, 4] == true
        col_data = data[!, GAMForm.covariates[i, 1]]
        knots = quantile(col_data, range(0, 1; length = lookup[i, 2]))
        basis = BSplineBasis(GAMForm.covariates[i, 3], knots)
        X = BasisMatrix(basis, col_data) # Basis Matrix
        D = DifferenceMatrix(basis)
        λ = OptimizeGCVLambda(basis, col_data, data[:, y_var], optimizer)
        push!(BasisMatrices, X)
        push!(Differences, D)
        push!(λs, λ)
    else
        push!(BasisMatrices, data[!, lookup[i, 1]])
        D = diffm(diagm(0 => ones(length(data[!, lookup[i, 1]]))), 1, 2)
        λ = 1
        push!(Differences, D)
        push!(λs, λ)
    end
end

# Build penalised design matrix

X_p = Matrix(
    vcat(
        # cbind the basis matrices
        hcat(BasisMatrices...), 
        # create a block diagonal matrix of penalized differences
        blockdiag((sqrt.(λs).*sparse.(Differences))...)
    )
)

# Build augmented penalty response variable

y_p = vcat(y, repeat([0],sum(first.(size.(Differences)))))
yahrMason commented 6 months ago

Yeah, this looks pretty close. I think there is a way to adjust the OptimizeGCVLambda function to accept an array of bases and optimize all the lambdas at once. I can take a stab at that this weekend.

hendersontrent commented 6 months ago

That makes sense. I can try to have a go before then too. Any thoughts on how the non-smooth terms should be handled in this pipeline? Do they get appended to X_p after the optimisation procedure, or should they be included in that procedure?

hendersontrent commented 6 months ago

@yahrMason #36 now collates my ongoing work for this, if you are looking for an up-to-date reference

yahrMason commented 6 months ago

It's not clear from the text how non-smooth predictors should be handled. My intuition suggests that it all should be done in one optimization step.

yahrMason commented 5 months ago

@hendersontrent I figured out what is causing the dimension mismatch. Will try to grab some time tonight to resolve. Will keep you posted on my progress.

hendersontrent commented 5 months ago

@yahrMason awesome! Sorry I wanted to work on this package over the weekend but had to work on some others!

yahrMason commented 5 months ago

@hendersontrent here is an updated calculation for computing the design matrix and GCV. This could probably be split into separate functions for building the penalty matrix and computing the GCV, but I kept it all in one place to be concise. I also updated the GCV math to make it a bit simpler to read and to follow the code in the text more closely.
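
For reference, the score the function below computes is the GCV criterion

$$ \mathcal{V}_g = \frac{n \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\left[ n - \mathrm{tr}(A) \right]^2} $$

with tr(A) taken over the first n diagonal entries of the hat matrix (the effective degrees of freedom), which is exactly what the last few lines of the function evaluate.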

function GCV(param::AbstractVector, BasisMatrices::AbstractVector, y::AbstractVector, Differences::AbstractVector)
    n = length(y)
    k = length(BasisMatrices)
    lambdas = param[1:k]

    ### Build Design Matrix
    # Define Design Matrix Intercept
    X_int = vcat(
            repeat([1], n),
            repeat([0], sum(size.(Differences,1)))
        )
    # Initialize list of Design Matrix Elements
    X_elements = []
    for i in 1:k
        #  Get Basis Matrix For Predictor
        X_i = BasisMatrices[i]
        # Build List of Empty Difference Matrices
        D_i = [zeros(size(D,1), size(Differences[i],2)) for D in Differences]
        # Add in Penalized Difference Matrix
        D_i[i] = sqrt(lambdas[i]) * Differences[i]
        # Concatenate Penalized Difference Matrix
        D_i = vcat(D_i...)
        # Append to Design Matrix
        push!(X_elements, vcat(X_i, D_i))
    end
    # Could add more non-smooth predictors here...

    X = Matrix(hcat(X_int, X_elements...))
    Y = vcat(y, repeat([0],sum(size.(Differences,1))))

    ### Compute GCV
    # fit regression model
    b = X \ Y # coefficients
    H = X * inv(X' * X) * X' # Hat Matrix
    trA = sum(diag(H)[1:n]) # EDF
    y_hat = X * b # Fitted values
    rsd = y - y_hat[1:n] # residuals
    rss = sum(rsd.^2) # residual SS
    sig_hat = rss/(n-trA) # residual variance
    gcv = sig_hat*n/(n-trA) # GCV score
    return gcv
end

The function works for the various inputs I have tested manually. But when I put it through the optimizer I get ERROR: SingularException(28), which looks to be coming from the hat matrix calculation. I am not 100% sure whether that is because the optimizer reached a pathological region of the search space or whether there is something else going on.
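
As a rough sketch (not something from the text), the same quantities can be computed without forming inv(X' * X), which is the step that throws the SingularException:

F = qr(X)                  # thin QR of the augmented design matrix
b = F \ Y                  # least-squares coefficients without the explicit inverse
Q = Matrix(F.Q)            # thin Q factor; the hat matrix is Q * Q'
trA = sum(abs2, Q[1:n, :]) # same EDF as sum(diag(H)[1:n])
y_hat = X * b              # fitted values
# A pivoted QR, qr(X, ColumnNorm()), additionally guards against rank deficiency.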

hendersontrent commented 5 months ago

Legendary @yahrMason! One question -- do we need the final manual regression code at the end beneath ### Compute GCV, or can we replace it with a call to GLM.jl and extract the residuals straight from there (especially since the response variable could follow some distribution other than Gaussian)? I cannot find my copy of Wood's textbook to validate that idea though...

Also, can you please share the code you used to test? Just want to ensure we have the same pipeline of data -> OptimizeGCVLambda which calls GCV.

yahrMason commented 5 months ago

@hendersontrent. We certainly could use GLM.jl. I tried going down that route but couldn't figure out how to extract the hat matrix from a fitted model object, so I went manual.

Here is the code I am running:

using RDatasets, DataFrames
using BSplines, Random, LinearAlgebra, Statistics, SparseArrays
using Optim

trees = dataset("datasets", "trees");

x = [trees.Girth, trees.Height]
y = trees.Volume
param = [1,1]
basisArgs = [(10,4), (12,5)]
Basis = []
for i in 1:length(x)
     push!(Basis, QuantileBasis(x[i], basisArgs[i]...))
end
BasisMatrices = BasisMatrix.(Basis, x)
Differences = DifferenceMatrix.(Basis)

# Define GCV Function above

# Code from OptimizeGCVLambda
optimizer = LBFGS() # example choice for this standalone snippet; OptimizeGCVLambda takes the optimizer as an argument
k = length(BasisMatrices)
lower = zeros(k)
upper = fill(Inf, k)
initial_lambda = fill(1.0, k)

res = Optim.optimize(
    lambdas -> GCV(lambdas, BasisMatrices, y, Differences), 
    lower, upper, initial_lambda, 
    Fminbox(optimizer)
)
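
If that converges, the optimized penalties come out via Optim.minimizer (just a usage note on the snippet above):

λ_opt = Optim.minimizer(res)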
hendersontrent commented 5 months ago

@yahrMason yep, I can reproduce the singular matrix error now too. I also think we should keep it manual/close to the text, at least to start with!