heal-research / HEAL.NonlinearRegression

Fit and evaluate nonlinear regression models.

Hessian of Gaussian NegLogLikelihood - missing Hessian of the model #2

Closed · folivetti closed this 1 year ago

folivetti commented 1 year ago

In this line:

https://github.com/heal-research/HEAL.NonlinearRegression/blob/caae30bdf5a3de36fe97428d22df0c7040124494/HEAL.NonlinearRegression/Util.cs#L176

I think it should be something like:

```csharp
hess[j, k] += (yJac[i, j] * yJac[i, k] + res * hessModel[j, k]) / (sErr * sErr);
```
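For context, a sketch of how the full accumulation could look with this correction. The names (`yJac`, `hessModel`, `sErr`) follow the snippet above; `hessModel` is assumed to hold the model's second derivatives per observation, and `res` is assumed to be the signed residual $f(x_i, \theta) - y_i$, which matches the `+` sign in the proposed line:

```csharp
// Sketch only (not the actual Util.cs code). Hessian of the Gaussian NLL,
//   -log L = sum_i res_i^2 / (2 sErr^2) + const,   res_i = f(x_i, theta) - y_i.
// Differentiating twice w.r.t. (theta_j, theta_k) gives
//   sum_i (df_i/dtheta_j * df_i/dtheta_k + res_i * d2f_i/(dtheta_j dtheta_k)) / sErr^2.
static void AccumulateNllHessian(double[] y, double[] yPred,
    double[,] yJac,        // yJac[i, j]         = df_i / dtheta_j
    double[][,] hessModel, // hessModel[i][j, k] = d2f_i / (dtheta_j dtheta_k)
    double sErr, double[,] hess) {
  int m = y.Length, n = yJac.GetLength(1);
  for (int i = 0; i < m; i++) {
    var res = yPred[i] - y[i]; // assumed sign convention (prediction minus observation)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
        hess[j, k] += (yJac[i, j] * yJac[i, k] + res * hessModel[i][j, k]) / (sErr * sErr);
  }
}
```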
gkronber commented 1 year ago

Yes, I thought the linear approximation would be good enough, but I'm going to include the Hessian, at least to be able to analyse the impact.

gkronber commented 1 year ago

I have extended the term with the Hessian of the model, but now the result differs from the result produced by R.

I compared against the Puromycin model in R: with the original code, the Laplace approximation produces exactly the same result as nls.

Now the question is why R uses only the simpler approximation and whether we should extend the code.

folivetti commented 1 year ago

From their source code they seem to use only the gradient information: https://github.com/SurajGupta/r-source/blob/master/src/library/stats/R/nls.R

I'm trying to find a source on whether this is correct or whether we should really use the Hessian.

gkronber commented 1 year ago

In "Bayesian Data Analysis" (A. Gelman et al.) and "In all Likelihood" (Pawitan) the "observed Fisher information" is clearly defined as $-\frac{\partial^2}{\partial \theta \partial \theta} \log p(\theta | x)$.

The "expected Fisher information" is defined as the variance of the score function (gradient of log likelihood) $E\theta [(\frac{\partial}{\partial \theta} \log p(\theta | x))^T (\frac{\partial}{\partial \theta} \log p(\theta | x))]$ which (because of the expectation) can be shown to be equal to $-E\theta ( \frac{\partial^2}{\partial \theta \partial \theta} \log p(\theta | x))$ for certain "generality conditions".

Also see https://en.wikipedia.org/wiki/Observed_information and https://en.wikipedia.org/wiki/Fisher_information.
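For the Gaussian negative log-likelihood discussed in this issue, the two quantities differ exactly by a residual-weighted model-Hessian term. Writing $r_i = f(x_i, \theta) - y_i$ (a sketch, using the sign convention from the snippet above):

$$-\frac{\partial^2}{\partial \theta_j \partial \theta_k} \log L = \frac{1}{\sigma^2} \sum_i \left( \frac{\partial f_i}{\partial \theta_j} \frac{\partial f_i}{\partial \theta_k} + r_i \frac{\partial^2 f_i}{\partial \theta_j \partial \theta_k} \right)$$

At the true parameters $E[r_i] = 0$, so the second term vanishes in expectation: the expected Fisher information reduces to $J^T J / \sigma^2$ (what nls uses), while the observed Fisher information keeps the model-Hessian term.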

In the Laplace approximation we approximate the peak of the likelihood via a Taylor expansion of the relative likelihood around the MLE, which implies using the inverse of the observed Fisher information matrix as the covariance matrix of the Gaussian approximation (see Bayesian Data Analysis).
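Concretely, the second-order expansion around the MLE $\hat\theta$ (the gradient term vanishes at the maximum) is

$$\log p(\theta | x) \approx \log p(\hat\theta | x) - \tfrac{1}{2} (\theta - \hat\theta)^T I(\hat\theta) (\theta - \hat\theta),$$

i.e. a Gaussian with mean $\hat\theta$ and covariance $I(\hat\theta)^{-1}$, where $I(\hat\theta)$ is the observed Fisher information.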

For a linear model the Hessian of the model is zero everywhere; for a non-linear model it is typically non-zero, which implies that we should use the full observed Fisher information matrix, including the Hessian of the model.

In any case, the difference is small. It was just nice to have 100% agreement with R.

To make a final decision we should make plots of the Laplace approximation likelihood (both implementations) and the likelihood profile to check which calculation matches the profile better.
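A sketch of how one such curve could be computed (hypothetical names; `fisher` would be either $J^T J / \sigma^2$ or the full observed information from above):

```csharp
// Sketch only: quadratic (Laplace) approximation of the NLL at a point theta,
// for plotting against the profile likelihood. nllAtMle is the NLL at the MLE
// thetaHat; fisher is whichever information matrix is being tested.
static double LaplaceNll(double[] theta, double[] thetaHat, double[,] fisher, double nllAtMle) {
  int n = thetaHat.Length;
  double q = 0.0;
  for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++)
      q += (theta[j] - thetaHat[j]) * fisher[j, k] * (theta[k] - thetaHat[k]);
  return nllAtMle + 0.5 * q; // the NLL is minimized at the MLE, hence the plus sign
}
```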

folivetti commented 1 year ago

Nice. This seems to correspond to most of the literature I've read so far. I wonder why R uses the expected Fisher information. One reason could be that they are satisfied with the Hessian approximation (J'J) because they don't do symbolic differentiation of the expression. And, as you noticed, the difference is so small that it doesn't matter much.

For Puromycin, though, using J'J leads to results closer to the profile likelihood.

gkronber commented 1 year ago

I made the comparison for Puromycin and Mammography (logistic regression).

Both approximations deviate quite strongly from the profile likelihood; I cannot say that one or the other is better. The problem is mainly skewed likelihoods, because the approximations are symmetric: if the approximation is close on one side, it will be further away on the other.

Puromycin:

[figure: Laplace approximations vs. profile likelihood]

Mammography (simple and full approximations are the same):

[figures: Laplace approximations vs. profile likelihood]

folivetti commented 1 year ago

Given that the literature doesn't seem to have a consensus, and nls gives no rationale for why it uses J'J, we might just let the user choose via a parameter to our programs. For our paper we may even show both (but if we pick one, I'd say we should pick the Hessian version, as it is supported by the Gelman book).
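A minimal sketch of what such a switch could look like (hypothetical, not the actual HEAL.NonlinearRegression API):

```csharp
// Hypothetical option: which information matrix the Laplace approximation uses.
public enum FisherApproximation {
  ExpectedJtJ,  // J'J / sigma^2, matches R's nls
  ObservedFull  // (J'J + sum_i r_i H_i) / sigma^2, includes the model Hessian
}
```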

gkronber commented 1 year ago

I also prefer the full Hessian.