rcorty / dglm

Double generalized linear models

possible error in algorithm #4

Open rcorty opened 5 years ago

rcorty commented 5 years ago

e-mailed in by Andrew Weiss:

```r
zm <- as.vector( eta + (y - mu) / family$mu.eta(eta) )
wm <- as.vector( eval(family$variance(mu)) * weights / phi )
```

should be replaced with

```r
zm <- eta + (y - mu) / family$mu.eta(eta)
fam.wt <- expression( weights / family$variance(mu) )
wm <- eval( fam.wt ) / phi * family$mu.eta(eta)^2
```
rcorty commented 5 years ago

**Comments on dglm**


Andy Weiss, 6 August 2017

I have had a look at the code you sent me earlier in the week, and at dglm more generally (using a version of the code downloaded from CRAN).

  1. dglm works for the linear model with heteroskedasticity in your example. The problem in your code is the summary command for the dispersion model:

```r
b_ses[rep_idx] <- coef(summary(fitted_dglm$dispersion.fit))['b', 'Std. Error']
```

The summary command there calls a version of summary that does not impose the condition dispersion = 2. Instead, it estimates the dispersion, giving a result $\approx$ 4 in this model, about twice as big as it should be. Hence, the standard errors in b_ses are too big by a factor of about sqrt(2).
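The size of the inflation follows directly from how standard errors scale with the dispersion; a minimal arithmetic sketch (the SE value is illustrative):

```r
# Standard errors scale with the square root of the dispersion, so an
# estimated dispersion of ~4 where the true value is 2 inflates every
# SE by sqrt(4 / 2) = sqrt(2) ~ 1.414.
se_true_dispersion <- 0.10            # illustrative SE at dispersion = 2
inflation <- sqrt(4 / 2)
se_estimated_dispersion <- se_true_dispersion * inflation
se_estimated_dispersion / se_true_dispersion   # sqrt(2)
```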

Replacing the above line with:

```r
b_ses_new[rep_idx, 1] <- summary(fitted_dglm)$dispersion.summary$coefficients[4]
b_ses_new[rep_idx, 2] <- summary(fitted_dglm)$dispersion.summary$coefficients[3]
```

does the trick.
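For readers following along, the linear indices [3] and [4] work because R stores the coefficient matrix column-major, with standard errors in column 2; a base-R illustration of the same indexing (no dglm needed, using an arbitrary two-coefficient glm):

```r
# In a 2 x 4 summary coefficient matrix, column-major storage puts the two
# estimates at linear indices [1], [2] and the two standard errors at [3], [4]
# -- the positions used on dispersion.summary$coefficients above.
fit <- glm(mpg ~ wt, data = mtcars)
cf <- summary(fit)$coefficients   # rows: (Intercept), wt; col 2: Std. Error
cf[3] == cf["(Intercept)", "Std. Error"]   # TRUE
cf[4] == cf["wt", "Std. Error"]            # TRUE
```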

I also tested with a slightly different version of the model,

```r
a <- sample(x = c(0, 1, 2), size = num_obs, replace = TRUE)
b <- 2/3 * a + 1/3 * sample(x = c(0, 1, 2), size = num_obs, replace = TRUE)
```

mostly to make the variances correlated with X ('a' in your notation). That matters because the usual OLS standard errors are fine when there is no correlation between the variances and X, so any problems are less likely to show up in that case.
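A minimal self-contained version of this design (the value of `num_obs` and the error scale are assumptions for illustration):

```r
set.seed(1)
num_obs <- 500
a <- sample(x = c(0, 1, 2), size = num_obs, replace = TRUE)
b <- 2/3 * a + 1/3 * sample(x = c(0, 1, 2), size = num_obs, replace = TRUE)
# error variance depends on b, which is correlated with a by construction,
# so the usual OLS standard errors for the coefficient on a are unreliable
y <- 1 + 3 * a + rnorm(num_obs, sd = exp(b / 2))
cor(a, b)   # strongly positive: the variances are correlated with the regressor
```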

  2. dglm does not work when a log link is used in the mean model, as in

```r
fitted_dglm <- dglm1(formula = y ~ a,
                     dformula = ~ b,
                     family = stats::gaussian(link = 'log'),
                     dlink = 'log',
                     method = 'ml',
                     mustart = expected_value)
```

This corresponds to the model

$$y = \exp(x'\beta) + \varepsilon, \qquad \varepsilon \text{ normally distributed} \tag{1}$$

This is a bit of a strange model, but is part-way to the Gamma model I started with.

In the log-link model, something like the `mustart = expected_value` argument above is needed to stop dglm from taking log(y) when it gets initial values for mu (via `eval(family$initialize)`).
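This can be seen directly from the family object: the Gaussian `initialize` expression refuses to start when the link is log and any y <= 0, because its default starting value is mustart <- y. A sketch (the variables below mimic the environment that glm.fit sets up before evaluating `initialize`):

```r
# gaussian()'s initialize expression stops when link = 'log' and any y <= 0,
# since the default mustart <- y would put log(y) out of range
family <- stats::gaussian(link = "log")
y <- c(2, -1, 3)                      # a Gaussian response can be negative
nobs <- length(y)
etastart <- start <- mustart <- NULL
res <- try(eval(family$initialize), silent = TRUE)
inherits(res, "try-error")            # TRUE: "cannot find valid starting values"
# glm.fit (and hence dglm) skips this failure when the caller supplies mustart
```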

The calculation of `mfit$null.deviance` towards the end of dglm also wants to take log(y); this is avoided by adding `mustart = mustart` to the `glm.fit` call that calculates the deviance.

More importantly, I believe that there is an error in the coding of the scoring algorithm, so that the algorithm does not always work properly. The current algorithm sometimes diverges in the log-link model, and when it does converge, it does not necessarily give the MLEs. In the Gamma model I started with, it did give the MLEs (which in that model are just sample means), but the coding error showed up in the standard errors. In the linear model with log link and heteroskedastic errors, it shows up in the standard errors of the mean parameters and in the parameter estimates of the dispersion model.

In the log-link model in equation (1) above, the log-likelihood is the sum over observations of

$$\log f = \frac{y \mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \log \sigma \tag{2}$$

where $\mu = \exp(x'\beta)$ and $\sigma^2 = \exp(z'\gamma)$, and I have dropped $i$ subscripts. Hence, the contribution to the gradient vector is

$$\frac{\partial \log f}{\partial \beta} = \frac{1}{\sigma^2} (y - \mu) \frac{\partial \mu}{\partial \eta} \frac{\partial \eta}{\partial \beta} \tag{3}$$

where $\eta = x'\beta$ and $\frac{\partial \eta}{\partial \beta} = x$, and the contribution to the expected (negative) Hessian is

$$-\frac{\partial^2 \log f}{\partial \beta \, \partial \beta'} = \frac{1}{\sigma^2} \left(\frac{\partial \mu}{\partial \eta}\right)^2 \frac{\partial \eta}{\partial \beta}\frac{\partial \eta}{\partial \beta'} = \frac{1}{\sigma^2}\left(\frac{\partial \mu}{\partial \eta}\right)^2 x x' \tag{4}$$
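As a sanity check, the gradient in (3) can be verified against a finite-difference derivative of the log-density for one observation of the log-link Gaussian model (the values of x, beta, sigma, y below are arbitrary):

```r
# One observation: mu = exp(x'beta), y ~ N(mu, sigma^2)
x <- c(1, 0.5); beta <- c(0.2, -0.3); sigma <- 1.5; y <- 1.1
loglik <- function(b) dnorm(y, mean = exp(sum(x * b)), sd = sigma, log = TRUE)
mu <- exp(sum(x * beta))
grad_analytic <- (y - mu) / sigma^2 * mu * x   # (3), with dmu/deta = mu for log link
h <- 1e-6
grad_numeric <- vapply(1:2, function(j) {
  e <- replace(numeric(2), j, h)
  (loglik(beta + e) - loglik(beta - e)) / (2 * h)   # central difference
}, numeric(1))
max(abs(grad_analytic - grad_numeric))         # near zero: the two agree
```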

The relevant code in dglm is

```r
zm <- eta + (y - mu) / family$mu.eta(eta)
fam.wt <- expression( weights * family$variance(mu) )
wm <- eval( fam.wt ) / phi
mfit <- stats::lm.wfit(X, zm, wm, method = "qr", singular.ok = TRUE)
```

In the gaussian family, variance(mu) = 1, so the X' W zm vector in that lm.wfit (where W = diag(wm)) corresponds to a gradient contribution of

$$\frac{1}{\sigma^2} \, \frac{(y - \mu)}{\partial \mu / \partial \eta} \, \frac{\partial \eta}{\partial \beta} \tag{5}$$

where $\frac{\partial \mu}{\partial \eta}= \mu$, which differs from (3); and the X' W X matrix in that lm.wfit corresponds to a Hessian contribution of

$$\frac{1}{\sigma^2} \, x x'$$

which differs from (4). The same code is used in the initialization steps, written as

```r
zm <- as.vector( eta + (y - mu) / family$mu.eta(eta) )
wm <- as.vector( eval(family$variance(mu)) * weights / phi )
```

I believe the code should be

```r
zm <- eta + (y - mu) / family$mu.eta(eta)
fam.wt <- expression( weights / family$variance(mu) )
wm <- eval( fam.wt ) / phi * family$mu.eta(eta)^2
```

which corresponds to the algorithm in Smyth (1989).
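The difference between the two weight formulas is easiest to see for the Gaussian family with log link, where variance(mu) = 1 and mu.eta(eta) = mu (a sketch; the eta and phi values are arbitrary):

```r
family <- stats::gaussian(link = "log")
eta <- c(0.5, 1.0, 2.0)
mu <- family$linkinv(eta)
weights <- rep(1, 3)
phi <- rep(2, 3)
# current dglm code: weights * variance(mu) / phi  ->  just 1/phi
wm_current <- weights * family$variance(mu) / phi
# proposed code: weights / variance(mu) / phi * mu.eta(eta)^2  ->  mu^2/phi
wm_proposed <- weights / family$variance(mu) / phi * family$mu.eta(eta)^2
all.equal(wm_current, 1 / phi)       # TRUE: no link information at all
all.equal(wm_proposed, mu^2 / phi)   # TRUE: the standard IRLS working weights
```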

Note that in the model with identity link, $\frac{\partial \mu}{\partial \eta} = 1$ (`family$mu.eta(eta)` returns 1), and the current code in dglm works.

The expected value of the gradient vector based on equation (5) is equal to zero at the true parameters,

$$E \left[ \frac{1}{\sigma^2} \, \frac{(y - \mu)}{\partial \mu / \partial \eta} \, x \right] = 0,$$

so the algorithm as coded still gives consistent estimators of the parameters of the mean equation when link = log. But the estimator is less efficient than the MLE and the standard errors based on the X' W X matrix are wrong when $\frac{\partial \mu}{\partial \eta} \ne 1$.

The problem with the lm.wfit may transfer into the dispersion model via the fitted values

```r
eta <- mfit$fitted.values
mu <- family$linkinv(eta + offset)
d <- family$dev.resids(y, mu, weights)
```

In the gaussian model, small changes in the fit of the mean model may translate into large changes in $d = (y - \mu)^2$ and hence affect the dispersion parameter estimates. For example, with the variable called 'a' sampled from c(0, 1, 2) and mean_effect = 3, max(exp(x'beta)) = max(exp(a * mean_effect)) = exp(2 * 3) = exp(6) $\approx$ 403, and the algorithm as coded is poorly behaved.

Also, the dispersion estimate in summary.dglm uses the residuals from the lm.wfit, weighted by wd:

```r
sum((object$weights * object$residuals^2)[object$weights > 0]) / df.r
```

In the current code, that translates into

```r
sum( (1/phi) * ((y - mu) / family$mu.eta(eta))^2 )
```

a statistic that has no particular meaning; whereas in the correct code, it translates into

```r
sum( (1/phi * family$mu.eta(eta)^2) * ((y - mu) / family$mu.eta(eta))^2 )
  = sum( (1/phi) * (y - mu)^2 )
```

which, after dividing by df.r, is $\approx$ 1.

Then the covariance matrix of the MLE of the mean parameters is close to the inverse of the X' W X matrix in the lm.wfit, i.e. (X' W X)$^{-1}$ with

```r
W = diag(1/phi * family$mu.eta(eta)^2)
```

That corresponds to the inverse of the Hessian in equation (4).
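This can be checked against base R: for a gaussian log-link glm, the covariance matrix that glm itself reports equals (X' W X)$^{-1}$ with exactly this W (a sketch on simulated data; the dispersion is taken from summary.glm, and the response is kept positive so the fit starts cleanly):

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- exp(0.5 + 0.8 * x) + rnorm(n, sd = 0.3)    # all y > 0 here
fit <- glm(y ~ x, family = gaussian(link = "log"))
X <- model.matrix(fit)
eta <- as.vector(X %*% coef(fit))
phi <- summary(fit)$dispersion
W <- diag(gaussian(link = "log")$mu.eta(eta)^2 / phi)
V <- solve(t(X) %*% W %*% X)                    # (X' W X)^{-1}
all.equal(unname(V), unname(vcov(fit)))         # agree to numerical tolerance
```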

The same issues arise in the Gamma model, and I believe that the revised code works in that model.