rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/

Bias correction when back-transforming from a log #99

Closed: rvlenth closed this issue 5 years ago

rvlenth commented 5 years ago

Received from a user...

I was helping one client with a linear mixed model where we had to log-transform a response variable. Her advisor asked if we used bias correction for the back-transform of the estimates. He referenced

Zeng, Wei Sheng and Tang, Shou Zheng. Bias Correction in Logarithmic Regression and Comparison with Weighted Regression for Nonlinear Models. Available from Nature Precedings http://dx.doi.org/10.1038/npre.2011.6708.1 (2011) http://precedings.nature.com/documents/6708/version/1

which also references

Baskerville, G. L. (1972). Use of Logarithmic Regression in the Estimation of Plant Biomass. Canadian Journal of Forest Research, 2(1), 49–53. https://doi.org/10.1139/x72-009 https://www.nrcresearchpress.com/doi/10.1139/x72-009#.XK38T9h7nIU

Both recommend the correction factor E(exp(X)) = exp(mu + sigma^2/2). When I brought it up at our weekly staff meeting, one of our faculty consultants derived that same correction factor on the board without having seen the paper. So perhaps it is something we should be doing when providing point estimates for models with a log-transformed response?
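[Note added: the correction factor is easy to check by simulation. A minimal sketch, in Python only so it is self-contained; the thread's own examples are in R.]

```python
import math
import random

# Monte Carlo check of the correction factor: if X ~ Normal(mu, sigma^2),
# then E(exp(X)) = exp(mu + sigma^2/2), not exp(mu).
random.seed(1)
mu, sigma = 1.0, 0.5
n = 200_000
mean_expx = sum(math.exp(random.gauss(mu, sigma)) for _ in range(n)) / n

naive = math.exp(mu)                     # straight back-transform, ~2.72
corrected = math.exp(mu + sigma**2 / 2)  # bias-corrected back-transform, ~3.08
# mean_expx lands near `corrected`, noticeably above `naive`
```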

It appears that emmeans with type = "response" on a model with a log transform does a straight back-transformation, exp(mu), without implementing this correction. This also happens in JMP, which by default back-transforms the least squares means if you transform the response within the model platform.

Can you comment on whether you recommend this correction factor, and whether you have considered making it an option in emmeans?

rvlenth commented 5 years ago

This seems related to the fact that the lognormal distribution with parameters mu and sigma has mean exp(mu + sigma^2/2). At first glance this troubles me, and it makes me wonder what it is we are trying to estimate and whether this is the right correction. Suppose, for example, that I have 5 iid observations Y_1,...,Y_5 from a lognormal distribution. I compute mu-hat as the average of the log Y_i; this estimate has variance sigma^2/5, so exp(mu-hat) has expectation exp(mu + sigma^2/10). If I then apply this bias correction, the corrected estimate exp(mu-hat + sigma^2/2) has expectation exp(mu + sigma^2/10 + sigma^2/2) = exp(mu + 6*sigma^2/10).
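[Note added: the sampling-variance part of that argument can also be verified numerically. A quick sketch under the same assumptions, n = 5 iid lognormal observations, again in Python just to be self-contained.]

```python
import math
import random

# With n = 5, mu-hat = mean(log Y_i) has variance sigma^2/5, so
# E(exp(mu-hat)) = exp(mu + sigma^2/10) -- a smaller bias than exp(mu + sigma^2/2).
random.seed(2)
mu, sigma, n = 0.0, 1.0, 5
reps = 200_000
avg = sum(math.exp(sum(random.gauss(mu, sigma) for _ in range(n)) / n)
          for _ in range(reps)) / reps
theory = math.exp(mu + sigma**2 / (2 * n))  # exp(0.1), about 1.105
```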

Anyway, what is it I am trying to estimate? exp(mu)? exp(mu + sigma^2/2) = E(Y)? We have the same positive multiplicative bias either way.

rvlenth commented 5 years ago

Just a note to say I ran into this same topic in a whole different context -- in trying to understand some methods used with the MCMCglmm package. See its associated "course notes" vignette, https://cran.r-project.org/web/packages/MCMCglmm/vignettes/CourseNotes.pdf, and look at section 2.5, in particular, (2.11) on p.46. That source seems to suggest that the correction is needed because of non-normality, but I think it's really because Poisson regression is usually done with a log link. In that context, if we want to predict expected count "after marginalising with respect to the random effects" (phraseology from these notes), we need to apply the same correction that you described.

This both confuses and enlightens me. From my usual point of view, I want to transform a response so that I get a better model; and if the predictors are factors, as is typically true in contexts where emmeans is used, I think of the model in terms of parameters that are cell means and the like. If a log transformation was used, then from that perspective I am comfortable with back-transforming those parameters; and no correction is needed, as I am just translating the results into understandable units.

But the other perspective is that I may want to estimate how much money I'll make with a particular set of treatments (sales strategies or whatever), and I transformed the sales data using logs. Then I do want to apply the correction, else I'll be under-estimating my average sales. Here's a simple, illustrative example with simulated data:

> y = exp(rnorm(100))   ### my "data"
> mean(y)   ### the mean of my data
[1] 1.517303

> exp(mean(log(y)))   ### uncorrected back-transformed mean
[1] 0.9817702

> exp(mean(log(y)) + .5*var(log(y)))   ### corrected back-transformed mean
[1] 1.507465

I'm thinking it would indeed be nice to offer ways of correcting for back-transforming when one really wants to estimate the mean response. How best to do that is another question, as there are a lot of transformations besides logs. (Take a look at how icky it gets with a logit transformation, equations (2.13) and (2.14) in those same notes). If you have any suggestions, I'm all ears. Right now, I'm thinking I might want to at least allow the user to specify the error variance to correct for; and hope I can do some kind of Taylor approximation or simulation or something to get at the needed correction for an arbitrary transformation.
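[Note added: one generic route for an arbitrary transformation is a second-order (delta-method) Taylor adjustment, E(h(X)) ≈ h(mu) + 0.5*h''(mu)*sigma^2, where h is the inverse transformation. A hypothetical sketch of that idea, not emmeans's actual implementation; the function name and the numerical second derivative are my own illustration.]

```python
import math

def bias_adjusted_backtransform(linkinv, mu, sigma2, h=1e-5):
    """Second-order Taylor (delta-method) bias adjustment:
    E[linkinv(X)] ~= linkinv(mu) + 0.5 * linkinv''(mu) * sigma2,
    with linkinv'' approximated by a central finite difference."""
    d2 = (linkinv(mu + h) - 2 * linkinv(mu) + linkinv(mu - h)) / h**2
    return linkinv(mu) + 0.5 * d2 * sigma2

# Log case: exp''(mu) = exp(mu), so the adjusted value is
# exp(mu) * (1 + sigma2/2), the leading terms of exp(mu + sigma2/2).
adj = bias_adjusted_backtransform(math.exp, 1.0, 0.25)

# Logit case: the same machinery applies to the inverse logit,
# with no closed-form correction needed.
def expit(x):
    return 1 / (1 + math.exp(-x))

adj_logit = bias_adjusted_backtransform(expit, 0.5, 0.25)
```

The appeal of this approach is that only the inverse transformation itself has to be supplied; the icky per-transformation algebra is replaced by a numerical derivative.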

rvlenth commented 5 years ago

Some bias-correction methods are now implemented. There's no real need to keep this open, but more may be coming.