statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python

http://www.statsmodels.org/devel/

BSD 3-Clause "New" or "Revised" License

SUMM/ENH: measurement errors with known distribution #3187

Open josef-pkt opened 8 years ago

josef-pkt commented 8 years ago

(another topic where I don't have an overview and no ready answers)

#1097 for linear model (simple cross-sectional)

scipy also has orthogonal distance regression (where I never got beyond one toy example)
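A minimal sketch of what the scipy route looks like, as another toy example on simulated data. The data, the true parameters (intercept 1, slope 2) and the error standard deviations are assumptions for illustration; `scipy.odr` weights both x and y by the supplied standard deviations, so it does not allow for extra equation error.

```python
import numpy as np
from scipy import odr

rng = np.random.default_rng(0)
n = 200
x_true = np.linspace(0.0, 10.0, n)
y_true = 1.0 + 2.0 * x_true                 # assumed true line
sx = np.full(n, 0.5)                        # known measurement error std. dev. in x
sy = np.full(n, 1.0)                        # known measurement error std. dev. in y
x_obs = x_true + sx * rng.normal(size=n)
y_obs = y_true + sy * rng.normal(size=n)

def line(beta, x):
    """Linear model in the (beta, x) argument order that scipy.odr expects."""
    return beta[0] + beta[1] * x

data = odr.RealData(x_obs, y_obs, sx=sx, sy=sy)
fit = odr.ODR(data, odr.Model(line), beta0=[0.0, 1.0]).run()

# OLS of y_obs on x_obs for comparison; its slope tends to be attenuated
slope_ols = np.polyfit(x_obs, y_obs, 1)[0]
print("ODR params:", fit.beta, "std. errors:", fit.sd_beta)
print("OLS slope:", slope_ols)
```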

in the case of descriptive statistics: some comments https://github.com/statsmodels/statsmodels/issues/3186#issuecomment-244441367

Most repeated measures, cluster and panel models still assume that the exog are observed without error, but a random intercept plays a similar role for repeated measurements of endog. Multivariate models, including models like VAR, have noise in all variables. Factor analysis, latent variables and similar (statespace models, systems of equations) are tools at least for the everything-is-normally-distributed case.

special case: deconvolution with known second distribution.

How do these things fit together, and what are the functions and models that we need?

application area mostly engineering and natural sciences with "hard" measurements and known measurement error of (real) instruments. (I think it doesn't show up much in social sciences because there is a huge amount of "noise" everywhere and "endogeneity" problems are the main analogue of measurement error.)

semi-random related areas

computer simulation with known simulation error, e.g. bootstrap. (like surface fitting for obtaining the asymptotic distribution or p-values)
sampling from a finite population, survey sampling: weighting depends on sampling scheme to recover the noisily observed population statistics. (but weight is observation and not variable dependent)
item response models: multiple measurements on underlying latent factors, (I don't know much about this)

josef-pkt commented 8 years ago

These two books look good: Fuller for linear models, and Carroll et al. for GLM and similar. Carroll et al. have Monte Carlo averaged estimating equations (using random complex numbers), and conditional corrected score functions that look similar to control functions for endogeneity correction in econometrics.

Wayne A. Fuller: Measurement Error Models 1987

Raymond J. Carroll, David Ruppert, Leonard A. Stefanski, Ciprian M. Crainiceanu 2006 Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition

most articles that I looked at are not very exciting (which the citation counts reflect, but might be interesting enough for some examples or special cases).

josef-pkt commented 8 years ago

Dear, Keith B. G., Martin L. Puterman, and Annette J. Dobson. 1997. “Estimating Correlations from Epidemiological Data in the Presence of Measurement Error.” Statistics in Medicine 16 (19): 2177–89. doi:10.1002/(SICI)1097-0258(19971015)16:192177::AID-SIM6463.0.CO;2-N.

edit: the following is the wrong reference, the above is the correct one:
Molarius, Anu, Richard W. Parsons, Annette J. Dobson, Alun Evans, Stephen P. Fortmann, Konrad Jamrozik, Kari Kuulasmaa et al. "Trends in cigarette smoking in 36 populations from the early 1980s to the mid-1990s: findings from the WHO MONICA Project." American Journal of Public Health 91, no. 2 (2001): 206.

I haven't looked much at the multivariate case or MLE in Fuller. The above article has only 6 citations in Google but looks fine for the bivariate normal case (without equation error). It's more explicit on estimating correlation and has comments on optimization problems in the appendix. Fuller has similar cases for regression equations without equation error but with known measurement error variance in y. (but I didn't look closely, and Fuller is focused on estimating \beta and not the other parameters like correlation)

josef-pkt commented 8 years ago

Spiegelman, Donna. "Approaches to uncertainty in exposure assessment in environmental epidemiology." Annual review of public health 31 (2010): 149.

An interesting survey or review article pointing out the importance of measurement error correction. The examples are for estimating the effect of a continuous exposure/treatment on binary outcomes (odds ratio or risk ratio for a discrete change in the continuous treatment). It uses regression calibration (there is one chapter in the Carroll et al. book, but I didn't read it), i.e. trying to get the relationship between true exposure and exposure with measurement error (surrogate) from a validation study.

(aside: I still find the assumptions-theorem-proof-examples pattern by econometricians and some statisticians easier to follow than when everything is mixed together.)

josef-pkt commented 8 years ago

(and maybe the last one in this series) a recent publication with "known" authors that combines measurement errors and misclassification of binary variables (not read, but maybe a starting point for recent literature, and uses several correction methods as alternatives). It is for GLM.

Yi, Grace Y., Yanyuan Ma, Donna Spiegelman, and Raymond J. Carroll. 2015. “Functional and Structural Methods With Mixed Measurement Error and Misclassification in Covariates.” Journal of the American Statistical Association 110 (510): 681–96. doi:10.1080/01621459.2014.922777.

unrelated to this reference:
multivariate normal: Stata also has measurement error models in SEM.
Dear et al (1997) (edited reference in earlier comment) looks a bit "fishy" when the estimated correlation is 1 but the scatter doesn't look close to a line.
See also the following for criticizing orthogonal regression because it ignores equation error:
Carroll, R. J., and David Ruppert. 1996. “The Use and Misuse of Orthogonal Regression in Linear Errors-in-Variables Models.” The American Statistician 50 (1): 1–6. doi:10.2307/2685035.
A similar section is in the book Carroll et al. 2006.
(Gaussian/normality assumptions are fine in many cases, but in other cases the results are not robust to misspecification, e.g. two-sample Bartlett test for variance equality. What if your model is not compatible with the data, e.g. correlation/covariance matrix is not positive definite, and it's not just noise?)

other packages: Stata also has eivreg for errors in variables with specified reliabilities (measurement error variance as fraction of observed variable variance) but it looks pretty basic. I haven't checked what's available in R. Spiegelman (2010) has SAS macros for her regression calibration in logit/log-binomial models. Spiegelman was pointing out lack of software and lack of training as one reason for the low use of measurement error models in some fields like environmental epidemiology.

josef-pkt commented 8 years ago

Yi et al 2015 have a good list of references and list two more textbooks in addition to Fuller and Carroll et al. Buonaccorsi 2010 is the most recent.

(I only read the setup in the Yi et al paper) The assumption is that there is an external validation sample from which we can estimate, with logit and linear regression, the conditional distribution that includes the error process. In the main sample the variable with measurement error can then be "replaced" with the conditional error-corrected version (kind of). That starts to sound very similar to the instrumental variables problem, where we have an extra IV regression instead of the validation sample regression.
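A minimal sketch of the "replace by the calibrated version" idea for the plain linear case, not the method of Yi et al. The simulated data, the sample sizes and the helper names `simulate` and `ols` are assumptions for illustration. The second-stage OLS standard errors would ignore the uncertainty of the calibration step, so only the point estimates are compared.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    x = rng.normal(size=n)                   # true covariate
    w = x + rng.normal(scale=0.8, size=n)    # surrogate with measurement error
    y = 1.0 + 2.0 * x + rng.normal(size=n)   # outcome depends on the true x
    return x, w, y

def ols(exog, endog):
    """OLS coefficients (constant first) for a single regressor."""
    design = np.column_stack([np.ones(len(endog)), exog])
    return np.linalg.lstsq(design, endog, rcond=None)[0]

# external validation sample: both x and w are observed
x_val, w_val, _ = simulate(2000)
# main sample: only w and y are observed
_, w_main, y_main = simulate(5000)

gamma = ols(w_val, x_val)                 # calibration model E[x | w]
x_hat = gamma[0] + gamma[1] * w_main      # calibrated version of w

beta_naive = ols(w_main, y_main)          # slope attenuated toward zero
beta_rc = ols(x_hat, y_main)              # regression calibration estimate
print("naive:", beta_naive, "calibrated:", beta_rc)
```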

In any case, measurement error models look like a topic as big as (outlier) robust models. (and I'm not going to read large parts of 4 textbooks now)

Carroll and Buonaccorsi books are both focused on regression models and don't have multivariate statistics (unless I missed those in browsing)

what's feasible to get started: method of moments correction in the style of Fuller. Fuller still seems to be the main reference for the linear case. Buonaccorsi 2010 refers to it with useful comments to put it in perspective, but without duplicating Fuller's presentation and derivation. Buonaccorsi has the relevant parts a bit spread out across chapters and sections.
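A minimal sketch of a method of moments correction with a known measurement error covariance, assuming classical additive errors that are uncorrelated with the true regressors and with the equation error. The function name `mm_corrected_ols` and the simulated data are hypothetical; this is the textbook moment idea, not a transcription of Fuller's formulas (no small sample correction).

```python
import numpy as np

def mm_corrected_ols(w, y, cov_merr):
    """Moment-corrected fit; w is (n, k) observed exog, cov_merr is (k, k) known."""
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    n = w.shape[0]
    wc = w - w.mean(axis=0)         # centering handles the error-free constant
    yc = y - y.mean()
    s_ww = wc.T @ wc / (n - 1)      # covariance of observed regressors
    s_wy = wc.T @ yc / (n - 1)
    # subtract the known measurement error covariance before solving
    slopes = np.linalg.solve(s_ww - cov_merr, s_wy)
    intercept = y.mean() - w.mean(axis=0) @ slopes
    return intercept, slopes

rng = np.random.default_rng(2)
n = 20_000
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
cov_merr = np.diag([0.5, 0.0])      # only the first regressor is noisy
w = x + rng.normal(size=(n, 2)) * np.sqrt(np.diag(cov_merr))
y = 0.5 + x @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=n)

intercept, slopes = mm_corrected_ols(w, y, cov_merr)
design = np.column_stack([np.ones(n), w])
slopes_ols = np.linalg.lstsq(design, y, rcond=None)[0][1:]
print("corrected:", slopes, "OLS:", slopes_ols)
```

In this setup the error in the first regressor also biases the OLS coefficient of the second, error-free regressor, because the two are correlated.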

handling external validation samples might fit better after having IV for GLM/LEF (and control function approaches).

finally: The above is "statistics", I saw some abstracts using GMM including GMM with higher moments and there is also still something more recent in econometrics (I'm not sure what the status is there). GMM might also be useful as a tool to combine estimating equations.

josef-pkt commented 8 years ago

and one of the finally

Yi, Grace Y., Yanyuan Ma, and Raymond J. Carroll. 2012. “A Functional Generalized Method of Moments Approach for Longitudinal Studies with Missing Responses and Covariate Measurement Error.” Biometrika 99 (1): 151–65. doi:10.1093/biomet/asr076.

linking up adjusted score functions (moment conditions) and IPW for missing endog, advertised as simple and without needing strong assumptions. The setup is GLM, and so similar to GEE, but uses GMM theory:
"Our method has a number of appealing properties: assumptions on the model are minimal, with none needed about the distribution of the mismeasured covariate; implementation is straightforward and its applicability is broad."

josef-pkt commented 8 years ago

some more details and links

Tom Wansbeek, Erik Meijer. 2000 "Measurement error and latent variables in econometrics" (book)
As we will sketch below, the smallest positive eigenvalue A of equation (5.22) is a consistent estimator of sigma. The properly scaled corresponding eigenvector converges in probability to (1, -beta')'. (not exact quote, incorrectly edited copy-paste errors)
page 101: relationship between univariate endog regression and eigenvalues or one-factor factor analysis. It shows up in several places, including the small sample correction in Fuller, but I haven't tried to figure out what the connection is.
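A small numerical illustration of the eigenvalue relationship in the simplest special case, which may differ from the book's equation (5.22): all variables (endog and exog) have uncorrelated measurement errors with the same variance, and there is no equation error. Under these assumptions the smallest eigenvalue of the covariance of the observed data estimates the error variance and its eigenvector is proportional to (1, -beta')'. All numbers are made up for the simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 50_000, 0.3
beta = np.array([1.5, -0.5])

x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 2.0]], size=n)
y = x @ beta                                  # exact relation, no equation error
z_true = np.column_stack([y, x])
# same measurement error variance in every column, errors uncorrelated
z_obs = z_true + rng.normal(scale=np.sqrt(sigma2), size=z_true.shape)

eigvals, eigvecs = np.linalg.eigh(np.cov(z_obs, rowvar=False))
v = eigvecs[:, 0]             # eigh sorts ascending: smallest eigenvalue first
v = v / v[0]                  # scale so that the first element is 1
beta_hat = -v[1:]             # eigenvector is proportional to (1, -beta')'
sigma2_hat = eigvals[0]       # smallest eigenvalue estimates the error variance
print(beta_hat, sigma2_hat)
```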

Stata had a project in 2003 joint with Carroll and Hardin http://www.stata.com/merror/ but I don't see what happened to it. It looks outdated and possibly incompatible with newer computers.

Carroll has a webpage for the textbook including some data http://www.stat.tamu.edu/~carroll/eiv.SecondEdition/index.php

josef-pkt commented 5 years ago

another simple stackoverflow example https://stackoverflow.com/questions/52291798/how-to-account-for-error-bars-in-a-linear-regression

josef-pkt commented 5 years ago

I found an old script file \scripts\corr_measerror.py from 2016-09-28 which seems to be the only code I wrote.

def corr_corrected(data, sigma_merr=None, var_merr=None, reliability=None,
                   method='simple', het_weights=None, ddof=1):
    """correct covariance and correlation for measurement error

    measurement error can be specified in three alternative ways

We could use the corrected moment matrices directly if we had the OLS/WLS from summary statistics #3901
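The script itself is not shown here, so the following is only a hypothetical sketch of what a covariance and correlation correction could look like. It is not the code from `corr_measerror.py`. The name `corr_corrected_sketch` is made up, it covers only two of the three ways of specifying the measurement error (variance or reliability), and it assumes measurement errors that are uncorrelated across variables and homoscedastic.

```python
import numpy as np

def corr_corrected_sketch(data, var_merr=None, reliability=None, ddof=1):
    """Subtract measurement error variances from the observed covariance."""
    data = np.asarray(data, dtype=float)
    cov_obs = np.cov(data, rowvar=False, ddof=ddof)
    var_obs = np.diag(cov_obs)
    if (var_merr is None) == (reliability is None):
        raise ValueError("specify exactly one of var_merr or reliability")
    if var_merr is None:
        # reliability = var(true) / var(observed)
        var_merr = (1.0 - np.asarray(reliability, dtype=float)) * var_obs
    cov_corr = cov_obs - np.diag(np.broadcast_to(var_merr, var_obs.shape))
    # corrected variances can become negative in small samples, giving nan here
    std = np.sqrt(np.diag(cov_corr))
    corr = cov_corr / np.outer(std, std)
    return cov_corr, corr

rng = np.random.default_rng(5)
n = 20_000
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
observed = latent + rng.normal(scale=np.sqrt(0.5), size=latent.shape)

cov_c, corr_c = corr_corrected_sketch(observed, var_merr=[0.5, 0.5])
_, corr_r = corr_corrected_sketch(observed, reliability=[2 / 3, 2 / 3])
corr_obs = np.corrcoef(observed, rowvar=False)
print("observed:", corr_obs[0, 1], "corrected:", corr_c[0, 1], corr_r[0, 1])
```

The corrected moment matrix is not guaranteed to be positive definite, which is the model compatibility problem raised in an earlier comment.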

josef-pkt commented 5 years ago

(While I was waiting in the dentist's office, ...)

I have a different version of a short course by Carroll on my iPad, which has an overview for the linear model before going into nonlinear models in more detail https://www.stat.tamu.edu/~carroll/talks/Carroll_Short_Course.pdf which refers to Carroll, Raymond J., David Ruppert, Leonard A. Stefanski, and Ciprian M. Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, 2006.

(but the course slides that I have, refer to the first edition 1995)

update

Carroll's talk pdf files and slides are not available anymore (dead links). The slides above are still available at https://web.archive.org/web/20160610190709/http://www.stat.tamu.edu/~carroll/talks/Carroll_Short_Course.pdf and the list is at https://web.archive.org/web/*/https://www.stat.tamu.edu/~carroll/talks/*

talk list current page with dead links https://carroll.stat.tamu.edu/talks/

josef-pkt commented 1 year ago

where is my code? AFAIR, I spent a month or three on this. Hint, my comment above about Fuller's book is from Sep 7, 2016.

as motivation:

see comments by Roland in https://stats.stackexchange.com/questions/416977/regression-and-calibration-inverse-regression-the-same

e.g. " Standards in your case are samples with known concentration. "known" means measured by a more precise method than the one to be calibrated. (For our gas chromatographs we buy such standards from a commercial supplier. They certify the concentration to an uncertainty that is negligible compared to the measurement uncertainty of our instruments. You can not mix gas standards with sufficient precision for calibration in most research labs. So, if you don't have a very precise gas analyzer, I'd say you should buy your standards.) "

found via https://stats.stackexchange.com/questions/602663/calculating-uncertainty-of-predictions-standard-error-or-error-calculus

josef-pkt commented 11 months ago

another question https://stackoverflow.com/questions/77281318/linear-regression-with-errors-in-x-and-y-in-python

R also has a package with eivreg https://cran.r-project.org/web/packages/eivtools/ (not updated since 2018 but still available)

I guess we would need stripped-down model and results classes so we don't inherit from regression models those features that we cannot verify, for example get_prediction and prediction intervals, diagnostics, ...
base structure of model similar to WLS, with and without fixed scale
Code will also need to take care of the constant, where we don't have measurement errors.
Do we work with moment matrices X'X or covariances Cov(X) with separate handling of means and constant?

(I don't remember enough and my prototype code seems only to look at corr x'x with measurement error. So this will not be very quick to implement.)

josef-pkt commented 11 months ago

I'm not sure how this applies in the recent stackoverflow question

AFAIR, the standard assumption is that measurement errors are homoscedastic, i.e. all observations have the same measurement error distribution. In the example we have the measurement error standard deviation for each observation. So, we would need to adjust individual observations similar to WLS with var_weights for y-heteroscedasticity. (I don't remember whether I looked at that)

josef-pkt commented 10 months ago

reference with good summary presentation

Meijer, Erik, Edward Oczkowski, and Tom Wansbeek. “How Measurement Error Affects Inference in Linear Regression.” Empirical Economics 60, no. 1 (January 1, 2021): 131–55. https://doi.org/10.1007/s00181-020-01942-z.

mainly of interest are 2 models: known measurement error variance and known reliability
cov_params: HC sandwich form and some under normality assumptions. Note: old Stata versions (before 16) had incorrect cov_params (*)
includes score_obs
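A minimal sketch of a generic HC0-type sandwich for the known measurement error variance case, built from per-observation moment conditions. It assumes the measurement error covariance is known exactly (no sampling error in it) and has not been checked against the formulas in Meijer et al. The function name `fit_mm_sandwich` and the simulated data are hypothetical.

```python
import numpy as np

def fit_mm_sandwich(w, y, cov_merr):
    """Moment estimator with sandwich covariance.

    Moment condition per observation:
        g_i(b) = w_i * (y_i - w_i' b) + cov_merr @ b
    which has expectation zero at the true b under classical measurement error.
    """
    n = w.shape[0]
    m = w.T @ w / n - cov_merr                  # minus the mean Jacobian of g_i
    params = np.linalg.solve(m, w.T @ y / n)
    resid = y - w @ params
    score_obs = w * resid[:, None] + cov_merr @ params    # shape (n, k)
    b = score_obs.T @ score_obs / n             # "meat" of the sandwich
    m_inv = np.linalg.inv(m)
    cov_params = m_inv @ b @ m_inv / n
    return params, cov_params, score_obs

rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=n)
w = np.column_stack([np.ones(n), x + rng.normal(scale=0.5, size=n)])
y = 1.0 + 2.0 * x + rng.normal(size=n)
cov_merr = np.diag([0.0, 0.25])                 # no error in the constant

params, cov_params, score_obs = fit_mm_sandwich(w, y, cov_merr)
bse = np.sqrt(np.diag(cov_params))
print("params:", params, "bse:", bse)
```

The zero row and column in `cov_merr` is one way to handle a constant without measurement error while still working with the moment matrix X'X.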

extension to heteroscedastic "equation error": Fuller 1987 book p. 194 equ. 3.1.26
It also does not assume that measurement error variance is constant across observations.
cov_params under normal errors (all or only measurement errors?) equ. 3.1.27, generic sandwich form equ. 3.1.12
section of equ. 3.1.12 has WLS form, i.e. given weights in regression
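A minimal sketch of the unweighted moment correction when the measurement error variance is known but differs by observation: the sum of the per-observation error covariances is subtracted from W'W. This follows from the moment argument and is not a transcription of Fuller's equation 3.1.26 (no weights, no small sample adjustment). The simulated data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
x = rng.normal(loc=2.0, size=n)
sd_merr = rng.uniform(0.2, 1.0, size=n)      # known, differs by observation
w = x + sd_merr * rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

design = np.column_stack([np.ones(n), w])
# per-observation measurement error covariance, zero for the constant
cov_merr_obs = np.zeros((n, 2, 2))
cov_merr_obs[:, 1, 1] = sd_merr**2

moment = design.T @ design - cov_merr_obs.sum(axis=0)
params_mm = np.linalg.solve(moment, design.T @ y)
params_ols = np.linalg.lstsq(design, y, rcond=None)[0]
print("corrected:", params_mm, "OLS:", params_ols)
```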

This should be enough to get started with the core parts of the models (based on statistics/econometrics literature)

Related: I have not figured out what physics, astronomy, chemometrics and similar are doing or I don't understand why they are doing what they are doing. (most articles are for simple regression, i.e. one regressor, and no or not much statistical properties) WLS where weights include covariate measurement error seems to be often used, but I have not seen yet whether that provides consistent estimates (or under what assumptions it does)

more general approach: corrected score equations: adjust score/estimating functions to correct for measurement error
new issue for GLM/LEF with canonical link (gaussian, poisson) with explicit corrected score functions: #9030
in gaussian case corrected score is the same as moment method above (Meijer, Fuller, ...)
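A minimal sketch of an explicit corrected score for Poisson with log link, assuming additive normal measurement error with known covariance. It uses that exp(w'b - b'Sb/2) has the same conditional expectation as exp(x'b) when w = x + u with u normal with covariance S. This is an illustration on simulated data and not necessarily the formulation in #9030; the root finding with `scipy.optimize.root` is just a convenient choice here.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(7)
n, var_merr = 20_000, 0.25
beta_true = np.array([0.5, 0.8])

x = rng.normal(size=n)
w = x + rng.normal(scale=np.sqrt(var_merr), size=n)
y = rng.poisson(np.exp(beta_true[0] + beta_true[1] * x))

design = np.column_stack([np.ones(n), w])
cov_merr = np.diag([0.0, var_merr])            # no error in the constant

def score_naive(beta):
    """Poisson score that ignores the measurement error."""
    return design.T @ (y - np.exp(design @ beta)) / n

def score_corrected(beta):
    """Score whose expectation given the true x equals the true-data score."""
    shift = cov_merr @ beta
    mu = np.exp(design @ beta - 0.5 * beta @ shift)
    return (design.T @ y - (design - shift).T @ mu) / n

start = np.array([np.log(y.mean()), 0.0])
beta_naive = optimize.root(score_naive, start).x
beta_corr = optimize.root(score_corrected, beta_naive).x
print("naive:", beta_naive, "corrected:", beta_corr)
```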

(*) Lockwood, J. R., and Daniel F. McCaffrey. “Recommendations about Estimating Errors-in-Variables Regression in Stata.” The Stata Journal 20, no. 1 (March 1, 2020): 116–30. https://doi.org/10.1177/1536867X20909692.

josef-pkt commented 10 months ago

(hopefully final pieces for basic understanding)

WLS where weights include measurement error variance is the Berkson case (which I always skipped in my readings). OLS is still consistent in this case.
short summary of Berkson case in Wansbeek and Meijer book p. 30 (with brief example: company sets price but does not know what prices consumers actually pay, estimate demand function only knowing the company price)
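A small simulation contrasting the two cases, with made-up parameters. In the Berkson case the true value scatters around the observed (set) value, so the extra error ends up in the regression residual and is uncorrelated with the regressor. In the classical case the observed value scatters around the true value and the OLS slope is attenuated.

```python
import numpy as np

rng = np.random.default_rng(8)
n, slope = 50_000, 2.0

def ols_slope(regressor, endog):
    """Slope of a simple OLS regression with constant."""
    return np.polyfit(regressor, endog, 1)[0]

# Berkson: observed w is set, true x scatters around it
w_b = rng.normal(size=n)
x_b = w_b + rng.normal(scale=0.7, size=n)
y_b = 1.0 + slope * x_b + rng.normal(size=n)

# classical: true x is given, observed w scatters around it
x_c = rng.normal(size=n)
w_c = x_c + rng.normal(scale=0.7, size=n)
y_c = 1.0 + slope * x_c + rng.normal(size=n)

slope_berkson = ols_slope(w_b, y_b)       # close to the true slope
slope_classical = ols_slope(w_c, y_c)     # attenuated toward zero
print(slope_berkson, slope_classical)
```

With observation specific Berkson error variances the residual variance differs by observation, which is where WLS weights that include the measurement error variance come in.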

tbc

update
two different versions of WLS, measurement error included in weights
Tellinghuisen has EV1 and EV2 for weighted least squares
EV1: standard WLS iterations with updating of weights (moment conditions are WLS and assume weights are fixed); same as OLS if weights are the same across observations, as above
EV2: take dependence of weights on params into account; moment conditions have an extra term with the derivative of weights/variance w.r.t. slope parameters. (This is similar to GLS if the variance function directly depends on mean parameters and we take this into account in the moment conditions for mean parameters; mean and variance function have common parameters)
The original commonly cited article is Lisý et al. 1990.
The extra term changes the "working y" similar to working residuals in GLM-irls.
(I have not yet seen any theory for statistical properties like consistency of OLS/WLS in this case.)

Tellinghuisen, Joel. “Least Squares Methods for Treating Problems with Uncertainty in x and y.” Analytical Chemistry 92, no. 16 (August 18, 2020): 10863–71. https://doi.org/10.1021/acs.analchem.0c02178.

Lisý, J. M., A. Cholvadová, and J. Kutej. “Multiple Straight-Line Least-Squares Analysis with Uncertainties in All Variables.” Computers & Chemistry 14, no. 3 (January 1, 1990): 189–92. https://doi.org/10.1016/0097-8485(90)80045-4.

consequence: we need extra model for EV2, i.e. non-separable variance in WLS.
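A minimal sketch of the two variants for a straight line, based on the description above and not on the articles' algorithms. It assumes the effective variance sd_y**2 + b**2 * sd_x**2 as inverse weight. "EV1" iterates WLS with weights held fixed in each step, "EV2" minimizes the weighted sum of squares with the weights depending on the slope (here by profiling out the intercept). Data and helper names are made up.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(9)
n = 400
x_true = np.linspace(0.0, 10.0, n)
sd_x = rng.uniform(0.2, 0.6, size=n)      # known per-observation std. dev.
sd_y = rng.uniform(0.5, 1.5, size=n)
x = x_true + sd_x * rng.normal(size=n)
y = 1.0 + 2.0 * x_true + sd_y * rng.normal(size=n)

def wls_line(weights):
    """Weighted least squares fit of y = a + b x with given weights."""
    sw = np.sqrt(weights)
    design = np.column_stack([np.ones(n), x]) * sw[:, None]
    return np.linalg.lstsq(design, y * sw, rcond=None)[0]

def effective_weights(b):
    return 1.0 / (sd_y**2 + b**2 * sd_x**2)

def objective(a, b):
    return np.sum(effective_weights(b) * (y - a - b * x) ** 2)

# EV1: iterate WLS, treating the weights as fixed within each step
a1, b1 = wls_line(np.ones(n))
for _ in range(50):
    a1, b1 = wls_line(effective_weights(b1))

# EV2: minimize the objective with the weights depending on b
def profile(b):
    wts = effective_weights(b)
    a = np.sum(wts * (y - b * x)) / np.sum(wts)   # best intercept for this b
    return objective(a, b)

b2 = optimize.minimize_scalar(profile, bounds=(0.0, 5.0), method="bounded").x
wts2 = effective_weights(b2)
a2 = np.sum(wts2 * (y - b2 * x)) / np.sum(wts2)
print("EV1:", a1, b1, "EV2:", a2, b2)
```

The two variants solve different estimating equations, so they generally give different estimates, which is why EV2 would need its own model with non-separable variance.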

josef-pkt commented 10 months ago

possible class names

in regression
MeasurementErrorMMSimple: linear model with known measurement error variances or reliability (?), constant over observations, estimated by method of moments (Fuller)

MeasurementErrorMM: similar but with observation specific measurement errors

options maybe within classes (instead of adding more subclasses)

diagonal measurement error (independent across regressors and endog) versus full measurement covariance; computationally mainly important if we need to compute with observation specific (nobs x k x k) measurement errors
with or without endog measurement error (maybe also with or without correlation of endog m errors with exog m errors)
cov_type: robust sandwich (M-estimator) or normality and homoscedasticity assumption; sandwich should be generic code (with score_obs), nonrobust requires longer linear algebra expressions
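A minimal sketch of how the diagonal versus full and common versus observation specific options could be reduced to one (k, k) correction matrix without building the (nobs x k x k) array in the diagonal case. The helper name `total_merr_cov` and the shape conventions are hypothetical. Shapes are checked in the listed order, so nobs == k would be ambiguous and would need an explicit option.

```python
import numpy as np

def total_merr_cov(merr, nobs, k):
    """Sum of measurement error covariances over observations, shape (k, k).

    Accepted shapes for ``merr`` (variances, not standard deviations):
    (k,) common diagonal, (k, k) common full,
    (nobs, k) observation specific diagonal,
    (nobs, k, k) observation specific full.
    """
    merr = np.asarray(merr, dtype=float)
    if merr.shape == (k,):
        return nobs * np.diag(merr)
    if merr.shape == (k, k):
        return nobs * merr
    if merr.shape == (nobs, k):
        return np.diag(merr.sum(axis=0))
    if merr.shape == (nobs, k, k):
        return merr.sum(axis=0)
    raise ValueError(f"unsupported measurement error shape {merr.shape}")

# constant without measurement error, one noisy regressor, 10 observations
common = total_merr_cov([0.0, 0.5], nobs=10, k=2)
by_obs = total_merr_cov(np.column_stack([np.zeros(10), np.full(10, 0.5)]),
                        nobs=10, k=2)
print(common)
print(by_obs)
```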