
Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

TSA: outliers, robust estimation, inference #3285

Open · josef-pkt opened this issue 7 years ago

josef-pkt commented 7 years ago

related to #943

#943 and the related discussion are about how to handle outliers. AFAICS, the main approaches to outlier detection and handling for ARMA and similar models are based on intervention models and on the "Oxford" dummy-saturation approach. AFAIK, this ignores the impact of the outlier search on inference.
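As a minimal illustration of the intervention-dummy idea only (the series `y`, the outlier position `t0`, and the AR(1) specification are assumptions for the sketch, not anything proposed in this issue): an additive outlier at a known position can be modeled by adding an impulse dummy as an exogenous regressor.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative sketch: model a single additive outlier at a known index t0
# by adding an impulse dummy as an exogenous regressor to an ARMA-type model.
# `y` is an assumed 1-d series; the AR(1) order is arbitrary.
t0 = 50
impulse = np.zeros(len(y))
impulse[t0] = 1.0

res = sm.tsa.SARIMAX(y, exog=impulse, order=(1, 0, 0)).fit(disp=False)
print(res.params)  # the exog coefficient estimates the outlier's magnitude
```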

Søren Johansen and Bent Nielsen have a recent sequence of papers that includes asymptotic theory and is closer to the robust statistics literature, e.g.:

Johansen, Søren, and Bent Nielsen. 2016. “Asymptotic Theory of Outlier Detection Algorithms for Linear Time Series Regression Models.” Scandinavian Journal of Statistics 43 (2): 321–48. doi:10.1111/sjos.12174.

Johansen, Søren, and Bent Nielsen. 2013. “Outlier Detection in Regression Using an Iterated One-Step Approximation to the Huber-Skip Estimator.” Econometrics 1 (1): 53–70. doi:10.3390/econometrics1010053. http://www.mdpi.com/2225-1146/1/1/53

This is just a reference for now (so we can find it again); I have no idea when we will get into this neighborhood. The priority should be to implement the traditional methods for outlier handling in tsa models, with a focus on prediction and less emphasis on inference. (The quality of forecast standard errors has not been a top priority for anyone either.)

An issue that might be relevant even for the traditional methods is the possible difference in convergence or asymptotic distribution depending on whether a one-step WLS, a one-step Newton, or a fully iterated solution is used. See #3273.

As an example: unit root tests under outlier removal. (From a quick skim of the second paper above, 2013: it derives the Brownian-motion-based distribution of the AR(1) coefficient under a unit root as an example.) (Related: somewhere I have seen articles on robust VAR and cointegration.)

Kevin-McIsaac commented 7 years ago

In a recent project I had to find the outliers in a large number of time series. I found that the typical process is to first remove any trend and/or seasonal component to create residuals. I used the residuals from a linear model fit with smf.ols, but this could be extended to use ARMA etc.

In many cases the residuals will be close to normal, so you can use a robust measure of dispersion (e.g., based on the median/MAD or the IQR) to identify the outliers.

While simplistic compared to the proposal above, it worked well in my analysis and is a simple place to start.
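A minimal sketch of this detrend-then-threshold approach (the function name, the linear-trend regressor, and the cutoff of 3 robust standard deviations are illustrative assumptions, not existing statsmodels API):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.robust.scale import mad


def flag_outliers(y, threshold=3.0):
    """Flag points whose detrended residual is large on a robust scale.

    y : pandas Series; threshold : cutoff in MAD-based standard deviations.
    """
    # Detrend with OLS on a linear time trend (this is the piece that could
    # be swapped for a seasonal or ARMA model).
    exog = sm.add_constant(np.arange(len(y)))
    resid = sm.OLS(np.asarray(y, dtype=float), exog).fit().resid
    # mad() is scaled to be consistent with the normal standard deviation.
    z = (resid - np.median(resid)) / mad(resid)
    return pd.Series(np.abs(z) > threshold, index=y.index)
```

Usage would then be something like `outliers = flag_outliers(series)`.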

I wrote some code for pandas Series, created a basic notebook, and proposed an enhancement in pandas-dev. The recommendation was to put this in statsmodels.

Would you be interested in adding something like this? If so I can rework it to fit better with statsmodels.

josef-pkt commented 7 years ago

@Kevin-McIsaac Yes, I think it would fit into statsmodels. I didn't have time to look at it today, but I will try to provide more information later. I think what we could target is a user-friendly class for this specific purpose, with options for different methods of outlier detection. That way we could start with simple methods and expand the available methods over time.

One detail: I would use RLM instead of OLS in the simple case based on linear regression.

Also, given that it should be user-friendly, I would add some imputation methods for NaN or missing data by default, with options to choose from, if imputation is not tied directly to the outlier identification.
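A rough sketch of what such a user-facing class could look like (the class name, method names, and defaults are hypothetical, not existing or planned statsmodels API; RLM with the default Huber norm does the detrending):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm


class OutlierDetector:
    """Hypothetical sketch: detrend with RLM, flag large robust residuals,
    and optionally impute the flagged points. Not statsmodels API."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold

    def fit(self, y):
        # Robust linear-trend fit; other detrending models could be options.
        exog = sm.add_constant(np.arange(len(y)))
        res = sm.RLM(np.asarray(y, dtype=float), exog,
                     M=sm.robust.norms.HuberT()).fit()
        # res.scale is RLM's robust estimate of the residual scale.
        self.outliers_ = pd.Series(
            np.abs(res.resid) > self.threshold * res.scale, index=y.index)
        return self

    def impute(self, y, method="interpolate"):
        cleaned = y.where(~self.outliers_)  # flagged points become NaN
        if method == "interpolate":
            cleaned = cleaned.interpolate()
        return cleaned
```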

Kevin-McIsaac commented 7 years ago

OK, I'll replace OLS with RLM and update my code to fit statsmodels, then I'll create a PR. Since I'm new to this, I'd appreciate critical feedback on the implementation.

I've already got a function for replacing outliers with NaN or interpolated values, which I'll include. There is also a time series plotting function that I found useful that we could look at.
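For illustration only, the replace-with-NaN-or-interpolate step could be as small as the following (assuming a boolean Series `is_outlier` aligned with the data; this is not the actual code being proposed):

```python
import pandas as pd

# `y` is the original series and `is_outlier` a boolean Series on the same
# index (both assumed); flagged points are blanked out and then interpolated.
cleaned = y.mask(is_outlier)
interpolated = cleaned.interpolate(method="time")  # needs a DatetimeIndex
```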


kris-singh commented 7 years ago

I would also like to work on this if possible. I will look into those papers and see what I understand.