
Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

Robust linear model: divide by zero error for perfect fits #3319

Closed SamNPowers closed 5 years ago

SamNPowers commented 7 years ago

If you run this python code:

import statsmodels.api as sm

x = [80, 70, 60, 50, 40, 30, 20, 10, 0]
y = [-4, -5, -6, -7, -8, -9, -10, -11, -12]

x_with_const = sm.add_constant(x)
rlm_model = sm.RLM(y, x_with_const, M=sm.robust.norms.TukeyBiweight())
rlm_results = rlm_model.fit()

the result is a divide-by-zero exception in robust_linear_model.py. This happens even if there are outliers -- as soon as the algorithm decides to remove them, the divide by zero comes back. E.g., the same code as above except for:

x = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
y = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,1000]

still results in the divide by zero error.

I'm running statsmodels 0.6.1.

I edited robust_linear_model.py so that the two offending scale computations (_estimate_scale() and deviance()) have logic basically like:

import sys

if scale == 0:
    scale = sys.float_info.epsilon

and this fixed the issues, but this seems more like a workaround than a real fix.

josef-pkt commented 7 years ago

I'm temporarily marking this prio-high.

I tried to solve this problem properly some years ago, instead of just fixing the symptoms, but I got confused about the perfect-fit policies and behaviors and was not sure what to do with it (#1341). I have recently been reading books and articles again that argue that the perfect-fit property with variance = 0 is a feature and not a problem, but I don't "like" it, and it was not clear to me how to add user options for it.

edit: A perfect fit happens when at least half of the observations lie exactly on a line or subspace. This is because we use MAD for the scale estimate, and the median absolute residual is then zero. If we don't want a breakdown point of around one half, then the scale estimator should be adjusted, but the Huber scale also did not look very plausible.
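
For illustration, a minimal sketch of why MAD collapses (using statsmodels' mad estimator; the residual values are made up, not taken from the report):

import numpy as np
from statsmodels.robust.scale import mad

# Ten residuals where more than half are exactly zero, as happens when
# the majority of observations lie exactly on the fitted line.
resid = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 984.0])

# The median absolute deviation is zero, so the scale estimate is zero
# and rescaling the residuals by it divides by zero.
print(mad(resid))  # prints 0.0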

The point is essentially to decide whether, for example, the .5 y-values in this case are outliers or not:

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12.5, 14.5, 15.5, 16.5]

and whether the code should make that decision itself or ask the user.

SamNPowers commented 7 years ago

Perhaps the way I would expect this to show up is in the fit_history. If a perfect fit resulted in NaNs in the fit history, I could detect this and go back in the history to the most recent fit that worked for me. If the weights were also saved off in fit history, I could do some basic decision making based on what fraction of weights are non-zero.
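
As a hypothetical sketch of the caller-side logic I have in mind, assuming fit_history stored per-iteration params and weights (which it currently does not):

import numpy as np

def last_usable_fit(fit_history, min_nonzero_frac=0.5):
    # Walk the iterations backwards and return the most recent params
    # that are NaN-free and keep enough observations weighted in.
    for params, weights in zip(reversed(fit_history["params"]),
                               reversed(fit_history["weights"])):
        params = np.asarray(params)
        if np.any(np.isnan(params)):
            continue
        if np.mean(np.asarray(weights) != 0) >= min_nonzero_frac:
            return params
    return None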

The main issue I'm having right now is the fit() blows up completely, so I don't even have access to the history to make a decision for myself.

My preference would also be for the perfect fit case to use an epsilon scale, so that I would never see NaNs, but that might be mostly for practical reasons specific to my particular use case.

josef-pkt commented 7 years ago

@SamNPowers What kind of data do you have to get perfect fit? And how many observations?

I think that saving the weights in the history will not be practical, because it looks too much like a memory leak, i.e. we would be adding a full nobs-length array in each iteration.

The main workaround for the perfect fit problem that I thought about is to add a tiny bit of noise to the data, which would put the noise standard deviation as a lower bound on the scale.
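
A minimal sketch of that workaround on the example from the original report (the noise level 1e-8 is an arbitrary illustrative choice):

import numpy as np
import statsmodels.api as sm

x = [80, 70, 60, 50, 40, 30, 20, 10, 0]
y = [-4, -5, -6, -7, -8, -9, -10, -11, -12]

# Jitter y so that the noise standard deviation acts as a rough
# lower bound on the residual scale.
rng = np.random.default_rng(0)
y_noisy = np.asarray(y, dtype=float) + rng.normal(scale=1e-8, size=len(y))

x_with_const = sm.add_constant(x)
rlm_results = sm.RLM(y_noisy, x_with_const,
                     M=sm.robust.norms.TukeyBiweight()).fit()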

There are currently some limitations on the scale options, and I just saw another bug with an unused scale option that would allow users to provide a scale function or callable.

From what I can see now (being 3 years older than in the previous round), we need to

SamNPowers commented 7 years ago

The data I'm using to get a perfect fit is simulated data - we need to have a set of data with a well-known fit to use for testing some of our other logic. I considered adding noise to the data but that seemed like a hack around the issue.

I am very in favor of allowing the user to specify a lower bound on scale.

Thanks for being so responsive about this!

josef-pkt commented 7 years ago

Thinking about it some more, I'm starting to like the idea of adding an epsilon lower bound on the scale by default, which could be set to zero with a new fit keyword. This would be independent of the "proper" handling of the perfect fit case.
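
A hypothetical sketch of the clamping (scale_min is an invented keyword name here, not existing statsmodels API):

import sys

def clamp_scale(scale, scale_min=sys.float_info.epsilon):
    # scale_min=0 would restore the current behavior (and the exception);
    # any positive value keeps the rescaled residuals finite.
    return max(scale, scale_min)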

I guess the epsilon lower bound would be an interior solution that is numerically close to the full perfect fit solution. @SamNPowers Did all the numbers look reasonable with your epsilon perfect fit, or does it look like we need to adjust other parts?

I'm not planning to work on the full solution for some time, but I think it should be possible to quickly get some changes merged to avoid the current exceptions and give the user some options as workarounds.

josef-pkt commented 7 years ago

BTW: I'm glad to get some feedback on this, because last time I either couldn't figure out what should be done or couldn't make up my mind about which solution would be best. (And I spent large parts of the last 2 months reading again about outlier-robust estimation, although with a different focus, mainly on multivariate models.)

SamNPowers commented 7 years ago

The two main scenarios I tested were: 1. a simple line, and 2. a simple line with a couple of outliers. Both gave me the robust fit parameters I expected with just my simple scale lower bound. It's not exactly a stress test, but it at least indicated to me that there weren't other issues.

josef-pkt commented 7 years ago

original issue: #55 (It's the 6th-oldest issue and resurfaces every few years. Some of the oldest issues have been fixed over time but never closed.)

SamNPowers commented 7 years ago

Do we have an estimate of when the code to avoid the current exceptions and give the user some workarounds might land?