Open nickeubank opened 3 years ago
Oh, wow -- I think I solved it. LOESS is calculating local medians, not means.
I re-did my grouped means as grouped medians and plotted, and I get this:
Hmm. For reference, here is the LOESS implementation used by Vega / Vega-Lite / Altair: https://github.com/vega/vega/blob/master/packages/vega-statistics/src/regression/loess.js
The calculation involves performing a series of OLS linear regressions over a sliding window of weight-adjusted points. There is no direct median calculation within this. (You will see a median calculation in the source code applied to residuals -- it is used to determine stopping conditions and perform weight adjustments -- but it is not applied directly in the regression itself.) I think what is happening here is that this robust weight adjustment is canceling out the contribution of outliers in the data, hence the better match to grouped medians than grouped means.
One sanity test is to plot all the data as points -- not grouped means/medians -- and see how that looks (allowing outliers to be seen more clearly). If you have heavy skew / large outliers, grouped means and OLS linear regression may not be the best choices for your data!
If you have heavy skew / large outliers, grouped means and OLS linear regression may not be the best choices for your data!
Sure -- this isn't really about that. I was just doing some data explorations when I discovered this, and want to understand it.
For context, here's the result of the Stata implementation of Loess on this data:
sort years_w_decimals
lowess arrest_rate years_w_decimals , nograph gen(yhat)
graph twoway line yhat years_w_decimals
And here's R:
l1 = loess(arrest_rate ~ years_w_decimals, df)
result = predict(l1)
plot(result, x=df[!is.na(df$arrest_rate),"years_w_decimals"], col="red")
All basically match the regression line. So I think it's pretty safe to say something is wrong with the LOESS implementation...
And from plotnine (using scikit-learn):
from plotnine import *
import pandas as pd
panel = pd.read_csv(
"https://github.com/nickeubank/im_baffled/raw/main/arrest_rates.csv.zip"
)
g = ggplot(panel, aes(x="years_w_decimals", y="arrest_rate")) + geom_smooth(method='loess')
g
Thanks for the comparisons! These are very helpful. I'm speculating that the other methods may have implementations or default settings that do not include re-weighting for robustness against outliers, but I'll need to dig in to make sure. We may ultimately want to expose a parameter to toggle that aspect.
Sounds good!
BTW, if by weights you don't mean kernel weights / window weights, but rather something other than a simple minimization of squared errors, than I agree it would explain what's going on. However, is this is a non-standard loess implementation, then I agree it probably shouldn't be the default behavior, and/or should be well documented.
If I disable the robustness weights, I get this instead for bandwidth 0.6:
The y-axis range now matches the other methods, so that answers the initial question. That said, I do notice that with Vega the initial (2013) value is higher, whereas the others start lower with a convex peak. So we will need to further double check that aspect.
Wonderful! Thanks so much for running that down. Very reassuring to know what's going on, save any issues at the endpoints.
Please:
I'm relatively new to Altair / vega, but I've run into an issue with
transform_loess
that I think is an implementation error (I've posted on the Altair user forum, and stackoverflow thinking it was my own mistake, but doesn't seem like I've done anything wrong).Basically when I'm fitting a LOESS fit to my data, the entire LOESS line is being drawn below the sample average, below averages at each time point, and below my regression fit.
I've never used Vega directly, but here's the Altair code to generate the problem and the associated plot. The data is a panel with monthly arrest rate (part 2 crimes per 1,000 people) for number of localities. Here's a plot with monthly average rates, a linear regression fit, and my loess. As you can see, the loess is way below all the data:
The raw data has substantial skew, so my best guess (and that of an SO commenter) is that there's some issue with outliers.
Happy to try to get you a vega spec if you can point me in the right direction.