LOESS implementation error?

nickeubank commented 3 years ago

Please:

[x] Check for duplicate issues. Please file separate requests as separate issues.
[x] Describe how to reproduce the bug.
[x] Use the latest versions of Vega and Vega-Lite.
[] Provide an minimal, reproducible example spec in JSON, wrapped by triple backticks like this:
[x] If applicable, include error messages and screenshots, GIF videos (e.g. using https://www.cockos.com/licecap/), or working examples (e.g. by clicking share in the Vega-Editor or https://bit.ly/vega-lite-blocks)

I'm relatively new to Altair / vega, but I've run into an issue with transform_loess that I think is an implementation error (I've posted on the Altair user forum, and stackoverflow thinking it was my own mistake, but doesn't seem like I've done anything wrong).

Basically when I'm fitting a LOESS fit to my data, the entire LOESS line is being drawn below the sample average, below averages at each time point, and below my regression fit.

I've never used Vega directly, but here's the Altair code to generate the problem and the associated plot. The data is a panel with monthly arrest rate (part 2 crimes per 1,000 people) for number of localities. Here's a plot with monthly average rates, a linear regression fit, and my loess. As you can see, the loess is way below all the data:

import pandas as pd
import altair as alt

alt.data_transformers.disable_max_rows()

# Load panel data. Monthly arrest rate (part 2 crimes per 1,000 people)
# data for number of localities.

panel = pd.read_csv(
    "https://github.com/nickeubank/im_baffled/raw/main/arrest_rates.csv.zip"
)

# And if I do averages for each month, I get
# a relatively smooth downward trend.

grouped_means = panel.groupby("years_w_decimals", as_index=False)[
    ["arrest_rate"]
].mean()

chart_grouped = (
    alt.Chart(grouped_means)
    .mark_circle(opacity=0.5)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
)

reg = (
    alt.Chart(panel)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
    .transform_regression(
        "years_w_decimals",
        "arrest_rate",
        method="poly",
        order=1,
    )
    .mark_line()
)

loess = (
    alt.Chart(panel)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
    .transform_loess(on="years_w_decimals", loess="arrest_rate", bandwidth=0.3)
    .mark_line()
)
reg + chart_grouped + loess

The raw data has substantial skew, so my best guess (and that of an SO commenter) is that there's some issue with outliers.

Happy to try to get you a vega spec if you can point me in the right direction.

nickeubank commented 3 years ago

Oh, wow -- I think I solved it. LOESS is calculating local medians, not means.

I re-did my grouped means as grouped medians and plotted, and I get this:

jheer commented 3 years ago

Hmm. For reference, here is the LOESS implementation used by Vega / Vega-Lite / Altair: https://github.com/vega/vega/blob/master/packages/vega-statistics/src/regression/loess.js

The calculation involves performing a series of OLS linear regressions over a sliding window of weight-adjusted points. There is no direct median calculation within this. (You will see a median calculation in the source code applied to residuals -- it is used to determine stopping conditions and perform weight adjustments -- but it is not applied directly in the regression itself.) I think what is happening here is that this robust weight adjustment is canceling out the contribution of outliers in the data, hence the better match to grouped medians than grouped means.

One sanity test is to plot all the data as points -- not grouped means/medians -- and see how that looks (allowing outliers to be seen more clearly). If you have heavy skew / large outliers, grouped means and OLS linear regression may not be the best choices for your data!

nickeubank commented 3 years ago

If you have heavy skew / large outliers, grouped means and OLS linear regression may not be the best choices for your data!

Sure -- this isn't really about that. I was just doing some data explorations when I discovered this, and want to understand it.

For context, here's the result of the Stata implementation of Loess on this data:


sort years_w_decimals
lowess arrest_rate years_w_decimals , nograph gen(yhat)
graph twoway line yhat years_w_decimals

And here's R:

l1 = loess(arrest_rate ~ years_w_decimals, df)
result = predict(l1)
plot(result, x=df[!is.na(df$arrest_rate),"years_w_decimals"], col="red")

All basically match the regression line. So I think it's pretty safe to say something is wrong with the LOESS implementation...

nickeubank commented 3 years ago

And from plotnine (using scikit-learn):

from plotnine import *
import pandas as pd

panel = pd.read_csv(
    "https://github.com/nickeubank/im_baffled/raw/main/arrest_rates.csv.zip"
)

g = ggplot(panel, aes(x="years_w_decimals", y="arrest_rate")) + geom_smooth(method='loess')
g

jheer commented 3 years ago

Thanks for the comparisons! These are very helpful. I'm speculating that the other methods may have implementations or default settings that do not include re-weighting for robustness against outliers, but I'll need to dig in to make sure. We may ultimately want to expose a parameter to toggle that aspect.

nickeubank commented 3 years ago

Sounds good!

BTW, if by weights you don't mean kernel weights / window weights, but rather something other than a simple minimization of squared errors, than I agree it would explain what's going on. However, is this is a non-standard loess implementation, then I agree it probably shouldn't be the default behavior, and/or should be well documented.

jheer commented 3 years ago

If I disable the robustness weights, I get this instead for bandwidth 0.6:

The y-axis range now matches the other methods, so that answers the initial question. That said, I do notice that with Vega the initial (2013) value is higher, whereas the others start lower with a convex peak. So we will need to further double check that aspect.

nickeubank commented 3 years ago

Wonderful! Thanks so much for running that down. Very reassuring to know what's going on, save any issues at the endpoints.

vega / vega-lite

LOESS implementation error? #7686