vega / vega-lite

A concise grammar of interactive graphics, built on Vega.
https://vega.github.io/vega-lite/
BSD 3-Clause "New" or "Revised" License

LOESS implementation error? #7686

Open nickeubank opened 3 years ago

nickeubank commented 3 years ago

I'm relatively new to Altair / Vega, but I've run into an issue with transform_loess that I think is an implementation error. (I've posted on the Altair user forum and Stack Overflow thinking it was my own mistake, but it doesn't seem like I've done anything wrong.)

Basically, when I fit a LOESS curve to my data, the entire LOESS line is drawn below the sample average, below the averages at each time point, and below my regression fit.

I've never used Vega directly, but here's the Altair code to reproduce the problem, along with the associated plot. The data is a panel of monthly arrest rates (part 2 crimes per 1,000 people) for a number of localities. Here's a plot with monthly average rates, a linear regression fit, and my LOESS. As you can see, the LOESS is way below all the data:

[image: monthly average rates, a linear regression fit, and a LOESS line well below both]

import pandas as pd
import altair as alt

alt.data_transformers.disable_max_rows()

# Load panel data. Monthly arrest rate (part 2 crimes per 1,000 people)
# data for number of localities.

panel = pd.read_csv(
    "https://github.com/nickeubank/im_baffled/raw/main/arrest_rates.csv.zip"
)

# And if I do averages for each month, I get
# a relatively smooth downward trend.

grouped_means = panel.groupby("years_w_decimals", as_index=False)[
    ["arrest_rate"]
].mean()

chart_grouped = (
    alt.Chart(grouped_means)
    .mark_circle(opacity=0.5)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
)

reg = (
    alt.Chart(panel)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
    .transform_regression(
        "years_w_decimals",
        "arrest_rate",
        method="poly",
        order=1,
    )
    .mark_line()
)

loess = (
    alt.Chart(panel)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
    .transform_loess(on="years_w_decimals", loess="arrest_rate", bandwidth=0.3)
    .mark_line()
)
reg + chart_grouped + loess

The raw data has substantial skew, so my best guess (and that of a Stack Overflow commenter) is that there's some issue with outliers.

Happy to try to get you a vega spec if you can point me in the right direction.

nickeubank commented 3 years ago

Oh, wow -- I think I solved it. LOESS is calculating local medians, not means.

I re-did my grouped means as grouped medians and plotted, and I get this:

[image: grouped medians, which closely track the LOESS line]
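(For anyone following along: the mean/median gap on right-skewed data is easy to demonstrate. This toy snippet uses synthetic lognormal draws, not the arrest-rate data, and just shows why a smoother that effectively tracks local medians would sit below a curve of grouped means.)

```python
# Toy demonstration (synthetic data, not the arrest rates): on
# right-skewed data the mean sits well above the median, so a smoother
# that effectively tracks local medians will plot below grouped means.
import random
import statistics

random.seed(0)
# Mostly small values with an occasional large outlier.
sample = [random.lognormvariate(0, 1) for _ in range(10_000)]

mean = statistics.mean(sample)
median = statistics.median(sample)
assert mean > median  # the gap visible in the plots above
```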

jheer commented 3 years ago

Hmm. For reference, here is the LOESS implementation used by Vega / Vega-Lite / Altair: https://github.com/vega/vega/blob/master/packages/vega-statistics/src/regression/loess.js

The calculation involves performing a series of OLS linear regressions over a sliding window of weight-adjusted points. There is no direct median calculation within this. (You will see a median calculation in the source code applied to residuals -- it is used to determine stopping conditions and perform weight adjustments -- but it is not applied directly in the regression itself.) I think what is happening here is that this robust weight adjustment is canceling out the contribution of outliers in the data, hence the better match to grouped medians than grouped means.
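For readers unfamiliar with the scheme, the residual-based weight adjustment described above can be sketched roughly as follows. This is a simplified illustration in the style of Cleveland's bisquare weights; the names and exact cutoff are mine, not taken from the Vega source linked above.

```python
# Simplified sketch of residual-based robustness weighting, in the style
# of Cleveland's bisquare weights. Names and the exact cutoff are
# illustrative -- see loess.js for Vega's actual implementation.
import statistics

def bisquare_robustness_weights(residuals):
    """Downweight points whose residuals are large relative to the
    median absolute residual; gross outliers get weight 0."""
    s = statistics.median(abs(r) for r in residuals)
    weights = []
    for r in residuals:
        u = r / (6 * s)  # 6 * median(|residual|) is the usual cutoff
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return weights

# Three well-behaved points keep weight near 1; the outlier is dropped
# from the next weighted regression pass entirely.
w = bisquare_robustness_weights([0.1, -0.2, 0.15, 8.0])
```

Because each pass re-fits the weighted regressions with outliers suppressed, the resulting curve behaves much more like a local median than a local mean.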

One sanity test is to plot all the data as points -- not grouped means/medians -- and see how that looks (allowing outliers to be seen more clearly). If you have heavy skew / large outliers, grouped means and OLS linear regression may not be the best choices for your data!

nickeubank commented 3 years ago

> If you have heavy skew / large outliers, grouped means and OLS linear regression may not be the best choices for your data!

Sure -- this isn't really about that. I was just doing some data exploration when I discovered this, and I want to understand it.

For context, here's the result of Stata's lowess implementation on this data:


sort years_w_decimals
lowess arrest_rate years_w_decimals , nograph gen(yhat)
graph twoway line yhat years_w_decimals

[image: Stata lowess fit, roughly tracking the regression line]

And here's R (where df holds the same panel data):

l1 = loess(arrest_rate ~ years_w_decimals, df)
result = predict(l1)
plot(result, x=df[!is.na(df$arrest_rate),"years_w_decimals"], col="red")

[image: R loess fit, roughly tracking the regression line]

All basically match the regression line. So I think it's pretty safe to say something is wrong with the LOESS implementation...

nickeubank commented 3 years ago

And from plotnine (using scikit-learn):

[image: plotnine loess fit, roughly tracking the regression line]

from plotnine import *
import pandas as pd

panel = pd.read_csv(
    "https://github.com/nickeubank/im_baffled/raw/main/arrest_rates.csv.zip"
)

g = ggplot(panel, aes(x="years_w_decimals", y="arrest_rate")) + geom_smooth(method='loess')
g

jheer commented 3 years ago

Thanks for the comparisons! These are very helpful. I'm speculating that the other methods may have implementations or default settings that do not include re-weighting for robustness against outliers, but I'll need to dig in to make sure. We may ultimately want to expose a parameter to toggle that aspect.

nickeubank commented 3 years ago

Sounds good!

BTW, if by weights you don't mean kernel weights / window weights, but rather something other than a simple minimization of squared errors, then I agree that would explain what's going on. However, if this is a non-standard LOESS implementation, then I agree it probably shouldn't be the default behavior, and/or should be well documented.

jheer commented 3 years ago

If I disable the robustness weights, I get this instead for bandwidth 0.6:

[image: LOESS line without robustness weights, now in the same y-range as the other methods]

The y-axis range now matches the other methods, so that answers the initial question. That said, I do notice that with Vega the initial (2013) value is higher, whereas the others start lower with a convex peak. So we will need to double-check that aspect further.
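As a rough illustration of what toggling the robustness iterations does to a fitted level (a toy sketch, not Vega's code): with zero iterations the estimate is the plain, outlier-sensitive mean; a few bisquare reweighting passes pull it down toward the median, which mirrors the two plots above.

```python
# Toy sketch (not Vega's code): zero robustness iterations gives the
# plain, outlier-sensitive mean; a few bisquare reweighting passes pull
# the estimate toward the median.
import statistics

def reweighted_mean(values, iterations):
    """Iteratively reweighted mean with bisquare robustness weights."""
    est = statistics.mean(values)
    for _ in range(iterations):
        resid = [v - est for v in values]
        s = statistics.median(abs(r) for r in resid) or 1e-12
        ws = [(1 - (r / (6 * s)) ** 2) ** 2 if abs(r) < 6 * s else 0.0
              for r in resid]
        est = sum(w * v for w, v in zip(ws, values)) / sum(ws)
    return est

data = [1.0, 1.2, 0.9, 1.1, 1.0, 15.0, 12.0]  # skewed by two outliers
plain = reweighted_mean(data, iterations=0)    # plain mean, pulled up
robust = reweighted_mean(data, iterations=3)   # drops toward median 1.0
assert plain > robust
```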

nickeubank commented 3 years ago

Wonderful! Thanks so much for running that down. Very reassuring to know what's going on, save any issues at the endpoints.