vega / vega-lite

A concise grammar of interactive graphics, built on Vega.
https://vega.github.io/vega-lite/
BSD 3-Clause "New" or "Revised" License

Macro for regression / loess #3988

Open davidanthoff opened 6 years ago

davidanthoff commented 6 years ago

One thing that folks seem to use a lot with ggplot is to add a regression line to a plot with something like geom_smooth(method = "lm", se = FALSE). Would be great if there was a way to do something like that in vega-lite as well.

g3o2 commented 6 years ago

Check this issue in vega

As of now, you have two choices in vega-lite:

1) use a calculate transform to generate the line from a model formula fitted outside of the Vega ecosystem, or
2) directly provide the relevant points as data, with the model still being fitted outside of Vega-Lite.
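
A minimal sketch of option 1, assuming a linear model fitted elsewhere. The intercept 4.0 and slope 0.03 are made-up placeholders rather than fitted values, and the output field name "fitted" is hypothetical; the data file and field names are borrowed from the movies examples later in this thread.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "description": "Sketch only: 4.0 and 0.03 are placeholder coefficients, not fitted values.",
  "data": {"url": "data/movies.json"},
  "layer": [
    {
      "mark": {"type": "point", "filled": true},
      "encoding": {
        "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
        "y": {"field": "IMDB_Rating", "type": "quantitative"}
      }
    },
    {
      "transform": [
        {"calculate": "4.0 + 0.03 * datum.Rotten_Tomatoes_Rating", "as": "fitted"}
      ],
      "mark": {"type": "line", "color": "firebrick"},
      "encoding": {
        "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
        "y": {"field": "fitted", "type": "quantitative"}
      }
    }
  ]
}
```

Any change to the data means refitting the model and editing the coefficients by hand, which is the cumbersome part discussed below.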

davidanthoff commented 6 years ago

I think it would be great if this could be done in pure vega-lite. Yes, I can run the regression outside of vega-lite, but that is way more cumbersome than what you get with ggplot. Also, imagine a situation where this is combined with interactivity, and then one would have to precompute potentially a lot of stuff outside of the plot.

domoritz commented 6 years ago

I agree that this would be a nice feature for Vega-Lite. Maybe the best way to implement this is with a custom mark type.

jacoduplessis commented 5 years ago

Vega now has a regression transform: https://vega.github.io/vega/docs/transforms/regression/

Can't wait to see this in Vega-Lite!

domoritz commented 5 years ago

It's coming in Vega-Lite 4.

@kanitw we can close this issue, right?

kanitw commented 5 years ago

The question is do we want to provide a macro for this in Vega-Lite?

A few options to consider with some examples to begin conversation:

A) Inline Regression Transform

For example, I can see layer_point_line_regression have the following shorter form:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {
    "url": "data/movies.json"
  },
  "layer": [
    {
      "mark": {
        "type": "point",
        "filled": true
      },
      "encoding": {
        "x": {
          "field": "Rotten_Tomatoes_Rating",
          "type": "quantitative"
        },
        "y": {
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
      }
    },
    {
      "mark": {
        "type": "line",
        "color": "firebrick"
      },
      "encoding": {
        "x": {
          "field": "Rotten_Tomatoes_Rating",
          "type": "quantitative"
        },
        "y": {
          // if the regression property is with y, it's on x and vice-versa.  
          // other non groupby field can be used as x-y
          "regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., }
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
      }
    }
  ]
}

This is clearly more concise than the full form that we currently support, and it is consistent with aggregation / timeUnit, which have a short form in encoding. That said, one danger of this approach is that regression may not support clean combination with aggregate. (It's unclear what we should do if both aggregation and regression are specified.)
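
For comparison, the full form referred to here would look roughly like the sketch below. It assumes the regression transform that arrives in Vega-Lite 4 (hence the v4 schema URL, which differs from the v3 URL used in the proposals), with the transform's default linear method and default output field names.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch of the full form: explicit regression transform plus a line layer.",
  "data": {"url": "data/movies.json"},
  "layer": [
    {
      "mark": {"type": "point", "filled": true},
      "encoding": {
        "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
        "y": {"field": "IMDB_Rating", "type": "quantitative"}
      }
    },
    {
      "transform": [
        {"regression": "IMDB_Rating", "on": "Rotten_Tomatoes_Rating"}
      ],
      "mark": {"type": "line", "color": "firebrick"},
      "encoding": {
        "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
        "y": {"field": "IMDB_Rating", "type": "quantitative"}
      }
    }
  ]
}
```

The x and y encodings appear twice in this form; that repetition is what the shorter proposals try to remove.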

B) Composite Mark

This would be another option, which would work well for supporting regression line in polestar.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {
    "url": "data/movies.json"
  },
  "mark": {
    "type": "point",
    "filled": true,
    "loess/regression": true | {
       "mark": "line" // this should be implicit by default
       "type": "linear" // default
       "on": "x": // default  (on: "y" would be a transpose of this)
    }
  },
    "encoding": {
      "x": {
        "field": "Rotten_Tomatoes_Rating",
        "type": "quantitative"
      },
      "y": {
        "field": "IMDB_Rating",
        "type": "quantitative"
      }
    }
}

domoritz commented 5 years ago

I agree that we want to have a high-level mark or encoding property rather than relying on transforms.

A problem with the second approach is that it's hard to just have a regression line and no points. I'm also not a huge fan of the "on" property as it creates a link to the encodings and then the question is why we don't just put the regression in the encoding.

C) a concise way to write A)

{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {
    "url": "data/movies.json"
  },
  "encoding": {
        "x": {
          "field": "Rotten_Tomatoes_Rating",
          "type": "quantitative"
        },
        "y": {
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
  },
  "layer": [
    {
      "mark": {
        "type": "point",
        "filled": true
      }
    },
    {
      "mark": {
        "type": "line",
        "color": "firebrick"
      },
      "encoding": {
        "y": {
          // if the regression property is with y, it's on x and vice-versa.  
          // other non groupby field can be used as x-y
          "regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., }
          // maybe we can remove the code below since we are defining the encoding at the layer level already
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
      }
    }
  ]
}

That said, one danger of this approach is that regression may not support clean combination with aggregate. (It's unclear what we should do if both aggregation and regression are specified.)

We already have this with binning and aggregation.

kanitw commented 5 years ago

A problem with the second approach is that it's hard to just have a regression line and no points.

That's a good point. However, we should avoid overriding encoding, as the code becomes harder to read when there are overriding parts. (We actually throw a warning when overriding exists.)

With proposal C), users have to read the outer encoding and inner encoding separately and process the merging (namely that inner y is used for line, and the regression still applies "on" the outer x, which doesn't get replaced). So we should definitely avoid it.

D) Regression in Mark (without extra Mark)

To avoid overriding y-encoding, I think it's better to put regression in mark, akin to ggplot2's geom_smooth.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {
    "url": "data/movies.json"
  },
  "encoding": {
        "x": {
          "field": "Rotten_Tomatoes_Rating",
          "type": "quantitative"
        },
        "y": {
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
  },
  "layer": [
    {
      "mark": {
        "type": "point",
        "filled": true
      }
    },
    {
      "mark": {
        "type": "line",
        "color": "firebrick",
        "regression/loess": true | {
            method: 'linear', // linear by default
            order: ..., 
            extent: ...,
            "on/predictor": "x": // default  (on: "y" would be a transpose of this)
        }
      }
    }
  ]
}

Note that the current loess does not output a confidence interval band, but it might make sense to support that in the future. So we should see how this feature would interact with the errorbar/band macro that we may add (#4131).

kanitw commented 5 years ago

That said, one danger of this approach is that regression may not support clean combination with aggregate. (It's unclear what we should do if both aggregation and regression are specified.)

We already have this with binning and aggregation.

To clarify, if both aggregate and regression are specified in the same encoding, that should be an error. However, there is also a case where aggregate is on one encoding (e.g., 'x') and regression is on another (e.g., 'y'). Then it's unclear what to do (while the same thing with bin+aggregate is still pretty clear).
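
With explicit transforms the order is spelled out by the position in the transform array, so the ambiguity does not arise. The sketch below shows one possible reading (aggregate first, then regress on the aggregated values), assuming the Vega-Lite 4 regression transform; "mean_rating" is a hypothetical field name, and this is not a claim about what an inline macro should do.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch: regression applied after aggregation; mean_rating is a hypothetical field name.",
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "aggregate": [{"op": "mean", "field": "IMDB_Rating", "as": "mean_rating"}],
      "groupby": ["Rotten_Tomatoes_Rating"]
    },
    {"regression": "mean_rating", "on": "Rotten_Tomatoes_Rating"}
  ],
  "mark": {"type": "line", "color": "firebrick"},
  "encoding": {
    "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
    "y": {"field": "mean_rating", "type": "quantitative"}
  }
}
```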

In any case, regression/loess shows the relationship between x and y, not just either x or y -- so I'm leaning toward the regression macro that doesn't introduce an extra mark. Let's see if there are other aspects of D) that should be iterated on.

jheer commented 5 years ago

What is the strong argument for including this in either an encoding channel or mark? Isn’t this unnecessarily “complecting” the API? One transform plus a standard line mark doesn’t seem so bad, limits the surface area, and maintains modularity. I’d certainly feel a bit better if this were an encoding level directive that played nice with binning, aggregation etc, but if that’s not possible I’m not sure a new mark is necessary. (I’ve always had mixed feelings about the complected ggplot geoms that mix transforms and geometries, though I think they are a reasonable usability compromise for more complex layered forms like box plots and violin plots.)

Also keep in mind that the transform might evolve in the future; for example to generate a confidence interval alongside the regression values. I’d rather have transform plus line and area than specialized “smooth” and “ribbon” marks... but I’m interested in hearing other arguments!
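
As a rough illustration of the "one transform plus a standard line mark" form, here is a sketch using the loess transform, assuming the transform as it ships in Vega-Lite 4. The bandwidth of 0.5 is an arbitrary choice for illustration.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "Sketch: a single loess transform feeding a plain line mark; bandwidth 0.5 is arbitrary.",
  "data": {"url": "data/movies.json"},
  "transform": [
    {"loess": "IMDB_Rating", "on": "Rotten_Tomatoes_Rating", "bandwidth": 0.5}
  ],
  "mark": {"type": "line", "color": "firebrick"},
  "encoding": {
    "x": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
    "y": {"field": "IMDB_Rating", "type": "quantitative"}
  }
}
```

No layering is needed when only the smoothed line is shown; layers come in once the raw points are drawn alongside it.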

kanitw commented 5 years ago

What is the strong argument for including this in either an encoding channel or mark? Isn’t this unnecessarily “complecting” the API? One transform plus a standard line mark doesn’t seem so bad, limits the surface area, and maintains modularity.

I think the main argument for a macro is concision (independent of argument for a proper solution).

Consider errorbar/band, which is also not so bad as separate transforms and layers (one can just make an error bar/band with rules). However, users are still not content with having to write layers with transforms and repetitive encodings. Even with the macro that we already have, users still expect to avoid manual layering, as we discuss in https://github.com/vega/vega-lite/issues/4422.

I'd argue that if we do a macro for errorbar (and already did so even for simpler things like line's point overlay), then it's a bit inconsistent to argue that regression/loess (esp. with CI output) isn't complex enough to justify considering a macro for it.

Also, requiring layering will also make it hard for non-layer tools like PoleStar/Voyager/CompassQL to leverage regression features.

So I think we should consider whether there is a reasonable solution at all. (We could choose not to do it if there is no good solution.)


Also keep in mind that the transform might evolve in the future; for example to generate a confidence interval alongside the regression values

Definitely. I actually commented the same thing here.

In fact, once we have confidence intervals, the case for a macro that can do point + loess line/area (for CIs) becomes even more convincing than at the current stage. (At that point, it's definitely complex enough that the macro outweighs the cost of complecting the design, just like boxplot is complex enough.)


I’d certainly feel a bit better if this were an encoding level directive that played nice with binning, aggregation etc, but if that’s not possible I’m not sure a new mark is necessary.

That's actually possible. I'm a bit more ok with proposal C) if we don't allow encoding: {y: {regression: ...}} to stand alone without field/type like Dom suggested in the comment.

E) A more acceptable variant of C)

{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {
    "url": "data/movies.json"
  },
  "encoding": {
        "x": {
          "field": "Rotten_Tomatoes_Rating",
          "type": "quantitative"
        }
  },
  "layer": [
    {
      "mark": {
        "type": "point",
        "filled": true
      },
      "encoding": {
        "y": {
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
      }
    },
    {
      "mark": {
        "type": "line",
        "color": "firebrick"
      },
      "encoding": {
        "y": {
          "regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., }
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
      }
    }
  ]
}

We still need to deal with the following cases:

1) "aggregate is on one encoding (e.g., 'x') and regression is on another (e.g., 'y')". -- I guess we can either define that regression comes after aggregation or ban it entirely.

2) How to support the point + loess line/area (for CIs). This is still a big use case to consider.

I think regression as an inline transform is still a bit awkward for this case, as the line layer and the ranged area layer (once we have CIs) take different parts of the output from regression / loess. Plus, when we combine with the raw layer, we need to repeat the transform multiple times:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {
    "url": "data/movies.json"
  },
  "encoding": {
        "x": {
          "field": "Rotten_Tomatoes_Rating",
          "type": "quantitative"
        }
  },
  "layer": [
    {
      "mark": {
        "type": "point",
        "filled": true
      },
      "encoding": {
        "y": {
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
      }
    },
    {
      "mark": {
        "type": "line",
        "color": "firebrick"
      },
      "encoding": {
        "y": {
          "regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., }
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
      }
    },
    {
      "mark": {
        "type": "area/errorband",
      },
      "encoding": {
        "y": {
          "regression/loess": true | {method: 'linear' | ..., order: ..., extent: ..., } // with area/errorband, the regression will map CIs to the output?
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
      }
    }
  ]
}

F)

For point + loess line/area (for CIs), a composite mark akin to geom_smooth that combines line and ranged area might actually make sense as it no longer makes sense to just augment a primitive mark with a regression/loess macro. Alternatively, we can consider how regression may interact with errorband+line macro.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {
    "url": "data/movies.json"
  },
  "encoding": {
        "x": {
          "field": "Rotten_Tomatoes_Rating",
          "type": "quantitative"
        }, 
        "y": {
          "field": "IMDB_Rating",
          "type": "quantitative"
        }
  },
  "layer": [
    {"mark": "circle"},
    {
      "mark": {
        // mark A)
        "type": "regressionline/smooth" // regressionline is probably a more proper name for a mark than smooth?,
        "line": ..., // all line properties
        "errorband": ..., // all ranged-area properties
        "method": 'linear' | 'loess' | ..., // switch between regression and loess here,
        ... // other properties of loess / regression

        // mark B) -- if we follow the proposal in https://github.com/vega/vega-lite/issues/4422#issuecomment-496313449 and don't want to introduce a new mark
        "type": "errorband",
        "line": true
        "regression": {
           "method": 'linear' | 'loess' | ..., // switch between regression and loess here,
           ... // other properties of loess / regression
        }
      }
    }
  ]
}

I don't think any of these are the ideal solutions yet, but we can iterate more on these different ideas.

Note: Ribbon is simply a ranged area, so we would never need it in VL.

davidanthoff commented 4 years ago

I think a composite mark that does the line and area in one go would be my preferred solution. I think that is the most common scenario that users would want to create, so providing a concise option for that seems most valuable to me. Certainly having something roughly as short as geom_smooth is what would be most helpful for VegaLite.jl, and the reason I opened this issue in the first place :)