johnmyleswhite / Vega.jl

A Julia package for generating visualizations in Vega

subplots/layer/faceting plots #73

Closed sglyon closed 8 years ago

sglyon commented 8 years ago

I often need to create figures with multiple subplots.

After poking around in the vega editor (mostly with this example), I think the way Vega handles this is by a combination of two things:

  1. Defining an x scale and a y scale for each subplot. The range of each scale determines the size of the region where that subplot is drawn (e.g. the range of the x scale determines the width of the subplot's x axis).
  2. Creating each subplot as a group mark, where the position of the subplot is determined by properties.enter.x and properties.enter.y of the group mark and its size by properties.enter.width and properties.enter.height. This is easier to see with an example:
  "scales": [
    {
      "name": "xOverview",
      "type": "time",
      "range": [0, 720],
      "domain": {"data": "sp500", "field": "date"}
    },
    {
      "name": "yOverview",
      "type": "linear",
      "range": [170, 0],
      "nice": true,
      "domain": {"data": "sp500", "field": "price"}
    },
    # other scales for other subplots
]
  "marks": [
    {
      "type": "group",
      "name": "overview",
      "properties": {
        "enter": {
          "y": {"value": 480},
          "x": {"value": 0},  # not needed, as this is the default
          "height": {"value": 170},
          "width": {"value": 720}
        }
      },
     # Describe the rest of the subplot in this group mark
    },
    # Other group marks for the other subplots
]

Here the subplot would start 480 pixels down from the top of the visualization and 0 pixels to the right of the left border (so flush with the left border). Also notice that properties.enter.height and properties.enter.width match the extents of the yOverview and xOverview scale ranges, respectively.

So in absolute coordinates of the whole visualization, this subplot spans (properties.enter.x, properties.enter.x + properties.enter.width) == (0, 720) in the x dimension and (properties.enter.y, properties.enter.y + properties.enter.height) == (480, 650) in the y dimension.
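
The arithmetic above can be sketched in Julia on a plain `Dict` that mirrors the group mark's JSON. This is only an illustration: `group_extent` is a hypothetical helper, not part of Vega.jl, and it assumes every property is given as a literal `{"value": ...}` entry.

```julia
# Hypothetical helper (not part of Vega.jl): compute the absolute pixel extent
# of a Vega group mark from its `properties.enter` block.
# Assumes each property is a literal {"value": n}; missing entries default to 0.
function group_extent(mark::AbstractDict)
    enter = mark["properties"]["enter"]
    getval(key, default) = haskey(enter, key) ? enter[key]["value"] : default
    x = getval("x", 0)
    y = getval("y", 0)
    width = getval("width", 0)
    height = getval("height", 0)
    return (x = (x, x + width), y = (y, y + height))
end

# The "overview" group mark from the snippet above ("x" omitted on purpose).
overview = Dict(
    "type" => "group",
    "name" => "overview",
    "properties" => Dict(
        "enter" => Dict(
            "y" => Dict("value" => 480),
            "height" => Dict("value" => 170),
            "width" => Dict("value" => 720),
        ),
    ),
)

extent = group_extent(overview)
println(extent.x)  # (0, 720)
println(extent.y)  # (480, 650)
```

A helper along these lines could be used to check that the group marks of two subplots do not overlap before combining them.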

I don't know if you have put much thought into how this could be supported within Vega.jl, but it is something I'd love to see happen and am happy to pitch in to make it work.

randyzwitch commented 8 years ago

Your link goes to a blank example; can you attach a full Vega spec or a picture?

In general, I want to support everything :)

sglyon commented 8 years ago

Oh yeah, sorry about that. The link probably went away because I edited part of it so it became "custom" instead of one of the example plots.

Here it is (the overview+detail plot in the interactive section of the new vega live editor)

{
  "width": 720,
  "height": 480,

  "signals": [
    {
      "name": "brush_start",
      "streams": [{
        "type": "@overview:mousedown", 
        "expr": "eventX()", 
        "scale": {"name": "xOverview", "invert": true}
      }]
    },
    {
      "name": "brush_end",
      "init": {"expr": "datetime('Jan 1 2000')"},
      "streams": [{
        "type": "@overview:mousedown, [@overview:mousedown, window:mouseup] > window:mousemove",
        "expr": "clamp(eventX(), 0, 720)",
        "scale": {"name": "xOverview", "invert": true}
      }]
    },
    {
      "name": "min_date", 
      "init": {"expr": "datetime('Jan 1 2000')"},
      "expr": "time(brush_start) === time(brush_end) ? datetime('Jan 1 2000') : min(brush_start, brush_end)"
    },
    {
      "name": "max_date", 
      "init": {"expr": "datetime('Mar 1 2010')"},
      "expr": "time(brush_start) === time(brush_end) ? datetime('Mar 1 2010') : max(brush_start, brush_end)"
    }
  ],

  "data": [
    {
      "name": "sp500", 
      "url": "data/sp500.csv",
      "format": {"type": "csv", "parse": {"price": "number", "date": "date"}}
    }
  ],

  "scales": [
    {
      "name": "xOverview",
      "type": "time",
      "range": [0, 720],
      "domain": {"data": "sp500", "field": "date"}
    },
    {
      "name": "yOverview",
      "type": "linear",
      "range": [70, 0],
      "nice": true,
      "domain": {"data": "sp500", "field": "price"}
    },
    {
      "name": "xDetail",
      "type": "time",
      "range": [0, 720],
      "domainMin": {"signal": "min_date"},
      "domainMax": {"signal": "max_date"}
    },
    {
      "name": "yDetail",
      "type": "linear",
      "range": [390, 0],
      "nice": true,
      "domain": {"data": "sp500", "field": "price"}
    }
  ],

  "marks": [
    {
      "type": "group",
      "name": "detail",
      "properties": {
        "enter": {
          "height": {"value": 390},
          "width": {"value": 720}
        }
      },
      "axes": [
        {"type": "x", "scale": "xDetail"},
        {"type": "y", "scale": "yDetail"}
      ],
      "marks": [
        {
          "type": "group",
          "properties": {
            "enter": {
              "height": {"field": {"group": "height"}},
              "width": {"field": {"group": "width"}},
              "clip": {"value": true}
            }
          },
          "marks": [
            {
              "type": "area",
              "from": {"data": "sp500"},
              "properties": {
                "update": {
                  "x": {"scale": "xDetail", "field": "date"},
                  "y": {"scale": "yDetail", "field": "price"},
                  "y2": {"scale": "yDetail", "value": 0},
                  "fill": {"value": "steelblue"}
                }
              }
            }
          ]
        }
      ]
    },

    {
      "type": "group",
      "name": "overview",
      "properties": {
        "enter": {
          "x": {"value": 0},
          "y": {"value": 430},
          "height": {"value": 70},
          "width": {"value": 720}
        }
      },
      "axes": [
        {"type": "x", "scale": "xOverview"}
      ],
      "marks": [
        {
          "type": "area",
          "from": {"data": "sp500"},
          "properties": {
            "update": {
              "x": {"scale": "xOverview", "field": "date"},
              "y": {"scale": "yOverview", "field": "price"},
              "y2": {"scale": "yOverview", "value": 0},
              "fill": {"value": "steelblue"}
            }
          }
        },
        {
          "type": "rect",
          "properties":{
            "enter":{
              "y": {"value": 0},
              "height": {"value":70},
              "fill": {"value": "#333"},
              "fillOpacity": {"value":0.2}
            },
            "update":{
              "x": {"scale": "xOverview", "signal": "brush_start"},
              "x2": {"scale": "xOverview", "signal": "brush_end"}
            }
          }
        }
      ]
    }

  ]
}
randyzwitch commented 8 years ago

Yeah, I have thought about this one. It feels like a mutating function, not a visualization per se, but at the same time I'm having a hard time thinking of how many visualizations this applies to. Would it just be time-series plots like area/line?

The other broader question is whether this fits into a layer type function. A few of the plots I've made like a rugplot aren't so interesting by themselves, but if you append a rugplot to another distribution it makes more sense.

sglyon commented 8 years ago

So really I think of this as more of a way to layout or nest other visualizations.

I think it could apply to any type of plot, though my use cases are almost always time-series line plots.

I'm not too sure what a layer api would mean. What ideas do you have for it?

randyzwitch commented 8 years ago

My idea for layer is the same as in Gadfly or other packages. It doesn't make sense to have a combined bar-and-line chart function, but there are occasions where you might want to overlay the two types.

So having a way of combining the data and marks from several visualizations into one would allow for each visualization to be "atomic" so-to-speak, then any combination of chart could be created. But you're right, the case you're outlining here isn't "on top" but rather in addition to.

sglyon commented 8 years ago

Ok, that concept of layers isn't totally orthogonal or unrelated to what I want to do.

While thinking about this I stumbled on a problem we will need to solve for either approach. Right now the data always enters the visualization in a data object where the name property defaults to "table". Unfortunately we rely on that name in many places. Do you have any ideas for how we can generalize the name field, without making users do too much name handling/passing on their own?

randyzwitch commented 8 years ago

I don't plan to make that an issue for the public facing API. I ran into that before, and I just added a keyword argument to add_data!

https://github.com/johnmyleswhite/Vega.jl/blob/master/src/intermediates/data.jl#L2

So in layer, we'd just test for values of "table" and make "table2", "table3", etc.
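
The "table2", "table3" idea can be sketched as below. This is a minimal illustration under assumptions: `next_table_name` is a hypothetical helper and is not claimed to be how Vega.jl implements it.

```julia
# Hypothetical helper (not part of Vega.jl): pick the first of
# "table", "table2", "table3", ... that is not already in use.
function next_table_name(existing; base = "table")
    base in existing || return base
    n = 2
    while string(base, n) in existing
        n += 1
    end
    return string(base, n)
end

println(next_table_name(String[]))             # table
println(next_table_name(["table"]))            # table2
println(next_table_name(["table", "table2"]))  # table3
```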

sglyon commented 8 years ago

OK cool. I was hoping we'd be able to find a solution where the user never has to think about it.

We might need to have a function that changes the data table name for a given visualization.

The reason is that I want to be able to create two visualizations independently, then position them as subplots after the fact. If they are both created using "table", we will need to go through the entire spec and rename all references to "table" so that the names don't clash when we combine the specs.
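
A rough sketch of such a renaming pass, assuming the spec is held as plain nested `Dict`s and `Vector`s with string keys (as parsed from JSON) rather than as Vega.jl's own types. `rename_refs` and `rename_data` are hypothetical names, and only `"data"` references are handled; other kinds of references, if any, would need the same treatment.

```julia
# Hypothetical helpers (not part of Vega.jl).
# rename_refs returns a copy in which every "data" => old reference,
# at any nesting depth, points at `new` instead.
rename_refs(x, old, new) = x  # numbers, strings, booleans: unchanged
rename_refs(v::AbstractVector, old, new) = Any[rename_refs(el, old, new) for el in v]
function rename_refs(d::AbstractDict, old, new)
    out = Dict{String,Any}()
    for (k, v) in d
        out[k] = (k == "data" && v == old) ? new : rename_refs(v, old, new)
    end
    return out
end

# rename_data also renames the data set itself in the top-level "data" array.
function rename_data(spec::AbstractDict, old, new)
    out = rename_refs(spec, old, new)
    for d in get(out, "data", Any[])
        if d["name"] == old
            d["name"] = new
        end
    end
    return out
end

spec = Dict(
    "data" => [Dict("name" => "table", "values" => [1, 2, 3])],
    "scales" => [Dict("name" => "x", "domain" => Dict("data" => "table", "field" => "x"))],
    "marks" => [Dict("type" => "line", "from" => Dict("data" => "table"))],
)

renamed = rename_data(spec, "table", "table2")
println(renamed["data"][1]["name"])            # table2
println(renamed["marks"][1]["from"]["data"])   # table2
println(spec["data"][1]["name"])               # table (the original is untouched)
```

Working on a copy keeps the two original visualizations usable on their own after they have been combined.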

tbreloff commented 8 years ago

I think Bokeh has the right solution here, which is to generate a random uuid string as the name.
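
A minimal sketch of the random-name idea in current Julia, using the `UUIDs` standard library. `unique_table_name` is a hypothetical helper, and the `"table_"` prefix is just an assumption to keep the names readable.

```julia
using UUIDs  # standard library; provides uuid4()

# Hypothetical helper (not part of Vega.jl): a data set name that is unique
# for all practical purposes, so two independently built specs never clash.
unique_table_name(prefix = "table") = string(prefix, "_", uuid4())

a = unique_table_name()
b = unique_table_name()
println(a != b)  # true
```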

randyzwitch commented 8 years ago

That's not a bad solution, I already create a random string for the div name so that one plot doesn't overwrite another in Jupyter notebook.

First one to submit a pull request gets to decide how it works :)

cndesantana commented 8 years ago

Hi all,

I ran into a similar problem. I want to plot a figure that combines a lineplot and a dotplot (dots and lines in the same figure). I was wondering if I could do something like this:

    x = [1:100]
    y1 = collect([1:100] + randn(100))
    y2 = collect([1:100] + randn(100))
    p1 = dotplot(x=x,y=y1);
    p2 = p1 + lineplot(x=x,y=y2);

Of course, it didn't work. But I understand you are talking here about an issue very similar to what I want. Did you have any progress here?

randyzwitch commented 8 years ago

No, + won't add the plots together.

I haven't had time to work on this, partially because I presume it's complicated. For the example you put though, does a lineplot with points work (2nd and 3rd examples)?

http://johnmyleswhite.github.io/Vega.jl/lineplot.html

Fundamentally, layering plots is no different than how I implemented the "points" keyword. The biggest issue is figuring out how to deal with aggregating multiple data series of different mark types, how to determine what the axes ranges should be, etc.

If you can post a more realistic dataset, or a picture of what you are trying to accomplish, I can start trying to hack out a solution.

cndesantana commented 8 years ago

Thanks for your response.

Actually I want to plot different information in the same plot. So the 2nd and 3rd examples of the documentation you mention are not useful in my case.

I want to study the "Efficient Frontier" of stock data, something like the figure below (that was made with Matlab).

[image: efffrontier]

In R I could do it by calling plot() followed by points(). Or using the "+".

Another idea that came to my mind was to define different primitives for different groups. In the same way we define that one group will be plotted in darkblue and the other one in lightblue, we could define that one group would be plotted as "dots" and the other group as "line"? Does it make any sense?

Thanks anyway for your effort and interest!

Best,


tbreloff commented 8 years ago

I know this isn't a good solution for Vega plots, but if you're desperate to make a layered plot, Plots.jl is extremely flexible for this sort of stuff. If you check out master, there's lots of support for the new Plotly javascript interface as well.

randyzwitch commented 8 years ago

@cndesantana Here's a basic working example:

using Vega

# I presume you can calculate this data yourself; I took it from an internet example
meanret = [0.0019, 0.0053, 0.0137, 0.0054, 0.0047, 0.0029]
sdret = [0.0453, 0.0602, 0.0808, 0.0546, 0.0265, 0.0582]
efmean = [0.0016, 0.0020, 0.0030, 0.0040, 0.0050, 0.0060, 0.0070, 0.0075, 0.0080, 0.0090, 0.0125]
efsd = [0.0250, 0.0255, 0.0274, 0.0299, 0.0328, 0.0361, 0.0408, 0.0436, 0.0467, 0.0533, 0.0801]

# Make scatterplot of actuals
s = scatterplot(y = meanret, x = sdret)

# Make ef line
eh = lineplot(x = efsd, y = efmean)

# Make names unique in ef line: the data table, the mark's source, and the scale domains
eh.data[1].name = eh.marks[1].from.data = eh.scales[1].domain.data = eh.scales[2].domain.data = eh.scales[3].domain.data = "table2"

# Since same axis range, just push data and line mark onto scatterplot graph
push!(s.data, eh.data[1])
push!(s.marks, eh.marks[1])

# Show graph
s

[image: download]

This highlights the basic problem I'll need to work through: every graph currently creates a data table named table, so I'll need to implement the UUID suggestion from earlier in the thread. As long as the axes share the same scale, it's fine, but as a generic solution I'll need to calculate the joint range of y/y2 and x/x2 across the plots, then use that as the axis limits.

Not insurmountable, but needs some work. Maybe I can knock this out Monday at work, I can't imagine I'll have a lot to do in the short holiday week!
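
The joint-range calculation mentioned above can be sketched on the raw vectors from the example. `joint_extent` is a hypothetical helper, not part of Vega.jl, and it ignores padding or "nice" rounding of the limits.

```julia
# Hypothetical helper (not part of Vega.jl): the smallest (min, max) interval
# that covers every series passed in.
joint_extent(series...) = (minimum(minimum, series), maximum(maximum, series))

meanret = [0.0019, 0.0053, 0.0137, 0.0054, 0.0047, 0.0029]
sdret = [0.0453, 0.0602, 0.0808, 0.0546, 0.0265, 0.0582]
efmean = [0.0016, 0.0020, 0.0030, 0.0040, 0.0050, 0.0060, 0.0070, 0.0075, 0.0080, 0.0090, 0.0125]
efsd = [0.0250, 0.0255, 0.0274, 0.0299, 0.0328, 0.0361, 0.0408, 0.0436, 0.0467, 0.0533, 0.0801]

xlims = joint_extent(sdret, efsd)      # x axis: standard deviations
ylims = joint_extent(meanret, efmean)  # y axis: mean returns
println(xlims)  # (0.025, 0.0808)
println(ylims)  # (0.0016, 0.0137)
```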

cndesantana commented 8 years ago

Totally amazing!! Thanks a lot, Randy!

Charles

randyzwitch commented 8 years ago

@cndesantana @spencerlyon2 (or even @tbreloff)

If you do Pkg.checkout("Vega"), there is a new function defined called layer(plot1::VegaVisualization, plot2::VegaVisualization). This function can be used as follows:

#Taken from internet example
meanret = [0.0019, 0.0053, 0.0137, 0.0054, 0.0047, 0.0029]
sdret = [0.0453, 0.0602, 0.0808, 0.0546, 0.0265, 0.0582]
efmean =  [0.0016, 0.0020, 0.0030, 0.0040, 0.0050, 0.0060, 0.0070, 0.0075, 0.0080, 0.0090, 0.0125] 
efsd = [0.0250, 0.0255, 0.0274, 0.0299, 0.0328, 0.0361, 0.0408, 0.0436, 0.0467, 0.0533, 0.0801];

a = layer(scatterplot(y = meanret, x = sdret), 
          colorscheme!(lineplot(x = efsd, y = efmean), palette = "purple")
         )

With the result:

[image: download 1]

I'd love to get feedback on this function, as I've only tested it with scatterplot and lineplot so far. I think it will work for any combination of plots that have X and Y axes. If you could play with it and let me know what works, what's weird (having to nest colorscheme! is awkward), and what plainly doesn't work, that would help me refine the function before making an announcement and adding it to the documentation.

sglyon commented 8 years ago

This is cool, good work.

What I was hoping for when I opened the issue is some way to create plots side by side. Something like the examples here: https://plot.ly/javascript/subplots/

Is this new function a first step in that direction?

randyzwitch commented 8 years ago

Thanks.

I'm definitely aware of what you are talking about; this layer function is really just the first step toward some higher-level layout functions.

In the background, every data table now gets its own unique name and marks that refer to the unique name. It should now be easier to define both the faceted example (subplots by a data value), then the arbitrary layout case.

randyzwitch commented 8 years ago

Total Fail:

Pass-ish

Total Pass

cndesantana commented 8 years ago

Thanks a lot! It is a great feature!!

One more comment. It seems that the colours of the groups are not picked up by the legend when we use the colorscheme! function. Slightly modifying the example you posted above:

    meanret = [0.0019, 0.0053, 0.0137, 0.0054, 0.0047, 0.0029]
    sdret = [0.0453, 0.0602, 0.0808, 0.0546, 0.0265, 0.0582]
    efmean =  [0.0016, 0.0020, 0.0030, 0.0040, 0.0050, 0.0060, 0.0070, 0.0075, 0.0080, 0.0090, 0.0125] 
    efsd = [0.0250, 0.0255, 0.0274, 0.0299, 0.0328, 0.0361, 0.0408, 0.0436, 0.0467, 0.0533, 0.0801];
    groups = [1, 2, 1, 2, 3, 3]

    a = layer(scatterplot(y = meanret, x = sdret, group = groups), 
          colorscheme!(lineplot(x = efsd, y = efmean), palette = "purple") )

The resulting figure has dots in 3 different colours, according to the 3 different groups. However the legend shows only 1 colour (the colour of the lineplot).

[image: wrong_legend]

If we remove the parameter 'palette = "purple" ' from the colorscheme! function, the line and the legend are both coloured in "black".

[image: legend_black]

However, the legend is plotted correctly if I don't use the colorscheme! function. On the other hand, I then cannot set the colour of the lineplot. For example, if I do:

    b = layer(scatterplot(y = meanret, x = sdret, group = groups), lineplot(x = efsd, y = efmean))

[image: correct_legend]

randyzwitch commented 8 years ago

Thanks for testing; it definitely needs some work.

randyzwitch commented 8 years ago

Closing this since it's meandering; I will open a new meta issue with some concrete tasks.