vega / vega-lite

A concise grammar of interactive graphics, built on Vega.
https://vega.github.io/vega-lite/
BSD 3-Clause "New" or "Revised" License

2D Boxplots #7822

Open saurabh-ironman opened 2 years ago

saurabh-ironman commented 2 years ago


Please refer to the sample below for building two-dimensional boxplots.

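# Assumes the VegaLite package is available (e.g. in a Livebook session).
alias VegaLite, as: Vl

# BoxplotStats.stats/1 is a helper whose implementation is not shown in this issue.
timeseries_values = [8.894, 15.023, 8.605, 8.278, 12.224]
{median, {q1, q3}, iqr, {lower_whiskers, upper_whiskers}, outliers} =
  timeseries_values |> BoxplotStats.stats()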

data_values = [25, 13, 22, 30, 60]
{data_median, {data_q1, data_q3}, data_iqr, {data_lower_whiskers, data_upper_whiskers}, data_outliers} =
  data_values |> BoxplotStats.stats()

data = [
  %{
    "event" => "AAA",
    "timeseries_median" => median,
    "timeseries_q1" => q1,
    "timeseries_q3" => q3,
    "timeseries_iqr" => iqr,
    "timeseries_lo_whisker" => lower_whiskers,
    "timeseries_up_whisker" => upper_whiskers,
    "timeseries_lo_outlier" => 1,
    "timeseries_up_outlier" => 20,
    "data_median" => data_median,
    "data_q1" => data_q1,
    "data_q3" => data_q3,
    "data_iqr" => data_iqr,
    "data_lo_whisker" => data_lower_whiskers,
    "data_up_whisker" => data_upper_whiskers,
    "data_lo_outlier" => 8,
    "data_up_outlier" => 50
  }
]

data |> inspect |> IO.puts()

Vl.new(height: 480, width: 500, title: "Composite Boxplot with Timeseries execution & corresponding data values")
|> Vl.layers([
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:bar, tooltip: true)
  |> Vl.encode(:size, value: 20)
  |> Vl.encode_field(:x, "timeseries_q1", type: :quantitative, title: "TimeSeries")
  |> Vl.encode_field(:x2, "timeseries_q3")
  |> Vl.encode_field(:y, "data_q1", type: :quantitative, title: "Data")
  |> Vl.encode_field(:y2, "data_q3"),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, color: :white, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative)
  |> Vl.encode_field(:y, "data_q1", type: :quantitative)
  |> Vl.encode_field(:y2, "data_q3", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, color: :white, tooltip: true)
  |> Vl.encode_field(:y, "data_median", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_q1", type: :quantitative)
  |> Vl.encode_field(:x2, "timeseries_q3", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, size: 2, ticks: true, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_q1", type: :quantitative)
  |> Vl.encode_field(:x2, "timeseries_lo_whisker")
  |> Vl.encode_field(:y, "data_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, size: 2, ticks: true, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_q3", type: :quantitative)
  |> Vl.encode_field(:x2, "timeseries_up_whisker")
  |> Vl.encode_field(:y, "data_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, size: 2, ticks: true, tooltip: true)
  |> Vl.encode_field(:y, "data_q1", type: :quantitative)
  |> Vl.encode_field(:y2, "data_lo_whisker")
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, size: 2, ticks: true, tooltip: true)
  |> Vl.encode_field(:y, "data_q3", type: :quantitative)
  |> Vl.encode_field(:y2, "data_up_whisker")
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:tick, color: :black, size: 16, thickness: 2, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_lo_whisker", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:tick, color: :black, size: 16, thickness: 2, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_up_whisker", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:tick, color: :black, size: 16, thickness: 2, tooltip: true, orient: :horizontal)
  |> Vl.encode_field(:y, "data_lo_whisker", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:tick, color: :black, size: 16, thickness: 2, tooltip: true, orient: :horizontal)
  |> Vl.encode_field(:y, "data_up_whisker", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:point, color: :red, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_lo_outlier", type: :quantitative)
  |> Vl.encode_field(:y, "data_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:point, color: :red, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_up_outlier", type: :quantitative)
  |> Vl.encode_field(:y, "data_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:point, color: :red, tooltip: true)
  |> Vl.encode_field(:y, "data_lo_outlier", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:point, color: :red, tooltip: true)
  |> Vl.encode_field(:y, "data_up_outlier", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative)
])

The screenshot below shows the sample 2D boxplot that we are proposing as an enhancement: image
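The snippet above calls `BoxplotStats.stats/1`, but that module is not shown in this issue. As a rough sketch of what such a helper might compute, here is a hypothetical `BoxplotStatsSketch` module that returns a tuple of the same shape. It assumes quartiles by linear interpolation and Tukey-style whiskers at 1.5 × IQR; the real helper may use different conventions.

```elixir
defmodule BoxplotStatsSketch do
  # Hypothetical stand-in for the BoxplotStats module used in the issue.
  # Returns {median, {q1, q3}, iqr, {lower_whisker, upper_whisker}, outliers}.
  def stats(values) when is_list(values) and values != [] do
    sorted = Enum.sort(values)
    q1 = quantile(sorted, 0.25)
    median = quantile(sorted, 0.5)
    q3 = quantile(sorted, 0.75)
    iqr = q3 - q1

    # Assumption: Tukey fences. Points farther than 1.5 * IQR from the box
    # are treated as outliers.
    lo_fence = q1 - 1.5 * iqr
    hi_fence = q3 + 1.5 * iqr
    {inside, outliers} = Enum.split_with(sorted, &(&1 >= lo_fence and &1 <= hi_fence))

    # Whiskers end at the most extreme data points still inside the fences.
    {median, {q1, q3}, iqr, {List.first(inside), List.last(inside)}, outliers}
  end

  # Quantile by linear interpolation between the two nearest ranks
  # of an already sorted list.
  defp quantile(sorted, p) do
    last = length(sorted) - 1
    pos = p * last
    lower = trunc(pos)
    frac = pos - lower
    lo = Enum.at(sorted, lower)
    hi = Enum.at(sorted, min(lower + 1, last))
    lo + frac * (hi - lo)
  end
end
```

With these assumptions, `BoxplotStatsSketch.stats([25, 13, 22, 30, 60])` returns `{25.0, {22.0, 30.0}, 8.0, {13, 30}, [60]}`, so the value 60 would be reported as an outlier rather than hard-coded.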

domoritz commented 2 years ago

2D box plots make sense. We also talked about them when we first added box plot support: https://github.com/vega/vega-lite/pull/2264#discussion_r113588771. I'm not sure what happened with that discussion. Maybe @kanitw remembers.

Depending on whether we want them in Vega-Lite, would you be willing to contribute a pull request for this feature?

saurabh-ironman commented 2 years ago

@domoritz Yes, I would like to contribute to this and would love to submit a PR with the changes. Please let me know if there is a specific process I need to follow. Also, if there are development documents that give an overview of the code base and explain how to make changes and open PRs, please point me to them. Thanks!

domoritz commented 2 years ago

Take a look at the contributing.md file.

saurabh-ironman commented 2 years ago

@domoritz Thanks! I will start going through this document to understand the process.

domoritz commented 2 years ago

You will only need to modify the normalizer. This is where we implement box plots.
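Vega-Lite's actual normalizer is written in TypeScript and its internals are not shown in this thread. The sketch below only illustrates the general idea of expanding one high-level "2D boxplot" description into layers of primitive marks, mirroring the hand-written layers in the original post. The module name `Boxplot2DSketch`, its functions, and the field-name convention (`<prefix>_q1`, `<prefix>_median`, and so on, as in the sample data) are all assumptions made for illustration.

```elixir
defmodule Boxplot2DSketch do
  # Hypothetical helper, not part of Vega-Lite: expands two field prefixes
  # into a layered spec (plain maps) made only of primitive marks.
  def expand(x_prefix, y_prefix) do
    x = fn stat -> "#{x_prefix}_#{stat}" end
    y = fn stat -> "#{y_prefix}_#{stat}" end

    %{
      "layer" => [
        # Box: a bar spanning q1..q3 on both axes.
        layer("bar", %{
          "x" => quant(x.("q1")),
          "x2" => field(x.("q3")),
          "y" => quant(y.("q1")),
          "y2" => field(y.("q3"))
        }),
        # Median lines across the box, one per axis.
        layer("rule", %{"x" => quant(x.("median")), "y" => quant(y.("q1")), "y2" => field(y.("q3"))}),
        layer("rule", %{"y" => quant(y.("median")), "x" => quant(x.("q1")), "x2" => field(x.("q3"))}),
        # Whiskers: from each box edge out to the whisker value,
        # positioned at the median of the other axis.
        whisker("x", "y", x.("q1"), x.("lo_whisker"), y.("median")),
        whisker("x", "y", x.("q3"), x.("up_whisker"), y.("median")),
        whisker("y", "x", y.("q1"), y.("lo_whisker"), x.("median")),
        whisker("y", "x", y.("q3"), y.("up_whisker"), x.("median"))
      ]
    }
  end

  defp whisker(axis, cross_axis, from, to, at) do
    layer("rule", %{
      axis => quant(from),
      (axis <> "2") => field(to),
      cross_axis => quant(at)
    })
  end

  defp layer(mark, encoding), do: %{"mark" => mark, "encoding" => encoding}
  defp quant(name), do: %{"field" => name, "type" => "quantitative"}
  defp field(name), do: %{"field" => name}
end
```

Calling `Boxplot2DSketch.expand("timeseries", "data")` yields a map with seven layers (box, two medians, four whiskers). Whisker end ticks and outlier points are left out to keep the sketch short.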

saurabh-ironman commented 2 years ago

@domoritz Sure, thank you for the pointers :)

kanitw commented 2 years ago

I question whether a 2D boxplot like the one in the screenshot is particularly useful.
Even the linked article doesn't seem to suggest this particular design.

saurabh-ironman commented 2 years ago

@kanitw Thanks for the question. Let me explain my reasons for suggesting the 2D boxplot as a new mark in Vega-Lite. This answer is a bit lengthy, but it covers the different aspects of the proposal.

Details/Description:

image

Yes, Boxplot has some drawbacks. Should you stop using it?

So yes, the boxplot has some well-known drawbacks, but it is STILL in pretty much all charting/graphing packages because, in the end, it is still useful for data visualization. Thus, 2D boxplots are worthy of inclusion.

Conclusion/summary:

I hope this helps explain the idea behind this enhancement.

Thanks, Saurabh

czrpb commented 2 years ago

Hi! There are a couple of us working on this, and I would like to emphasize that the goal is to come up with a visualization that helps in understanding distribution and variation in 2D.

So, we provided the link to some examples of using area in 2D to show aggregates, e.g. the bagplot. But there does not seem to be a "go-to" or standard chart for distributions in 2D. The point of the reply above is that, if boxplots are still an acceptable aggregate visualization, extending them to 2D is preferable to inventing a new visualization. We think it would be a useful addition to the examples given in the Distributions section of: https://vega.github.io/vega/examples/

Also, to describe the example data we are working with a bit: imagine we have processes that start and stop during some timeframe. During this time we collect a second set of data (in the example, CPU utilization). What we want to show, for many instances of these processes and their attendant second measure (CPU%), is the distribution of start/end times and the variation of the second measure. So, the example above shows that for Process A the start/end IQR is ~3 seconds to ~12.5 seconds and the CPU utilization IQR during this timeframe is ~19% to almost 60%. Now, imagine adding more processes to this chart or faceting per process, both of which we have implemented. After visualizing multiple processes and their second measure (CPU%), we can see which processes have a "large" area: these are processes with large variability in both duration and utilization, and the question is why.
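One way to make the "large area" idea concrete is to compute the area of each interquartile box and rank the processes by it. The sketch below assumes one row per process with the same field names as the sample data in the original post; the module `BoxAreaSketch` and the extra "BBB" row are invented purely for illustration.

```elixir
defmodule BoxAreaSketch do
  # Area of the interquartile box: (x-axis IQR) * (y-axis IQR).
  # Assumes the row uses the field names from the sample data above.
  def box_area(row) do
    (row["timeseries_q3"] - row["timeseries_q1"]) * (row["data_q3"] - row["data_q1"])
  end

  # Rows sorted from the largest box area to the smallest.
  def rank(rows), do: Enum.sort_by(rows, &box_area/1, :desc)
end

# Illustrative rows only; "BBB" is a made-up second process.
rows = [
  %{"event" => "AAA", "timeseries_q1" => 1, "timeseries_q3" => 3, "data_q1" => 10, "data_q3" => 20},
  %{"event" => "BBB", "timeseries_q1" => 1, "timeseries_q3" => 9, "data_q1" => 10, "data_q3" => 60}
]

rows
|> BoxAreaSketch.rank()
|> Enum.map(& &1["event"])
|> IO.inspect()
```

Note that the area mixes units (seconds times percent), so it is only meaningful for comparing processes within the same chart, not as an absolute measure.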

I'm sure we could have a fantastic discussion about other visualizations for this, but hopefully we have given some reasonable justification for 2D boxplots here (even if they are ultimately not accepted! :) ).

kanitw commented 2 years ago

Thanks for more explanation.

> So yes, the boxplot has some well-known drawbacks, but it is STILL in pretty much all charting/graphing packages because, in the end, it is still useful for data visualization. Thus, 2D boxplots are worthy of inclusion.

I think everyone would agree that standard (1D) boxplots are useful despite their drawbacks.

However, my problem with variants of 2D boxplots is that there doesn't seem to be a universally accepted one.
There are many variants, but it's unclear which one is the one worth implementing.

I strongly think that the effort would be better spent on bringing contour plots, which seem to be more widely used, to Vega-Lite.

Note that I'm totally OK with adding some variants of these 2D boxplots as Vega or Vega-Lite examples. They could be useful for people who prefer specific variants of 2D box plots. However, before we say that we support them as a built-in boxplot type, I'd like to see a more concrete proposal for which 2D boxplot to implement, and why we should prioritize such a chart type over a contour plot.