2D Boxplots

saurabh-ironman commented 2 years ago

Please:

[ ] Check for duplicate issues. Please file separate requests as separate issues on GitHub.

I checked the existing issues and did not find this enhancement request.
[ ] Describe the feature's goal, motivating use cases, and its expected behavior.

The boxplots we are familiar with visualize one-dimensional (1D) distributions: they display the distribution of data along a single axis, horizontal or vertical. However, there are several boxplot variations that display the distribution of data over two, three, or even more dimensions. A 2D boxplot uses both the x and y axes to plot two sets of data, so it can show how the two vary in conjunction with each other. This is where a 2D boxplot as a built-in Vega-Lite mark would be useful. Please see the link below for reference. The idea is to create a 2D boxplot that uses the same traditional boxes: http://datavizcatalogue.com/blog/multidimensional-boxplot-variations/
[ ] If you are proposing a new syntax, please provide at least one example spec, wrapped by triple backticks like this:
```
{
"mark": "point",
"encoding": {"x": {"field": "a"}}
}
```

NA. The new 2D boxplot mark will be developed with the existing TypeScript syntax.

You are encouraged to prototype multiple alternative syntaxes for your proposed feature. Doing so often leads to a better design.

[ ] If applicable, include screenshots, GIF videos (e.g. using https://www.cockos.com/licecap/), or working example (e.g. example Vega specification for the requested feature)

Please refer to the sample below, which builds a 2D boxplot from layered marks.

```elixir
alias VegaLite, as: Vl

# BoxplotStats is the author's own helper module (not shown in this issue).
# It returns {median, {q1, q3}, iqr, {lower_whisker, upper_whisker}, outliers}.
timeseries_values = [8.894, 15.023, 8.605, 8.278, 12.224]
{median, {q1, q3}, iqr, {lower_whiskers, upper_whiskers}, outliers} =
  timeseries_values |> BoxplotStats.stats()

data_values = [25, 13, 22, 30, 60]
{data_median, {data_q1, data_q3}, data_iqr, {data_lower_whiskers, data_upper_whiskers}, data_outliers} =
  data_values |> BoxplotStats.stats()

# One row per event. The *_outlier values are hard-coded sample positions;
# the computed `outliers` and `data_outliers` lists are not used here.
data = [
  %{
    "event" => "AAA",
    "timeseries_median" => median,
    "timeseries_q1" => q1,
    "timeseries_q3" => q3,
    "timeseries_iqr" => iqr,
    "timeseries_lo_whisker" => lower_whiskers,
    "timeseries_up_whisker" => upper_whiskers,
    "timeseries_lo_outlier" => 1,
    "timeseries_up_outlier" => 20,
    "data_median" => data_median,
    "data_q1" => data_q1,
    "data_q3" => data_q3,
    "data_iqr" => data_iqr,
    "data_lo_whisker" => data_lower_whiskers,
    "data_up_whisker" => data_upper_whiskers,
    "data_lo_outlier" => 8,
    "data_up_outlier" => 50
  }
]

data |> inspect |> IO.puts()

Vl.new(height: 480, width: 500, title: "Composite Boxplot with Timeseries execution & corresponding data values")
|> Vl.layers([
  # Box: spans the IQR on both axes (x: timeseries, y: data).
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:bar, tooltip: true)
  |> Vl.encode(:size, value: 20)
  |> Vl.encode_field(:x, "timeseries_q1", type: :quantitative, title: "TimeSeries")
  |> Vl.encode_field(:x2, "timeseries_q3")
  |> Vl.encode_field(:y, "data_q1", type: :quantitative, title: "Data")
  |> Vl.encode_field(:y2, "data_q3"),
  # Median line of the timeseries (vertical, inside the box).
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, color: :white, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative)
  |> Vl.encode_field(:y, "data_q1", type: :quantitative)
  |> Vl.encode_field(:y2, "data_q3", type: :quantitative),
  # Median line of the data (horizontal, inside the box).
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, color: :white, tooltip: true)
  |> Vl.encode_field(:y, "data_median", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_q1", type: :quantitative)
  |> Vl.encode_field(:x2, "timeseries_q3", type: :quantitative),
  # Lower whisker along x.
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, size: 2, ticks: true, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_q1", type: :quantitative)
  |> Vl.encode_field(:x2, "timeseries_lo_whisker")
  |> Vl.encode_field(:y, "data_median", type: :quantitative),
  # Upper whisker along x.
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, size: 2, ticks: true, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_q3", type: :quantitative)
  |> Vl.encode_field(:x2, "timeseries_up_whisker")
  |> Vl.encode_field(:y, "data_median", type: :quantitative),
  # Lower whisker along y.
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, size: 2, ticks: true, tooltip: true)
  |> Vl.encode_field(:y, "data_q1", type: :quantitative)
  |> Vl.encode_field(:y2, "data_lo_whisker")
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  # Upper whisker along y.
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:rule, size: 2, ticks: true, tooltip: true)
  |> Vl.encode_field(:y, "data_q3", type: :quantitative)
  |> Vl.encode_field(:y2, "data_up_whisker")
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  # End caps of the x whiskers.
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:tick, color: :black, size: 16, thickness: 2, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_lo_whisker", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:tick, color: :black, size: 16, thickness: 2, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_up_whisker", type: :quantitative),
  # End caps of the y whiskers.
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:tick, color: :black, size: 16, thickness: 2, tooltip: true, orient: :horizontal)
  |> Vl.encode_field(:y, "data_lo_whisker", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:tick, color: :black, size: 16, thickness: 2, tooltip: true, orient: :horizontal)
  |> Vl.encode_field(:y, "data_up_whisker", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  # Outliers along x, drawn at the data median.
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:point, color: :red, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_lo_outlier", type: :quantitative)
  |> Vl.encode_field(:y, "data_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:point, color: :red, tooltip: true)
  |> Vl.encode_field(:x, "timeseries_up_outlier", type: :quantitative)
  |> Vl.encode_field(:y, "data_median", type: :quantitative),
  # Outliers along y, drawn at the timeseries median.
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:point, color: :red, tooltip: true)
  |> Vl.encode_field(:y, "data_lo_outlier", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(data)
  |> Vl.mark(:point, color: :red, tooltip: true)
  |> Vl.encode_field(:y, "data_up_outlier", type: :quantitative)
  |> Vl.encode_field(:x, "timeseries_median", type: :quantitative)
])
```
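The sample above calls `BoxplotStats.stats/1`, which is not shown in this issue. The following is a minimal stand-in sketch, not the original module. It assumes quartiles computed by linear interpolation between closest ranks and whiskers that end at the most extreme values inside the 1.5 × IQR fences; the original helper may use a different quartile method and return different numbers.

```elixir
defmodule BoxplotStats do
  # Hypothetical stand-in for the helper used in the sample above.
  # Returns {median, {q1, q3}, iqr, {lower_whisker, upper_whisker}, outliers}.
  def stats(values) when is_list(values) and values != [] do
    sorted = Enum.sort(values)
    q1 = quantile(sorted, 0.25)
    median = quantile(sorted, 0.5)
    q3 = quantile(sorted, 0.75)
    iqr = q3 - q1

    # Tukey fences: values more than 1.5 * IQR away from the box are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    {inside, outliers} =
      Enum.split_with(sorted, &(&1 >= lower_fence and &1 <= upper_fence))

    # Whiskers end at the most extreme values that are still inside the fences.
    {median, {q1, q3}, iqr, {List.first(inside), List.last(inside)}, outliers}
  end

  # Linear interpolation between the two ranks around position p * (n - 1).
  # This quartile method is an assumption, not taken from the issue.
  defp quantile(sorted, p) do
    last = length(sorted) - 1
    pos = p * last
    lower = floor(pos)
    frac = pos - lower
    lo = Enum.at(sorted, lower)
    hi = Enum.at(sorted, min(lower + 1, last))
    lo + frac * (hi - lo)
  end
end

# Usage with the sample values from the issue.
[25, 13, 22, 30, 60] |> BoxplotStats.stats() |> IO.inspect()
```

Under these assumptions the sample `data_values` give quartiles of 22.0 and 30.0, so the upper fence is 42.0 and the value 60 is reported as an outlier rather than as the upper whisker.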

Please refer to the screenshot below for the sample 2D boxplot that we are proposing as an enhancement:

domoritz commented 2 years ago

2D box plots make sense. We also talked about them when we first added box plot support: https://github.com/vega/vega-lite/pull/2264#discussion_r113588771. Not sure what happened with the discussion. Maybe @kanitw remembers.

Depending on whether we want them in Vega-Lite, would you be willing to contribute a pull request for this feature?

saurabh-ironman commented 2 years ago

@domoritz Yes, I would like to contribute to this and would love to submit a PR with the changes. Please let me know if there is a specific process I need to follow. Also, if there are development documents that give an overview of the code base and explain how to make changes and open PRs, please point me to them. Thanks!

domoritz commented 2 years ago

Take a look at the contributing.md file.

saurabh-ironman commented 2 years ago

@domoritz Thanks! I will start going through this document to understand the process.

domoritz commented 2 years ago

You will only need to modify the normalizer. This is where we implement box plots.

saurabh-ironman commented 2 years ago

@domoritz sure thank you for the pointers :)

kanitw commented 2 years ago

I question whether a 2D boxplot like the one in the screenshot is particularly useful.
Even the linked article doesn't seem to suggest this particular design.

saurabh-ironman commented 2 years ago

@kanitw Thanks for the question. Let me explain my reasons for suggesting the 2D boxplot as a new mark in Vega-Lite. This answer is a bit lengthy, but it covers the different aspects of the proposal.

Details/Description:

A traditional (1D) boxplot helps visualize the distribution of data from millions of systems, but only along one axis, horizontal or vertical.
With multiple datasets, we want to visualize more than one dataset at a time, which means more than one scale or axis.
The idea is to represent the distribution of one dataset on the x axis and the distribution of another on the y axis.
This helps us visualize the impact of one dataset on another.
- For example, we have an application/utility running on multiple systems and we want to visualize the utility's CPU usage together with its runtime.
This is where a 1D boxplot is limited: it can show either runtimes or CPU usage against one axis, not both at once.
In the linked article, the Rangefinder Plot and the Bagplot help visualize more than one dataset, one along each axis. Of course, the authors of the article use their own marks to visualize the data, whereas we would like to use the same traditional boxplot mark to visualize two-dimensional data. With two-way whiskers, outliers, and medians, we can answer some of the most obvious questions, such as:
- What is the typical (median) runtime or CPU usage of the application/utility during the execution time?
- How many outliers are there? Is there any case where the utility took longer than expected to complete the run?
- When is CPU usage high: in the middle of the utility's execution, or at the end? This helps the team investigate what goes wrong at the end of the execution, which functions or operations cause the CPU spike, and so on.
Refer to the screenshot below for a sample:

Yes, Boxplot has some drawbacks. Should you stop using it?

- Box plots require audiences to grasp complex concepts that they don't need to understand in most cases.
- Box plots conceal information that's usually crucial to see.
- Box plots hide multimodality and other features of distributions.

Reference links:
- ive-stopped-using-box-plots-should-you
- the-box-and-whisker-plot-for-grown-ups

So yes, boxplot has some well-known drawbacks, but it is STILL in pretty much all charting/graphing packages because it is still in the end useful for data visualization. Thus, 2D boxplots are worthy inclusion.

Conclusion/summary:

A 2D boxplot can greatly help in analyzing data skewness. It is especially helpful for engineering teams that analyze more than one dataset at a time and try to correlate and understand data patterns in conjunction with each other.
Despite its drawbacks, the boxplot is still useful for data visualization, and a 2D variant would be a good addition to the Vega-Lite library.

Hope this helps you in understanding the idea for this enhancement.

Thanks, Saurabh

czrpb commented 2 years ago

Hi! There are a couple of us working on this and I would like to emphasize the point that the goal is to come up with a visualization that helps understand distribution & variation in 2d.

So, we provided the link to some examples of using area in 2d to show aggregates; ex: bagplot. But, there does not seem to be a "go to" or standard chart for distribution in 2d. The above reply says that instead of creating a new visualization we feel that if boxplots are still an acceptable aggregate visualization, then extending them to 2d should be preferable to coming up with a new one; we think it would be a useful addition to the examples given in the Distributions section of: https://vega.github.io/vega/examples/

Also, to describe the example data we are working with: imagine we have processes that start and stop during some timeframe. During this time we have a second set of data (CPU utilization in the example). What we want to show, across many instances of these processes and their attendant second measure (CPU%), is the distribution of start/end times and the variation of the second measure. So, the example above shows that for Process A the start/end IQR is ~3 seconds to ~12.5 seconds and the CPU utilization IQR during this timeframe is ~19% to almost 60%. Now, imagine adding more processes to this chart, or faceting per process, both of which we have implemented. After visualizing multiple processes and their second measure (CPU%), we can see that we should wonder about the processes with a "large" area: these are processes with large variability in duration and utilization. Why?

I'm sure we could have a fantastic discussion about other visualizations for this, but hopefully we have given some reasonable justification here for 2D boxplots (even if they are ultimately not accepted! :) ).
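The "large area" reading described above can be made concrete with a small sketch. The `BoxArea` module and the map keys are hypothetical names introduced for illustration. Process A uses the approximate IQR bounds quoted in this comment; Process B is made-up data added only to show the ranking.

```elixir
defmodule BoxArea do
  # Hypothetical helper: area of the IQR box in a 2D boxplot,
  # i.e. (x IQR) * (y IQR).
  def iqr_area(%{x_q1: x_q1, x_q3: x_q3, y_q1: y_q1, y_q3: y_q3}) do
    (x_q3 - x_q1) * (y_q3 - y_q1)
  end

  # Processes with the largest box first: these vary most on both axes.
  def rank(processes) do
    Enum.sort_by(processes, &iqr_area/1, :desc)
  end
end

processes = [
  # Process B: illustrative values only, not from the issue.
  %{name: "B", x_q1: 4.0, x_q3: 6.0, y_q1: 30.0, y_q3: 35.0},
  # Process A: approximate bounds quoted in the comment (seconds, CPU%).
  %{name: "A", x_q1: 3.0, x_q3: 12.5, y_q1: 19.0, y_q3: 60.0}
]

processes
|> BoxArea.rank()
|> Enum.each(fn p -> IO.puts("#{p.name}: #{BoxArea.iqr_area(p)}") end)
```

The area is only a rough screening measure: a box can be large because of spread on one axis alone, so the two IQRs are still worth reading separately.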

kanitw commented 2 years ago

Thanks for the additional explanation.

> So yes, boxplot has some well-known drawbacks, but it is STILL in pretty much all charting/graphing packages because it is still in the end useful for data visualization. Thus, 2D boxplots are worthy inclusion.

I think everyone would agree that standard (1D) boxplots are useful despite their drawbacks.

However, my problem with variants of 2D boxplots is that there doesn't seem to be a universally accepted one.
There are many variants, but it's unclear which one is worth implementing.

I strongly think that the effort would be better spent on bringing contour plots, which seem to be more widely used, to Vega-Lite.

Note that I'm totally OK with adding some variants of these 2D boxplots as Vega or Vega-Lite examples. They could be useful for people who prefer specific variants of 2D box plots. However, before we say that we support them as a built-in boxplot type, I'd like to see a more concrete proposal for which 2D boxplot to implement, and why we should prioritize such a chart type over a contour plot.

vega / vega-lite

2D Boxplots #7822
