vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

Non-deterministic behavior with `mark_boxplot` #1496

Closed wmayner closed 4 years ago

wmayner commented 5 years ago

The following code will not order the categories on the x-axis the same way on repeated runs:

import pandas as pd
import numpy as np
import altair as alt

d = pd.DataFrame({
    'x': np.repeat(np.arange(3), 10),
    'y': np.concatenate([np.random.normal(i, 0.1, 10) for i in range(3)]),
    'z': np.repeat(np.arange(3, 0, -1), 10),
}).sort_values(
    'z',
)

alt.Chart(
    d,
).mark_boxplot(
    opacity=1.0,
).encode(
    x=alt.X('x:N', sort=None),
    y='y:Q',
    color='z:N',
)

This does not seem to happen with mark_circle.

My understanding was that using sort=None with alt.X would respect the order in the data (i.e., z).

jakevdp commented 5 years ago

It appears that the non-determinism is coming from the data, not from Altair (repeating the plot with the same dataset always results in the same sort order). It seems that when sort is set to None, the exact order of the axes depends on the contents of the data, which seems reasonable to me. If you want a specific sort order that is consistent across different input datasets, you can specify it explicitly.

wmayner commented 5 years ago

I thought that might be what's going on, but I ruled it out because in all my repeats I never observed a case where a y value for x = 2 was lower than for x = 1; but then shouldn't x = 2 always be first (or last, depending on the convention that Altair's using)? Sometimes it's in the middle even though it has the highest values of y. Sorry if I'm missing something obvious here.

jakevdp commented 5 years ago

I don't think there's any obvious reason why it would be sorted one way or the other... it's more of an implementation detail of Vega/Vega-Lite.

My understanding is this: when you say sort=None, you're explicitly saying that the sort order does not matter, so internal implementation details may affect the order. If the order is important to you, you should supply a sort argument that is not None.

wmayner commented 5 years ago

I see. Then I guess this is more of a feature request—it would be nice to have a way of specifying that the order I want is the order that the data appears in. But it sounds like this may be a pain to implement, given that it depends on Vega-Lite implementation details.

Incidentally, the reason I want this is because I couldn't find out how to sort on multiple columns with EncodingSortField.

jakevdp commented 5 years ago

I guess I'm not understanding what you need... in the example you gave, you can specify whatever sort order for x you wish in the chart spec. Can you give an example of where you're unable to do that?

wmayner commented 5 years ago

Yes, with one sorting variable, Altair's sorting works fine, but if there were another column—say, w—then how can I sort on w first and then z (i.e., df.sort_values(['w', 'z'])) with EncodingSortField? As far as I can tell it takes only a single field.

michaelaye commented 5 years ago

I just discovered that this doesn't work, presumably because this boxplot implementation, like so many others, does not handle non-consecutive intervals on the x-axis:

import altair as alt
import pandas as pd
from vega_datasets import data

source = pd.read_json(data.population.url)

alt.Chart(source.sample(100)).mark_boxplot(extent='min-max').encode(
    x='age:O',
    y='people:Q'
)

The plot does not show, no error message either.

jakevdp commented 5 years ago

@michaelaye When I run your code, I see this (using Colab with the most recent version of Altair: https://colab.research.google.com/drive/1C36B4r9Wo_OWHDqlxxabp_mj-YyWsOLE):

visualization - 2019-07-12T215333 213

Can you share more details about what frontend you're using (JupyterLab, Jupyter Notebook, Nteract, Vegascope, etc.), what version of Altair, and whether there are any error messages in the Javascript console?

jakevdp commented 5 years ago

(Oh, totally separately: you can use data.population() in place of pd.read_json(data.population.url))

michaelaye commented 5 years ago

Interesting, I just had a passing case, but it was again when all bins were filled, so I think my presumption is correct. It's really fascinating that nobody has a working implementation of boxplot over time/non-consecutive data points.

Sure can provide more info, I was saving a detailed report for a new issue if you think it's warranted:

Here's the console error:

Screenshot 2019-07-13 15 58 10

My system:

conda info:


     active environment : py37
    active env location : /Users/klay6683/miniconda3/envs/py37
            shell level : 4
       user config file : /Users/klay6683/.condarc
 populated config files : /Users/klay6683/.condarc
                          /Users/klay6683/miniconda3/envs/py37/.condarc
          conda version : 4.7.5
    conda-build version : not installed
         python version : 3.7.3.final.0
       virtual packages : 
       base environment : /Users/klay6683/miniconda3  (writable)
           channel URLs : https://conda.anaconda.org/michaelaye/osx-64
                          https://conda.anaconda.org/michaelaye/noarch
                          https://conda.anaconda.org/conda-forge/osx-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/osx-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/klay6683/miniconda3/pkgs
                          /Users/klay6683/.conda/pkgs
       envs directories : /Users/klay6683/miniconda3/envs
                          /Users/klay6683/.conda/envs
               platform : osx-64
             user-agent : conda/4.7.5 requests/2.22.0 CPython/3.7.3 Darwin/18.6.0 OSX/10.14.5
                UID:GID : 273771:2260
             netrc file : None
           offline mode : False

jakevdp commented 5 years ago

Interesting... I can't reproduce that at all. Ran it several dozen times to get different random seeds. I tried running with smaller samples to try to reproduce your hypothesis of it being due to non-contiguous bins. I'm not sure how to help since I can't reproduce the issue myself.

Can you specify a particular random seed for which you see this problem?

(Also, your presumption that "nobody has a working implementation of boxplot over time/non-consecutive data points" is not accurate or useful here)

michaelaye commented 5 years ago

(Also, your presumption that "nobody has a working implementation of boxplot over time/non-consecutive data points" is not accurate or useful here)

You may have misread my intent. This is not a complaint but an attempt to understand why I have failed to find a plotting library that places boxes mathematically rather than categorically in a box plot. In that sense, stating this presumption might indeed be helpful, because it might point to a mismatch in expectations about what this kind of plot is actually supposed to do.

Not accurate? Please point to a plotting library that places boxplots at x-axis positions according to their values instead of as equidistant categories. I checked matplotlib, seaborn, plotly, holoviews, and bokeh. Also, your plot above shows regular x-axis points without any bin missing.

To be clearer, I removed the rows with age 25:

subsample = source.query("age!= 25")

and ran the code above on that subsample:

Screenshot 2019-07-13 19 14 28

This does not throw an error, but it does not do what I need: in this case I expect a hole, with no box placed at the x-axis value of 25; instead, the box for 30 appears where, mathematically, 25 should be. In other words, the boxes are placed as pure category bins, not at their mathematically correct linear positions. Not sure how to say it differently.

jakevdp commented 5 years ago

OK, so are you no longer seeing the error you reported?

michaelaye commented 5 years ago

No, the error is still there, for example using

subsample = source.sample(100, random_state=0)
alt.Chart(subsample).mark_boxplot(extent='min-max').encode(
    x='age:O',
    y='people:Q'
)

jakevdp commented 5 years ago

When I run that code I see this, using the most recent version of Altair:

visualization - 2019-07-13T192117 463

jakevdp commented 5 years ago

As to your other question, about identifying a plotting library that places boxes mathematically instead of categorically in a box plot: if you would like to force bins without data to be part of the x scale, in Altair you can use the scale domain argument:

subsample = source.query("age!= 25")
alt.Chart(subsample).mark_boxplot(extent='min-max').encode(
    x=alt.X('age:O', scale=alt.Scale(domain=np.arange(0, 95, 5).tolist())),
    y='people:Q'
)

visualization - 2019-07-13T192517 182

michaelaye commented 5 years ago

And by "most recent", do you mean 3.1.0 or GH master? On which frontend? I tried switching to the classic notebook using the /tree URL for the running JLab server, and it shows the same problem:

Screenshot 2019-07-13 20 35 13

jakevdp commented 5 years ago

I just switched to my Mac and tried on Safari, and I can see the behavior you reported (it works fine on Chrome and Firefox). It's not an Altair issue, but rather a Vega-Lite issue (you can see it here in the Vega editor).

I spent a while trying to find Safari's developer tools to attempt to diagnose the issue, but gave up because it's Saturday night :smile:

I would report this issue on the Vega-Lite issue tracker.

michaelaye commented 5 years ago

Thanks for creating the issue. I had trouble understanding how to minimize the spec; I first needed to learn all the vocabulary, like "spec". I have one quick question, if you'll allow me to abuse this GH issue once more: why does setting the x type to quantitative not produce the plot you created with alt.X('age:O', scale=alt.Scale(domain=np.arange(0, 95, 5).tolist()))? Isn't it conceptually the same idea, to use age at its face value for the axis scale?

I'm getting this when I try:

Screenshot 2019-07-15 16 58 30

It actually kind of works, because one can see that no box median is drawn exactly where I expect the holes; it's just that the graphic is messed up, so it's getting very close.

jakevdp commented 5 years ago

That looks like a bug in Vega-Lite's boxplot macro. Would you like to report it there?

michaelaye commented 5 years ago

Oh, so you are saying my understanding is correct and it should work? Sure, I can report it.

michaelaye commented 5 years ago

Reported in https://github.com/vega/vega-lite/issues/5259