vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

maxbins breaks density plot #1932

Closed: dchudz closed this issue 4 years ago

dchudz commented 4 years ago

If I do this:

import pandas as pd
import altair as alt
import numpy as np

N = 20
# Two groups of N evenly spaced points on very different scales.
plot_df = pd.concat([
    pd.DataFrame({'x': np.linspace(1, 1.1, num=N), 'Type': ['A'] * N}),
    pd.DataFrame({'x': np.linspace(10, 11, num=N), 'Type': ['B'] * N}),
])

alt.Chart(
    plot_df,
).transform_density(
    'x',
    as_=['x', 'density'],
    groupby=['Type']
).mark_area(
).encode(
    alt.X('x'),
    alt.Y('density:Q'),
    alt.Color('Type:N')
)

...then I get the plot I expect:

[Image: Screen Shot 2020-01-26 at 9 04 36 PM]

But if I bin x with maxbins, the plot breaks:

alt.Chart(
    plot_df,
).transform_density(
    'x',
    as_=['x', 'density'],
    groupby=['Type']
).mark_area(
).encode(
    alt.X('x',
          bin=alt.Bin(maxbins=50),
         ),
    alt.Y('density:Q'),
    alt.Color('Type:N')
)

[Image: Screen Shot 2020-01-26 at 9 04 48 PM]

I assume this is really a Vega-Lite issue (or my own misunderstanding... sorry if so), but I don't know anything about using Vega-Lite outside of Altair, so I'm filing it here.

dchudz commented 4 years ago

Here's another case, similar to the layered histogram example from the docs (https://altair-viz.github.io/gallery/layered_histogram.html). With maxbins=150 it works:

alt.Chart(
    plot_df,
).mark_area(
    interpolate='step'
).encode(
    alt.X('x',
          bin=alt.Bin(maxbins=150),
         ),
    alt.Y('count()', stack=None, scale=alt.Scale(domain=[0,20])),
    alt.Color('Type:N')
)

[Image: Screen Shot 2020-01-26 at 9 18 31 PM]

But if I change maxbins to 50, it's broken:

alt.Chart(
    plot_df,
).mark_area(
    interpolate='step'
).encode(
    alt.X('x',
          bin=alt.Bin(maxbins=50),
         ),
    alt.Y('count()', stack=None, scale=alt.Scale(domain=[0,20])),
    alt.Color('Type:N')
)

[Image: Screen Shot 2020-01-26 at 9 18 55 PM]

jakevdp commented 4 years ago

I suspect this issue stems from a misunderstanding of what the density transform does. The transform returns an adaptive x-grid along with the density computed at each point. When you bin the result in x along with a count() aggregate, the result is a bar chart showing the number of grid points within each bin. Of course, if you have many more bins than grid points, most of the values will be zero, because most bins do not contain a grid point.
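To make that concrete, here is a rough sketch in plain Python (no Altair or Vega involved) of what happens when grid points are counted per bin. The evenly spaced `grid` and the helper `count_points_per_bin` are illustrative assumptions, not the transform's actual output or any library API; the real grid is chosen by Vega.

```python
def count_points_per_bin(points, n_bins):
    """Count how many points fall into each of n_bins equal-width bins."""
    lo, hi = min(points), max(points)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for p in points:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((p - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical stand-in for the density transform's x-grid:
# 20 evenly spaced positions between 1 and 11 (an assumption).
grid = [1.0 + 10.0 * k / 19 for k in range(20)]

few_bins = count_points_per_bin(grid, n_bins=10)
many_bins = count_points_per_bin(grid, n_bins=100)

print("empty bins out of 10: ", few_bins.count(0))   # 0
print("empty bins out of 100:", many_bins.count(0))  # 80
```

With 10 bins every bin holds two grid points, but with 100 bins only 20 of them can hold a point, so the remaining 80 counts are zero.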

I'm going to close this as it is working as intended – feel free to re-open if you still have questions.

dchudz commented 4 years ago

@jakevdp thanks, I appreciate the help, but I'm still confused, so I'm reopening (thanks for the offer):

It still seems like there's a bug with the binning. My examples were probably unclear (sorry) so here's an example I took straight from the docs (with small changes) and no density transform:

import pandas as pd
import altair as alt
import numpy as np

N = 20

# just like this example (https://altair-viz.github.io/gallery/layered_histogram.html),
# except I replaced the normals with `np.linspace`, and now N is 20 instead of 1000.
def draw_chart(trial_a_start, trial_a_end):
    # Generating Data
    source = pd.DataFrame({
        'Trial A': np.linspace(trial_a_start, trial_a_end, num=N),
        'Trial B': np.linspace(10, 11, num=N),
        'Trial C': np.linspace(11, 12, num=N),
    })

    return alt.Chart(source).transform_fold(
        ['Trial A', 'Trial B', 'Trial C'],
        as_=['Experiment', 'Measurement']
    ).mark_area(
        opacity=0.3,
        interpolate='step'
    ).encode(
        alt.X('Measurement:Q', bin=alt.Bin(maxbins=100)),
        alt.Y('count()', stack=None),
        alt.Color('Experiment:N')
    )

[Image: Screen Shot 2020-01-29 at 8 19 03 AM]