vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License
9.36k stars 793 forks source link

Make `.stack()` work the same as `.bin()` #2982

Open joelostblom opened 1 year ago

joelostblom commented 1 year ago

With the new method-based encoding option syntax, it is enough to write .bin() instead of .bin(True), but it seems like we still have to write .stack(True) as just .stack() does not work. Not sure if this is an easy fix of not. Example spec:

import pandas as pd
import altair as alt

source = pd.DataFrame(
    {"category": ["a", "b", "c", "d", "e", "f"], "value": [4, 6, 10, 3, 7, 8]}
)

alt.Chart(source).mark_arc(outerRadius=120).encode(
    theta=alt.Theta("value:Q").stack(True),  # just stack() does not work
    color=alt.Color("category:N").legend(None),
)

The error with just .stack() is:

SchemaValidationError: '{}' is an invalid value for `stack`:

{} is not one of ['zero', 'center', 'normalize']
{} is not of type 'string'
{} is not of type 'null'
{} is not of type 'boolean'
binste commented 1 year ago

I can see that it's a bit more convenient to just type .bin() and .stack() and as they are verbs one can read it as "telling" the chart to "bin" the data for that encoding channel. However, the same is not true for other methods and so I'm not sure if this mixes mental models and could make it confusing what these methods actually do. For example .legend does not add a legend, it is used to configure the legend option of the Color encoding channel.

Definitely good to align it to be consistent so either .option() sets that option to true for all methods or for none. Maybe it's also just me and this is not confusing. Curious to hear what you think.

mattijn commented 1 year ago

I think it's good to discuss this, but indeed to a wider scope. I had the same with aggregates, instead of alt.X("mean(foo)") to introduce something such as alt.X(field="foo").mean(), or alt.X().agg("mean")

joelostblom commented 1 year ago

However, the same is not true for other methods and so I'm not sure if this mixes mental models and could make it confusing what these methods actually do. For example .legend does not add a legend, it is used to configure the legend option of the Color encoding channel.

This is an interesting point, but I think it is unavoidable now that we enabled the method based syntax which essentially mixes methods that add something (bin, stack, impute...) with methods that configure the option of something that is already in the chart (legend, scale, axis) (and this was true for the parameters before too). It is probably good to keep this in mind, but I am not sure that there is any chance in the syntax of those methods that would make this distinction clearer. For example, changing .bin() to .bin(True) would still not have any analogue in .legend() because we would never need to do .legend(True) and the way to disable a legend is not with False but rather with None.

I am open to ideas here, and I agree the current state is a bit messy due to how VegaLite handles {} differently for different parameters. Here is how it looks for all the encoding methods we have currently

# Set the value to `{}`, which has no effect 
.axis()
.legend()
.scale()

# Set the value to `{}`, which has an effect 
.bin() Bin with the default number of bins
.impute() Seems to impute 0s?
.sort() Disables VLs automatic sorting and preservers original data order (we should probably change this)

# Set the value to `{}`, which raises an error
.aggregate()
.bandPosition()
.title()
.type()
.field()
.stack()
.timeUnit()

I had the same with aggregates, instead of alt.X("mean(foo)") to introduce something such as alt.X(field="foo").mean(), or alt.X().agg("mean")

I have been thinking about this too! Having a .agg method could be a concise way to aggregate while also showing a helpful doscstring containing all the aggregate options, which are now much harder to find since the user has to good to the documentation to see what is available. I think having the individual .mean etc methods might make it a bit more cluttered since there are so many, even if it would be nice to have that same syntax many people are used to from pandas (although agg('mean') is also valid pandas, just less common). If we go this route, we should consider having the timeunit transforms be part of the agg() method as well, similar to both can now be used inside the field string.