vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License
9.26k stars, 793 forks

ENH: SortField shorthand #884

Closed palewire closed 9 months ago

palewire commented 6 years ago

Yesterday, a colleague asked me how to control the sort order of bars in a chart. I developed this example (https://github.com/datadesk/altair-column-sort-example/blob/master/notebook.ipynb) to show him how.

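[image: download] https://user-images.githubusercontent.com/9993/40367924-5f658f52-5d8f-11e8-94b2-6a46af67b80c.png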

alt.Chart(df, title="Median household income of U.S. counties").mark_bar().encode(
    x=alt.X(
        "name:N",
        axis=alt.Axis(labels=False, title="", ticks=False),
        # Here's where you can re-sort the order of the columns on the x-axis
        sort=alt.SortField(
            # This SortField class requires at least three inputs,
            # which does seem like overkill. I'd like to see a simpler
            # way to pull this off.
            field='b19013001',  # First the field you want to sort on 
            op='sum',  # Then the operation to run on that field. In this case, we just total the value.
            order="descending"  # Finally, the order to sort.
        )
    ),
    y=alt.Y(
        "b19013001:Q",
        axis=alt.Axis(title="", format="$s", ticks=False)
    )
).properties(width=620) 

It works great but, IMHO, the SortField requirement with three inputs, including a "fake" op that in this case does not appear to be necessary, is asking a lot of beginners. And I'd like to think something more convenient could also benefit experts.

I know nothing about the internals of this feature, but I'm curious if the sort channel could somehow benefit from a shorthand, much like the x and y channels.

In my imagination, something like this:

sort=alt.SortField(field="b19013001", op="sum", order="descending")

Could be submitted like this, with the field and operation handled much like the other shorthand features, and the descending order of the sort handled in the same style as the order_by method (https://docs.djangoproject.com/en/2.0/ref/models/querysets/#order-by) of the popular Django framework:

sort="-sum(b19013001)"

I'm guessing you can easily imagine the other permutations in this kind of scheme. Additionally, in cases where the dataframe is not grouped during encoding, it seems to me that providing the op argument should be, 🥁, optional. That would mean that if a field were to be used as the sort in ascending order with no aggregation, the shorthand submission could be as simple as:

sort="b19013001"
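
To make the proposal concrete, here is a minimal sketch of how such a string could be turned into SortField-style keyword arguments. The function name `parse_sort_shorthand` and the regular expression are hypothetical and exist only for illustration; nothing like this is part of Altair.

```python
import re

# Hypothetical grammar for the proposed shorthand (not part of Altair):
# an optional "-" prefix for descending order, then either "op(field)"
# or a bare field name.
_SORT_SHORTHAND = re.compile(
    r"^(?P<desc>-)?"
    r"(?:(?P<op>[A-Za-z_]\w*)\((?P<agg_field>[^()]+)\)|(?P<field>[^()]+))$"
)


def parse_sort_shorthand(shorthand):
    """Convert a string such as "-sum(b19013001)" into keyword arguments."""
    match = _SORT_SHORTHAND.match(shorthand.strip())
    if match is None:
        raise ValueError(f"Unrecognized sort shorthand: {shorthand!r}")
    kwargs = {
        "field": match.group("agg_field") or match.group("field"),
        "order": "descending" if match.group("desc") else "ascending",
    }
    # Only include "op" when the shorthand names an aggregation.
    if match.group("op"):
        kwargs["op"] = match.group("op")
    return kwargs


print(parse_sort_shorthand("-sum(b19013001)"))
print(parse_sort_shorthand("b19013001"))
```

The resulting dictionary could then be unpacked into the sort class, for example `alt.SortField(**parse_sort_shorthand("-sum(b19013001)"))`. The bare-field form leaves out `op`, so it would only be usable if `op` were optional, which is exactly the second half of the request.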

What do you think? If something like this already exists and I'm simply ignorant of it I will accept writing the documentation as my punishment.

ellisonbg commented 6 years ago

In teaching Altair, this question comes up a lot. I like your solution and I think it is consistent with the existing shorthand.

JoeGermuska commented 6 years ago

As the aforementioned colleague, I'll start by admitting that I'm an Altair newb. I still haven't wrapped my head around the values/function of the op parameter. As I told @palewire , I'm having trouble intuiting why I would choose 'sum'.

That said, the syntax Ben suggests seems clear and concise, especially if the op becomes optional in common/simple cases.

jakevdp commented 6 years ago

Thanks for bringing this up... I agree that the grammar is a bit complicated in this case. Perhaps it would make sense to raise an issue in Vega-Lite and recommend that op be made an optional argument?

Regarding adding new shorthand parsing... I'm a bit wary of that, because every extra piece of logic that we add on top of the schema is one more thing that can (and will) break during a future vega-lite update. Do you think that making op optional within alt.SortField would do enough to clarify things for users?

kanitw commented 6 years ago

We thought about this before, but it is unclear what's a reasonable default. If you think this should be done, feel free to discuss more in https://github.com/vega/vega-lite/issues/1489.

palewire commented 6 years ago

I respect your reticence to venture too far away from Vega, but I'm curious how you properly judge the different opportunities to introduce shorthand.

My novice understanding of Altair leads me to believe there are some cases where this has been done as a convenience to users. Is there a list of them anywhere?

jakevdp commented 6 years ago

Currently the only place such shorthands have been introduced is in the encode() method, and in a couple of the transform_*() methods.

palewire commented 6 years ago

Do you see all of the x and y kwargs other than field and type being off limits to shorthand?

jakevdp commented 6 years ago

I wouldn't say they're off-limits... I'd just say we need to think carefully about where to draw the line on what parts of Altair exactly mirror the Vega-Lite API and what parts diverge.

Just for background: the way the shorthand expressions work is:

  1. Subclass all encoding channel classes
  2. Add an attribute that is invalid according to its schema
  3. Specialize the to_dict() method so that it detects the presence of this attribute, removes it, and interprets its contents into a form that is valid according to the schema (in this case, populating the field, type, aggregate, and timeUnit attributes).

This customized code depends on the details of the schema, and so when the schema is updated the details of these modifications have to be updated as well. For example, the Vega-Lite version 1 and Vega-Lite version 2 schemas were so different that it required essentially rewriting the code from scratch, which all told took about 8 months to really get correct. Along the way, I dropped a number of other API shortcuts we had created earlier because I saw how unmaintainable they were when it came to schema updates.
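
The three steps listed above can be sketched roughly as follows. This is a simplified illustration and not Altair's actual implementation: the `Channel` and `ShorthandChannel` classes, the `_parse_shorthand` helper, and the set of valid attributes are all hypothetical stand-ins, and timeUnit parsing is left out for brevity.

```python
TYPE_CODES = {"Q": "quantitative", "N": "nominal", "O": "ordinal", "T": "temporal"}


class Channel:
    """Hypothetical stand-in for a schema-generated channel class."""

    _valid = {"field", "type", "aggregate", "timeUnit"}

    def __init__(self, **attrs):
        self._attrs = attrs

    def to_dict(self):
        # Mimic schema validation: reject anything the schema does not know.
        unknown = set(self._attrs) - self._valid
        if unknown:
            raise ValueError(f"Invalid attributes: {sorted(unknown)}")
        return dict(self._attrs)


def _parse_shorthand(shorthand):
    """Parse strings like "name:N" or "sum(b19013001):Q" (hypothetical helper)."""
    parsed = {}
    name, sep, code = shorthand.rpartition(":")
    if sep and code in TYPE_CODES:
        parsed["type"] = TYPE_CODES[code]
    else:
        name = shorthand
    if name.endswith(")") and "(" in name:
        aggregate, _, name = name[:-1].partition("(")
        parsed["aggregate"] = aggregate
    parsed["field"] = name
    return parsed


class ShorthandChannel(Channel):  # Step 1: subclass the channel class.
    def __init__(self, shorthand=None, **attrs):
        # Step 2: store an attribute that the schema itself would reject.
        if shorthand is not None:
            attrs["shorthand"] = shorthand
        super().__init__(**attrs)

    def to_dict(self):
        # Step 3: detect the attribute, remove it, and expand it into
        # schema-valid attributes before validating.
        attrs = dict(self._attrs)
        shorthand = attrs.pop("shorthand", None)
        if shorthand is not None:
            attrs = {**_parse_shorthand(shorthand), **attrs}
        return Channel(**attrs).to_dict()


print(ShorthandChannel("sum(b19013001):Q").to_dict())
```

Even in this toy version, the parsing step has to know which attribute names the schema accepts, which hints at why a schema change forces the shorthand code to change with it.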

I think overall it's good to have those encoding shorthands available at the top level of the encoding... it's something that's used in basically every chart, and so the added maintenance burden is worth it. For any other API changes that require circumventing the grammar of the Vega-Lite schema, I want to make sure we're carefully weighing the benefit to users vs the costs of the new maintenance burdens they create.

So no, nothing's off-limits per se, but there's a lot to keep in mind when making these kinds of decisions.

palewire commented 6 years ago

I see your point. Thanks for explaining it all for me.

Since the shorthand is so useful, I wonder if it's worth considering if Altair should develop some kind of modular framework within itself for the system.

Do you think it would be possible to abstract back the existing hassle of adding new shorthands to something more literate, extensible and maintainable?

jakevdp commented 6 years ago

Maybe... my best attempt at making it modular is here, in the code generation tools, where we automatically generate wrappers for schema objects for which we want to modify the default behavior: https://github.com/altair-viz/altair/blob/master/tools/generate_schema_wrapper.py#L245-L293

There's a lot in there that is "hard-coded", so when the schema changes it takes a bit of hunting to figure out why things aren't working any more.

jakevdp commented 5 years ago

Partly addressed in Altair 3, where the aggregate becomes optional.

I still think it may be useful to allow a shorter syntax, like sort='column' rather than sort=alt.EncodingSortField('column')

kanitw commented 5 years ago

> Maybe... my best attempt at making it modular is here, in the code generation tools, where we automatically generate wrappers for schema objects for which we want to modify the default behavior: https://github.com/altair-viz/altair/blob/master/tools/generate_schema_wrapper.py#L245-L293
>
> There's a lot in there that is "hard-coded", so when the schema changes it takes a bit of hunting to figure out why things aren't working any more.

I think it's worth knowing where Altair still diverges from Vega-Lite, so we can revise our defaults, especially for the upcoming VL4.

> I still think it may be useful to allow a shorter syntax, like sort='column' rather than sort=alt.EncodingSortField('column')

Yep, I have an issue that you can upvote in VL here: https://github.com/vega/vega-lite/issues/4933.

joelostblom commented 9 months ago

It is now possible to do .sort(field='column'), which is quite convenient, so I'm closing this issue.

import altair as alt
from vega_datasets import data

source = data.barley()[:5]

alt.Chart(source).mark_bar().encode(
    x='yield',
    y=alt.Y('site').sort(field='yield')
)