vega / vega

A visualization grammar.
https://vega.github.io/vega
BSD 3-Clause "New" or "Revised" License

Make `sort.op` optional #1321

Open palewire opened 6 years ago

palewire commented 6 years ago

Crossposting from vega/vega-lite#1489

Why is the op keyword on sort required?

Here's an example Jupyter Notebook using Vega via Altair where I see the requirement causing a problem.

The user has a simple table with one nominal value and one quantitative value per row, in this case the median income of each county in the United States.

Each row is encoded into a bar on a chart. The user would like to sort the nominal bars using the y-axis' quantitative value with no transformation or aggregation.

To the user in that circumstance, it seems extraneous to have to submit any operation at all.


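import altair as alt

# df: a pandas DataFrame with one row per county, holding the county
# name ("name") and its median household income ("b19013001").
alt.Chart(df, title="Median household income of U.S. counties").mark_bar().encode(
    x=alt.X(
        "name:N",
        axis=alt.Axis(labels=False, title="", ticks=False),
        # Sort the bars by the same field that is plotted on the y-axis.
        sort=alt.SortField(
            field='b19013001',
            op='sum',  # <-- Why is this necessary?
            order="descending"
        )
    ),
    y=alt.Y(
        "b19013001:Q",
        axis=alt.Axis(title="", format="$s", ticks=False)
    )
).properties(width=620)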

If the chart is not aggregated, why should the user have to specify an aggregation?

Am I crazy to think that a sensible default would be that, if no aggregation function is provided, Vega should assume there is a 1:1 relationship between the axis and the sort, perhaps raising an error if there isn't?

jheer commented 6 years ago

There has to be some aggregate here, as there is no guarantee that there are not multiple records for values in the scale domain. So the question becomes: should there be a default aggregate operation and if so what should it be? Min? Max? Something else?
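To illustrate the point about multiple records per domain value, here is a minimal Python sketch. It is not Vega's implementation: the `sort_domain` helper, the `OPS` table, and the sample records are all hypothetical. It groups records by the domain field, aggregates the sort field, and orders the domain by the result.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lookup of aggregate operations (names chosen for illustration).
OPS = {"sum": sum, "min": min, "max": max, "mean": mean}

def sort_domain(records, domain_field, sort_field, op, descending=True):
    """Order the distinct domain values by an aggregate of sort_field."""
    groups = defaultdict(list)
    for r in records:
        groups[r[domain_field]].append(r[sort_field])
    agg = {key: OPS[op](values) for key, values in groups.items()}
    return sorted(agg, key=agg.get, reverse=descending)

# One record per name: every op produces the same order.
one_to_one = [
    {"name": "A", "value": 30},
    {"name": "B", "value": 50},
    {"name": "C", "value": 40},
]

# A second record for "A": the order now depends on the op.
duplicated = one_to_one + [{"name": "A", "value": 45}]

for op in OPS:
    print(op, sort_domain(one_to_one, "name", "value", op),
          sort_domain(duplicated, "name", "value", op))
```

With one record per name the choice of op makes no difference. Once a name appears twice, "sum" and "max" already disagree on the order, which is why some aggregate has to be chosen.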

The "virtue" of including the op is that (1) it makes it clear what is being done, hopefully preventing future confusion at the cost of some upfront learning, (2) while options like "sum" and "average" are only applicable to numeric values (and so not suitable as default operations), they permit more efficient streaming operations than "min" or "max", thus enabling more performant visualizations. (To deal with possible value removals, min/max must keep a list of all data records seen.)
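The streaming argument can be sketched in Python as follows. This is an illustration under assumptions, not Vega's code: `StreamingSum` and `StreamingMin` are hypothetical classes. A sum can absorb a removal by subtracting, so it needs one number of state, whereas a min has to remember every live value in case the current minimum is removed.

```python
from collections import Counter

class StreamingSum:
    """Sum under insertions and removals: a single running total suffices."""
    def __init__(self):
        self.total = 0

    def add(self, v):
        self.total += v

    def remove(self, v):
        self.total -= v

    def value(self):
        return self.total

class StreamingMin:
    """Min under removals: every live value must be kept (here as a multiset)."""
    def __init__(self):
        self.values = Counter()

    def add(self, v):
        self.values[v] += 1

    def remove(self, v):
        self.values[v] -= 1
        if self.values[v] <= 0:
            del self.values[v]

    def value(self):
        # Scans the remaining distinct values to find the new minimum.
        return min(self.values)

s, m = StreamingSum(), StreamingMin()
for v in (5, 2, 9):
    s.add(v)
    m.add(v)

# Removing the current minimum: the sum adjusts in O(1),
# the min has to fall back on the values it retained.
s.remove(2)
m.remove(2)
print(s.value(), m.value())
```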

As a result of the above I'm inclined to keep the design as-is, though I welcome more discussion. Another option is for Vega-Lite to make its own decision here, and keep Vega as-is regardless.

palewire commented 6 years ago

Would it be possible to make the argument optional, and then raise an error if no aggregate is provided and there is not a 1:1 relationship?
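A rough Python sketch of what that proposal could look like, purely as an illustration: `sort_domain_strict` is a hypothetical helper, not an existing Vega or Altair function. It sorts on the raw value when each domain value has exactly one record and raises an error otherwise.

```python
from collections import Counter

def sort_domain_strict(records, domain_field, sort_field, descending=True):
    """Sort domain values by sort_field, refusing input that is not 1:1."""
    counts = Counter(r[domain_field] for r in records)
    dupes = sorted(key for key, n in counts.items() if n > 1)
    if dupes:
        # Without an aggregate op the order would be ambiguous.
        raise ValueError(
            f"sort needs an aggregate op: multiple records for {dupes}"
        )
    ordered = sorted(records, key=lambda r: r[sort_field], reverse=descending)
    return [r[domain_field] for r in ordered]

rows = [
    {"name": "A", "value": 30},
    {"name": "B", "value": 50},
    {"name": "C", "value": 40},
]
print(sort_domain_strict(rows, "name", "value"))

try:
    sort_domain_strict(rows + [{"name": "A", "value": 45}], "name", "value")
except ValueError as err:
    print(err)
```

Note that the check only fails once the data is seen, which means the error would surface at run time rather than when the chart is specified.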

jheer commented 6 years ago

There is a specialized pipeline in place for building out scale domains, including cached data structures that are reused to optimize processing. It might be possible to change this, but it would require significant effort and testing. I'm not sure it is warranted given the nature of the issue. Moreover, I'm unconvinced that trading off a design-time issue for a run-time error is ultimately a good move.

palewire commented 6 years ago

Alrighty. Everything is a trade-off, of course.

It might be unavoidable, but I think you can expect this to be a slight irritation to upstream users with the use case sketched out above. Perhaps more thorough documentation can help ease that pain.

I also expect it will be difficult to settle on a default aggregate operation that will feel sensible to all users. So you might consider requiring users to choose an input so that they don't unintentionally make inaccurate charts.

CMCDragonkai commented 6 years ago

The documentation should mention that if you just want to do a simple sort on some quantitative value, just use the sum aggregate.
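For reference, a sketch of what such a documentation example might show, written as a Python dict that serializes to a Vega scale definition. It assumes the scale `domain.sort` object takes `field`, `op`, and `order` as discussed in this thread; the data set name "table" and the scale name "xscale" are placeholders, and the field names are borrowed from the Altair example above.

```python
import json

# Hypothetical scale definition for illustration only.
x_scale = {
    "name": "xscale",       # placeholder scale name
    "type": "band",
    "range": "width",
    "domain": {
        "data": "table",    # placeholder data set name
        "field": "name",
        # With one record per name, "sum" returns the value itself,
        # so this is a plain sort on the quantitative field.
        "sort": {"field": "b19013001", "op": "sum", "order": "descending"},
    },
}

print(json.dumps(x_scale, indent=2))
```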