Closed ericemc3 closed 3 years ago
We did a lot of previous exploration on this topic; some of it is captured in #32 #106 #108. I’m supportive of continuing this effort, but I don’t think it’s easy. I have two fairly strong opinions here. First, it should be explicit and opt-in: you have to declare which mark channel is driving the sort order (or possibly which other scale, and perhaps in either case you have to specify how to aggregate multiple values). Second, we should avoid a string DSL like "-y" or "median(y)" because we want this to be extensible with JavaScript.
As a newcomer to Observable Plot (longtime user of ggplot2 and matplotlib/seaborn, light d3 user), I found it counter-intuitive that Plot auto-sorts the order of the categories in a (simple/basic) barchart, especially if in prior data-munging one has already sorted it a certain way.
Is it possible to turn off this auto-sort by default?
Great work by the way, I'm really looking forward to when you guys bake-in basic interactivity!
Thanks for pushing the question again.
First, it allowed me to discover a bug, which is that Plot.group shouldn't need to sort the groups. I've checked all the test plots and the only ones where removing this sort makes a visible difference are 2 stacked bar charts (in which the sort order could be given if we wanted the stacked groups to be sorted).
This means we can remove this line, and as a consequence it will become possible to introduce a syntax to turn off the categorical scales' auto-sorting, something like x:{sort: "input"}.
Another possibility that we could add is to sort by count, e.g. x:{sort: "count"}. This could be particularly useful for facet ordering, and we could even do something like fx: {top: 10} to retain only the 10 largest modalities of the facet. This is possible because, for facets, there is a unique dataset and channel to consider, explicitly defined in the facet:{} options.
However, I can't see a generic logic for sorting "x by y" without making a strong assumption about the data and channels. The assumption in question is made explicit when we write domain: d3.groupSort(data, v => d3.sum(v, y), d => d.x) or, in pseudo-code: domain(data, y, x). In other words, we can sort "x by y" if we identify a dataset, an x channel and a parallel y channel, and a reducer.
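To make the four ingredients (dataset, x channel, y channel, reducer) concrete, here is a minimal, dependency-free sketch of what such a domain computation does. The function name sortedDomain and the sample rows are hypothetical and only for illustration; the ascending order mirrors d3.groupSort's default behavior.

```javascript
// Dependency-free sketch of "sort x by y": group rows by x, reduce each
// group's y values, then order the x keys by the reduced value (ascending).
// `sortedDomain` and `sampleRows` are hypothetical names for illustration.
function sortedDomain(data, reduce, x, y) {
  const groups = new Map();
  for (const d of data) {
    const key = x(d);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(y(d));
  }
  return Array.from(groups, ([key, values]) => [key, reduce(values)])
    .sort((a, b) => a[1] - b[1])
    .map(([key]) => key);
}

// The reducer: here a plain sum of the y values in each group.
const sum = (values) => values.reduce((total, v) => total + v, 0);

const sampleRows = [
  {x: "a", y: 3},
  {x: "b", y: 1},
  {x: "a", y: 4},
  {x: "c", y: 5}
];

// Group sums are a = 7, b = 1, c = 5, so the ascending order is b, c, a.
const sampleDomain = sortedDomain(sampleRows, sum, (d) => d.x, (d) => d.y);
console.log(sampleDomain);
```

Swapping the reducer (a median instead of a sum, say) or negating the comparison changes the order without touching the rest, which is the extensibility a string DSL would make harder.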
When we look at simple bar chart examples like Plot.barY(data, {x, y}), the human eye clearly sees data, x, y:
However Plot 🤖 can't see that as clearly as we 🧠 do. For example, if we have two marks—say, Plot.tickY(["a", "z"]) + Plot.barY(letters, {x, y})—, which of ["a", "z"] and letters is the relevant data? I'm sure you as human 🧠 can see the difference in intent, but in JavaScript🤖?
Another difficult case is when you have, say, a confusion matrix: on the top row {a:80, b: 20}, on the bottom row {a: 40, b:60}. How would you sort Set{a, b}?
Plot as a foundation needs to be generic, so I think we can't go beyond the "ascending", "input" and "count" domain sorts that would receive the joint x channels of all the marks (in my toy example, this would be [ ["a", "z"], letters ], or flattened as ["a", "z", ...letters ]). (Note that "count" would not give us the count of values inside the groups, but the number of times each group appears — that is, for a generic bar chart: 1; unless we have faceting or series defined with z, fill or stroke.)
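As a rough sketch of those three generic strategies, the hypothetical helper below (ordinalDomain is not part of Plot's API) receives the x channels of all marks and derives a domain. The descending order for "count" and the tie-breaking by input order are assumptions made for this example.

```javascript
// Hypothetical helper, not a Plot function. `channels` is the list of
// x channels contributed by every mark, in mark order.
function ordinalDomain(channels, strategy) {
  const counts = new Map(); // a Map remembers first-seen (input) order
  for (const channel of channels) {
    for (const value of channel) {
      counts.set(value, (counts.get(value) || 0) + 1);
    }
  }
  const values = Array.from(counts.keys());
  switch (strategy) {
    case "input":
      return values;
    case "ascending":
      return values.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
    case "count":
      // Assumption: most frequent first; ties keep input order (stable sort).
      return values.sort((a, b) => counts.get(b) - counts.get(a));
    default:
      throw new Error(`unknown strategy: ${strategy}`);
  }
}

// Toy example: a tick mark on ["a", "z"] plus a bar mark on a few letters.
const toyChannels = [["a", "z"], ["e", "t", "a"]];
const byInput = ordinalDomain(toyChannels, "input");         // a, z, e, t
const byAscending = ordinalDomain(toyChannels, "ascending"); // a, e, t, z
const byCount = ordinalDomain(toyChannels, "count");         // a (2 uses), then z, e, t
```

Note that "count" here counts appearances across channels, not the y values inside each group, so for a single plain bar chart every value would tie at 1.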
In parallel or on top of that, it would probably make sense to build a "histogram" function (which can be developed initially as a plugin), that makes opinionated assumptions for common use cases. Maybe it could have a "main" data object, x and y, and as such have a syntax very similar to Plot.barY(data, {x, y}).plot(), and specific options that indicate which domain definition strategy to use.
Another thing we could improve is the syntax of d3.groupSort(data, v => d3.sum(v, y), d => d.x), with some sugar to give it a Plot flavor (field accessors, a function exposed as Plot.groupSort, etc).
PS: I hope my demonstration is flawed, because a generic "sort x by y" would be super useful. Please someone prove me wrong.
For example, if we have two marks—say, Plot.tickY(["a", "z"]) + Plot.barY(letters, {x, y})—, which of ["a", "z"] and letters is the relevant data?
I would select the largest domain, in this case letters, as the base one, and test the other ones: either they are a subset of the largest, or you can insert their values with a bisector operation, keeping the largest domain's order unchanged?
I can't visualize your row {a:80, b: 20} / row {a: 40, b:60} example, could you please elaborate?
OK, thanks for the image. Obviously faceting is incompatible with a descending y sort, since you share the x axis (and domain order) between all facets. So I guess a sort('-y') in a facet context should just be ignored.
One could say that the confusion matrix is not much different if presented as facets or as stacked bars, so we could compare the totals. I guess my point with this example is not to give a straight answer, but to show that if we can argue in favor of a particular heuristic for each new case, it's hard to make a system—having the explicit groupSort definition has the merit of being unambiguous. And as I said earlier common patterns (such as "a histogram ordered by y and retaining the top n modalities of x") could be turned into a specific function, plugin, or code snippet.
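One such code snippet might look like the sketch below, which orders x by the summed y and keeps only the top n modalities. The helper name topModalities, the choice of a sum as the reducer, and the sample rows are all assumptions made for illustration; this is not an existing Plot or D3 function.

```javascript
// Hypothetical helper: returns the n values of field `x` with the largest
// summed `y`, in descending order of that total.
function topModalities(rows, x, y, n) {
  const totals = new Map();
  for (const row of rows) {
    totals.set(row[x], (totals.get(row[x]) || 0) + row[y]);
  }
  return Array.from(totals)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([key]) => key);
}

const letterRows = [
  {letter: "e", count: 12},
  {letter: "z", count: 1},
  {letter: "t", count: 9},
  {letter: "a", count: 8}
];

// Totals are e = 12, t = 9, a = 8, z = 1, so the top two are e and t.
const topTwo = topModalities(letterRows, "letter", "count", 2);
console.log(topTwo);
```

The resulting array could then be passed as an explicit scale domain, which keeps the choice of dataset, channels and reducer unambiguous.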
I understand, d3.groupSort or d3.sort is fine for JS and d3 experts as we are, but I still think that this kind of complexity for a basic ordered bar chart could put off many newcomers, especially those accustomed to the (relative) simplicity of ggplot2 and Vega-Lite. It is really a matter of strategic positioning: is Plot intended for D3 and JS experts, whom it will definitely help save a lot of time, or also for a wider audience?
Hi @Fil thanks for your very detailed response!
If I understand you properly, it seems that faceting will be an issue if auto-sort is turned off? Is there any reason why Plot would need to sort the categorical channel in the first place? I would think it would be more generic to leave the data munging to the user and just plot what is given?
PS: Love your work over at observablehq.com!
Here are examples that "group X by Y", so we can think about the syntax: https://observablehq.com/@fil/plot-group-x-by-y The first one introduces a function domainY that emulates the syntax of Plot.barY, and thus feels "more plot-like". The second one wraps the first in a cheeky histogram() function. The third one is the groupSort syntax.
Pull request #414 addresses the "input order" part of this discussion.
It also allows ordering the domain of an ordinal scale x or fx by the number of marks that make use of each element of the domain. It works well to sort facets, or an ordinal scale for a Plot.dot chart (but it does not solve the "sort x by y" for bar charts).
Please review #442 and test it at https://observablehq.com/@fil/order-x-by-y-442
Hi,
Circling back to this, I'm not sure the original issue was fully solved?
I think my basic question is: if the data being passed to Plot.plot is already sorted in a certain order, why does the function then sort it again?
See this notebook: https://observablehq.com/d/9fabea8a02470141
Basically, I have to specify the full domain of that axis in the relevant scale options to get it sorted as it appears in the original dataset, which seems counter-intuitive, especially if the data has been pre-munged with a specific sort order in mind.
This is the default sort. You can indicate in the mark that you want a null domain sort:
Plot.barY(letterTable, { x: "letter", y: "count", sort: {x: null}})
And to add to what @Fil said, the reason we do this is that you can have multiple marks and multiple channels bound to the x scale, so if we were to take the input order by default, the order of the x scale domain would be dependent on the order of marks (and potentially the arbitrary order in which a mark’s channels are listed, e.g. in the case of a mark having x1 and x2 channels). So by adding sort: {x: null} to the mark in question, you’re being more explicit about which mark and channel should be used to determine the domain order. For example, if you added a rule as annotation before the barY, you probably wouldn’t want that to change the domain!
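The dependence on mark order can be illustrated with a small plain-JavaScript sketch. It does not reproduce Plot's internals; inputOrderDomain and the sample channels are hypothetical and simply take the union of x values in first-seen order.

```javascript
// Hypothetical illustration: union of x values in first-seen order, given
// the x channels of each mark in the order the marks are declared.
function inputOrderDomain(markChannels) {
  return Array.from(new Set(markChannels.flat()));
}

const ruleX = ["c"];          // e.g. an annotation on a single category
const barX = ["b", "a", "c"]; // bar data, pre-sorted by the user

// With input order as the default, declaring the rule first changes the domain.
const ruleFirst = inputOrderDomain([ruleX, barX]); // c, b, a
const barFirst = inputOrderDomain([barX, ruleX]);  // b, a, c

// A sorted default gives the same domain whichever mark comes first.
const sortedRuleFirst = [...ruleFirst].sort(); // a, b, c
const sortedBarFirst = [...barFirst].sort();   // a, b, c
```

This is why an opt-in on the specific mark is less ambiguous than a global "use input order" switch.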
Thanks for the prompt and detailed explanation!!
It is common to have to sort marks according to another channel: vertical or horizontal bars sorted by decreasing size for example.
Vega-Lite offers a very simple syntax to do this:
sort("-y"), to sort for example markBars according to channel y, and ggplot2 offers a similar syntax (x = reorder(...)).
With Plot, it's more complex, especially in the context of grouping, with domain specifications via d3.groupSort() that require a fine-grained knowledge of these advanced D3 functions, and I'd rather avoid this kind of dependency on D3.
I would therefore suggest a syntax like this, for a barY mark, for example:
or
And, more advanced (for instance, sorting facets according to the median of each group):
PS: Plot is a great library, very intellectually stimulating, and the 21 beautifully written articles in the https://observablehq.com/collection/@observablehq/plot collection should be read by all data scientists!