Better default axis number format

kanitw commented 8 years ago

SI seems to confuse users a lot!

domoritz commented 8 years ago

Maybe we shouldn't bother for the default but for voyager/polestar use scientific notation to avoid super long labels.

kanitw commented 7 years ago

[ ] Also make sure that bin ranges are not weird, as described in #1757

kanitw commented 7 years ago

[ ] Make sure this case works https://github.com/vega/vega-lite/issues/1459

jheer commented 7 years ago

OK, I just did some basic testing with Vega 3's default settings for linear, log and sqrt/pow scales. All have pretty reasonable defaults as-is (with no special formatting directive applied), but could be further refined in how "extreme" values are handled.

linear, sqrt, pow - These all use the same default setting, which generates a linear set of ticks with precision automatically configured by the scale domain. Problems arise when the numbers get very big or very small: the labels grow exceedingly long with leading or trailing zeros. These labels switch to exponential notation once values are greater than or equal to 1e+21.
log - This uses a special formatter that determines precision in a non-linear fashion. It automatically switches to using exponential labels once the digit precision exceeds a certain threshold: 1e-7 for small numbers, 1e+12 for large numbers.
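As a point of comparison for the thresholds above, plain JavaScript number-to-string conversion also switches to exponential notation at 1e+21 for large values and at 1e-7 for small powers of ten. The sketch below only demonstrates that built-in behavior. Whether Vega's inherited defaults ultimately fall back to this conversion is an assumption, not something stated in this thread.

```typescript
// Plain JavaScript number-to-string conversion, shown only as an analogy
// for the thresholds discussed above. This is NOT Vega or d3-format code.
const samples: number[] = [1e20, 1e21, 1e-6, 1e-7];

// String(v) uses the language's default conversion:
//   1e20 is written out in full, 1e21 switches to exponential notation;
//   1e-6 is written as a decimal, 1e-7 switches to exponential notation.
const labels: string[] = samples.map((v) => String(v));

console.log(labels);
```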

Questions:

For linear, etc should we institute similar policies as we use for log (or vice versa) and switch to exponential (scientific) notation for large/small numbers at the same threshold values?
What cutoffs should we use for switching notation? For example, the log formatter currently uses a default precision of 12 digits.
One other issue is that the formatters currently interleave formats: for example both 1,000,000 and 1e+12 might be visible simultaneously. Would we want the formatters to switch such that all labels use the same style? I could see arguments in both directions (stability vs. consistency), so would like to get other folks' opinions here.

cc @kanitw @domoritz @arvind

jheer commented 7 years ago

Also, I should note that the formatting for linear, etc is currently done using d3's number formatting. So these defaults are inherited. Changes/fixes of a general nature might require edits / pull requests to the d3-format repo, or additional wrapper code in Vega (as we currently do for log scales).

kanitw commented 7 years ago

Here are my quick opinions from thinking without experimenting:

What cutoffs should we use for switching notation? For example, the log formatter currently uses a default precision of 12 digits.

I think the cutoff should be at least 9, as I think a million or a hundred million (e.g., 100,000,000) is still readable. A billion or a hundred billion (e.g., 100,000,000,000) may also still be readable, so I guess a reasonable cutoff would be around 9-12. (21 is definitely too high.)

For linear, etc should we institute similar policies as we use for log (or vice versa) and switch to exponential (scientific) notation for large/small numbers at the same threshold values?

This sounds reasonable. (Is there any reason that concerns you about doing this?)

One other issue is that the formatters currently interleave formats: for example both 1,000,000 and 1e+12 might be visible simultaneously. Would we want the formatters to switch such that all labels use the same style? I could see arguments in both directions (stability vs. consistency), so would like to get other folks' opinions here.

Assuming stability means consistency among different plots that contain numbers with different levels of magnitude, I think consistency within the same plot is more important. For a plot with a non-linear scale, having both 1,000,000 and 1e+12 in the same plot would be a bit weird.

If users want consistent format among different plots, they can still explicitly set the format anyway.
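The "consistency within the same plot" preference could be sketched roughly as follows: if any tick reaches the cutoff, every label in that axis uses exponential notation, otherwise none do. The function name `formatTicks` and the default cutoff of 1e12 are hypothetical choices for illustration and are not part of Vega or d3-format. Digit grouping and small-magnitude handling are left out to keep the sketch short.

```typescript
// Hypothetical helper, not Vega API: format all ticks of one axis in a
// single style so that plain and exponential labels are never interleaved.
// The default cutoff (1e12) is an assumed value taken from the 9-12 digit
// range discussed above.
function formatTicks(ticks: number[], cutoff: number = 1e12): string[] {
  // Decide once per axis: does any tick reach the cutoff magnitude?
  const useExponent = ticks.some((t) => Math.abs(t) >= cutoff);

  return ticks.map((t) =>
    // Keep zero as "0" rather than "0e+0" in exponential mode.
    useExponent && t !== 0 ? t.toExponential() : String(t)
  );
}

// A log-like axis spanning many magnitudes gets uniform exponential labels.
console.log(formatTicks([0, 1e6, 1e12]));
// A smaller domain stays in plain notation.
console.log(formatTicks([0, 500000, 1000000]));
```

Users who want the same style across different plots would still set an explicit format, as noted above.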

Also, I should note that the formatting for linear, etc is currently done using d3's number formatting. So these defaults are inherited. Changes/fixes of a general nature might require edits / pull requests to the d3-format repo, or additional wrapper code in Vega (as we currently do for log scales).

A related question for vega-tooltip: if we have to wrap d3-format anyway, I wonder if we can have a single format method that Vega-Tooltip can call. (Currently, it calls d3-format directly, so the log format is already inconsistent with Vega.)

cc: @sirahd

domoritz commented 7 years ago

I agree that smarter formatting in Vega is the right approach going forward.

Two questions:

1) How do you specify "no formatting" in vega formatting expressions?

2) @kanitw suggested starting with SI units at 10e9 or higher. This can be problematic with x-axis labels:

Should we have special cases for different axes? Should the cutoff be customizable?

kanitw commented 7 years ago

@domoritz Great points. I believe this x-axis issue partly motivated why we used SI format in the past.

jheer commented 7 years ago

Thanks all for the input. Responding to @domoritz:

The "default formatting" option is used if no format parameter is included in the axis definition ("format": null should work similarly).
Good point regarding x-axis labels, though perhaps we want to treat this as a separate issue. We might want to first pick reasonable values regardless of space, then incorporate additional mechanisms for smarter space use. I agree that this would be nice to get "right"!

domoritz commented 7 years ago

Agree with all of the above. I guess instead of outputting "format": null, we can just not specify a format at all. For expressions, we can generate expr: 'format(datum["foo"], null)'.

jheer commented 7 years ago

Hmm, interesting point regarding expressions. Right now that uses a separate mechanism that directly calls d3-format and d3-time-format. Only the AxisTicks and LegendEntries operators in vega-encode involve specialized processing paths. Another item for me to look into...

jheer commented 7 years ago

Also, using the format expression function lies outside the context of a particular scale, and so cannot support any scale-specific formatting actions. So I'm not sure these should be included here.

domoritz commented 7 years ago

Do we plan to ship a smarter formatter in Vega 3 and remove the default format from Vega-Lite 2?

jheer commented 7 years ago

I think we should. The defaults in Vega 3 are already better defaults, even though there remain additional potential improvements.

domoritz commented 7 years ago

I agree. However, I'm a bit worried about cases like https://github.com/vega/vega-lite/issues/1539#issuecomment-303500863. Should we try to have a smarter default here or just rely on the users to provide a better format (e.g. s)?

kanitw commented 7 years ago

Ping @jheer -- I think we definitely should handle https://github.com/vega/vega-lite/issues/1539#issuecomment-303500863 somehow before changing the default format to undefined.

jheer commented 7 years ago

I have added a new labelOverlap option for axes that supports multiple strategies for automatic overlap removal. The default setting does nothing: no overlap removal is performed.

Other options:

true or "parity": Reduce overlap by attempting to remove every other label, repeating this process until overlap is removed or only two labels remain. If only two labels remain, the first and last label will be shown. This setting is appropriate for linear, sqrt, and pow scales.
"greedy": Reduce overlap by sequentially scanning the labels, removing any items whose bounds overlap the last visible label. This strategy works better than "parity" for log scales, though it is not "ideal" in that it does not actually reason about which ticks are the most useful for interpreting a log scale.

The Vega-Lite compiler can thus set the axis labelOverlap property to enable automatic overlap removal if desired. These modifications have been published in Vega sub-modules and will be included (as an undocumented soft launch) in the next Vega beta release.
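The two strategies described above can be sketched on a simplified one-dimensional model, where each label is just the pixel span it occupies along the axis. This is an illustration of the descriptions in this comment, not the actual Vega implementation. The `AxisLabel` type and the function names `hasOverlap`, `parity`, and `greedy` are hypothetical.

```typescript
// Hypothetical 1D model of a label: the pixel span it occupies on the axis.
// Labels are assumed to be sorted by position.
interface AxisLabel {
  text: string;
  start: number;
  end: number;
}

// True if any label begins before its predecessor ends.
function hasOverlap(labels: AxisLabel[]): boolean {
  for (let i = 1; i < labels.length; i++) {
    if (labels[i].start < labels[i - 1].end) return true;
  }
  return false;
}

// "parity": drop every other label until no overlap remains or only two
// labels are left; in the latter case show the first and last label.
function parity(labels: AxisLabel[]): AxisLabel[] {
  let visible = labels.slice();
  let thinned = false;
  while (visible.length > 2 && hasOverlap(visible)) {
    visible = visible.filter((_, i) => i % 2 === 0);
    thinned = true;
  }
  if (thinned && visible.length <= 2) {
    return [labels[0], labels[labels.length - 1]];
  }
  return visible;
}

// "greedy": scan in order, keeping a label only if it does not overlap
// the last label that was kept.
function greedy(labels: AxisLabel[]): AxisLabel[] {
  const visible: AxisLabel[] = [];
  for (const label of labels) {
    const last = visible[visible.length - 1];
    if (last === undefined || label.start >= last.end) {
      visible.push(label);
    }
  }
  return visible;
}
```

In this model, "parity" keeps evenly spaced survivors, which suits evenly spaced ticks, while "greedy" adapts to unevenly spaced ticks such as those on a log scale.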

Note that the default "parity" strategy could also be applied to band/point scales. However, if you switch to horizontally-oriented labels for ordinal scales I recommend using the limit encoding channel to limit the text according to the range step size.
