FWIW, a person can already write `min(d), quantiles(d), max(d)` and have most of what they need. Other aspects of box-and-whiskers aren't standard (dots for extreme values and whiskers at the first or second percentiles).
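For concreteness, that combination might look like the sketch below. The sample list `d` is made up for illustration. `statistics.quantiles()` defaults to `n=4` and `method='exclusive'`, so it returns three cut points (Q1, median, Q3); because it interpolates, its quartiles need not agree with the calculator values quoted in the issue body (2 and 5 for the same data).

```python
from statistics import quantiles

d = [1, 2, 3, 4, 5, 6]  # sample data, chosen only for illustration

# With the default n=4, quantiles() returns three cut points: Q1, median, Q3.
q1, med, q3 = quantiles(d)

summary = (min(d), q1, med, q3, max(d))
print(summary)  # (1, 1.75, 3.5, 5.25, 6)
```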
@stevendaprano At some point it would be great if you were to write down some guidance on what is in scope for this module; otherwise, it will fill with clutter. Right now, the module is reasonably well focused on core everyday analytical or descriptive stats and does not venture into plotting.
Feature or enhancement
A function in the `statistics` module that computes and returns the components of a box and whisker plot: minimum, first quartile, median, third quartile, maximum.
Pitch
A box and whisker plot is a very common way of summarizing data. Not only is it taught in schools, but it is quite standard for graphing and scientific calculators to implement them (returning 5 numbers and/or an actual plot).
The `statistics` module, "aimed at the level of graphing and scientific calculators", would be a perfect place for such a function.
Possible Implementation
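Given a sequence of numbers, calculators (TI, Casio, "1-Var Stats" functions) typically employ the following method: the data is sorted and the median is computed (as in `statistics.median()`); the results below are consistent with taking Q1 and Q3 as the medians of the lower and upper halves, with the middle value left out of both halves when the length is odd.
If data is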
- `[1]`: my Casio returns `min = q1 = median = q3 = max = 1`.
- `[1, 2]`: my Casio returns `min = q1 = 1`, `median = 1.5`, and `q3 = max = 2`.
- `[1, 2, 3]`: my Casio returns `min = 1`, `q1 = 1`, `median = 2`, `q3 = 3`, `max = 3`.
- `[1, 2, 3, 4]`: my Casio returns `min = 1`, `q1 = 1.5`, `median = 2.5`, `q3 = 3.5`, `max = 4`.
- `[1, 2, 3, 4, 5]`: my Casio returns `min = 1`, `q1 = 1.5`, `median = 3`, `q3 = 4.5`, `max = 5`.
- `[1, 2, 3, 4, 5, 6]`: my Casio returns `min = 1`, `q1 = 2`, `median = 3.5`, `q3 = 5`, `max = 6`.
This implements the aforementioned using `statistics.median()` as a basis:
A few examples
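The code from the original post is not preserved in this copy. As a stand-in, here is a minimal sketch (not the author's implementation) of a function that reproduces the calculator outputs listed above. The name `five_number_summary` is hypothetical, and the rule used for Q1 and Q3 (medians of the lower and upper halves, skipping the middle value for odd lengths) is an assumption inferred from those outputs.

```python
from statistics import StatisticsError, median


def five_number_summary(data):
    """Return (min, q1, median, q3, max) using calculator-style quartiles.

    Hypothetical helper: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data; the middle value is excluded from both
    halves when the length is odd.
    """
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise StatisticsError("no summary for empty data")
    if n == 1:
        # Both halves would be empty, so every statistic is the lone value.
        only = values[0]
        return (only, only, only, only, only)

    half = n // 2
    lower = values[:half]
    # For odd n, n % 2 == 1 skips the middle element.
    upper = values[half + n % 2:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


# The calculator results quoted above:
print(five_number_summary([1]))                 # (1, 1, 1, 1, 1)
print(five_number_summary([1, 2]))              # (1, 1, 1.5, 2, 2)
print(five_number_summary([1, 2, 3]))           # (1, 1, 2, 3, 3)
print(five_number_summary([1, 2, 3, 4]))        # (1, 1.5, 2.5, 3.5, 4)
print(five_number_summary([1, 2, 3, 4, 5]))     # (1, 1.5, 3, 4.5, 5)
print(five_number_summary([1, 2, 3, 4, 5, 6]))  # (1, 2, 3.5, 5, 6)
```

Sorting inside the function mirrors what `statistics.median()` already does, so unsorted input gives the same result.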
Edit: A couple of thoughts secondary to the above. Currently, the min, Q1, median, Q3, and max are returned. It feels very wrong to do so, but if the function is going to sort the data (as the current `median` implementation does), it may be prudent to allow users to (1) specify whether the data is already sorted via some flag argument that defaults to `False`, and (2) specify whether they also want a reference to the sorted data returned. It is not uncommon, particularly in the context of box plots, for people to come up with arbitrary definitions of what is considered an outlier. For example, data points greater than `Q3 + 1.5 * (Q3 - Q1)` or less than `Q1 - 1.5 * (Q3 - Q1)` are sometimes considered outliers, and these barriers are often used for the whiskers of box plots. If they also have the sorted data, they can use the information returned to further filter their data based on their own definitions of outliers. This extra functionality involving additional flag arguments may be a little too specialized.
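To illustrate the 1.5 × IQR rule mentioned here, the sketch below computes the two barriers from given quartiles and splits already-sorted data around them. The helper names `iqr_fences` and `split_outliers`, the multiplier parameter `k`, and the sample data are all hypothetical; the quartiles are passed in by hand rather than produced by any existing `statistics` function.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) barriers Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(sorted_data, q1, q3, k=1.5):
    """Split data into (inliers, outliers) relative to the IQR barriers."""
    lower, upper = iqr_fences(q1, q3, k)
    inliers = [x for x in sorted_data if lower <= x <= upper]
    outliers = [x for x in sorted_data if x < lower or x > upper]
    return inliers, outliers


data = [1, 2, 3, 4, 5, 6, 40]  # made-up, already sorted
q1, q3 = 2, 6                  # calculator-style quartiles for this data

print(iqr_fences(q1, q3))      # (-4.0, 12.0)
inliers, outliers = split_outliers(data, q1, q3)
print(outliers)                # [40]
```

Keeping `k` as a parameter reflects the point that the outlier definition is a convention the caller chooses, not something fixed.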