python / cpython

The Python programming language
https://www.python.org

statistics module: box and whisker 'plot' function #92779

Closed by ghost 2 years ago

ghost commented 2 years ago

Feature or enhancement

A function in the statistics module that computes and returns the components of a box and whisker plot: minimum, first quartile, median, third quartile, maximum.

Pitch

A box and whisker plot is a very common way of summarizing data. Not only is it taught in schools, but it is quite standard for graphing and scientific calculators to implement them (returning 5 numbers and/or an actual plot).

The statistics module, "aimed at the level of graphing and scientific calculators", would be a perfect place for such a function.

Possible Implementation

Given a sequence of numbers, calculators (TI, Casio, "1-Var Stats" functions) typically employ the following method:

  1. calculate the median (where median is defined in the same way as that used by statistics.median());
  2. divide the sorted data into two halves by position; the lower half is the set of all points before the median and the upper half is the set of all points after it (when the number of points is odd, the middle point itself belongs to neither half);
  3. the first quartile is the median of the lower half and the third quartile is the median of the upper half.
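
The three steps above can be sketched directly on top of statistics.median(). This is only a minimal illustration, not the proposed implementation: the name five_number_summary is hypothetical, and the split is assumed to be positional on the sorted data, with the middle point left out of both halves when the count is odd.

```python
from statistics import StatisticsError, median


def five_number_summary(data):
    """Hypothetical helper: (min, Q1, median, Q3, max) via the three steps."""
    xs = sorted(data)
    n = len(xs)
    if n == 0:
        raise StatisticsError("five_number_summary() requires at least one data point")
    if n == 1:
        # Both halves would be empty, so every component is the single point.
        return (xs[0],) * 5

    half = n // 2
    lower = xs[:half]      # points before the median position
    upper = xs[n - half:]  # points after the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]
```

Calling median() three times keeps the sketch close to the description, at the cost of re-sorting the halves that an index-based version avoids.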

If data is

The following implements the method above, using the same definition of the median as statistics.median():

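from statistics import StatisticsError

def box(data):
    """Return (minimum, Q1, median, Q3, maximum) for data."""
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise StatisticsError("box() empty data")
    if n == 1:
        return (data[0],) * 5

    i = n // 2  # size of each half (and index of the middle point when n is odd)
    j = i // 2  # index of the middle of the lower half
    if n % 2 == 1:
        # Odd n: the middle point is the median and belongs to neither half.
        median = data[i]
        if i % 2 == 1:
            # Odd-sized halves: each quartile is a single point.
            q1 = data[j]
            q3 = data[i + j + 1]
        else:
            # Even-sized halves: each quartile averages the two middle points.
            q1 = (data[j - 1] + data[j]) / 2
            q3 = (data[i + j] + data[i + j + 1]) / 2
    else:
        # Even n: the median averages the two middle points; the halves split evenly.
        median = (data[i - 1] + data[i]) / 2
        if i % 2 == 1:
            q1 = data[j]
            q3 = data[i + j]
        else:
            q1 = (data[j - 1] + data[j]) / 2
            q3 = (data[i + j - 1] + data[i + j]) / 2

    return data[0], q1, median, q3, data[-1]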

A few examples

>>> box([1])
(1, 1, 1, 1, 1)
>>> box([1, 2])
(1, 1, 1.5, 2, 2)
>>> box([1, 2, 3])
(1, 1, 2, 3, 3)
>>> box([1, 2, 3, 4])
(1, 1.5, 2.5, 3.5, 4)
>>> box([1, 2, 3, 4, 5])
(1, 1.5, 3, 4.5, 5)
>>> box([1, 2, 3, 4, 5, 6])
(1, 2, 3.5, 5, 6)

Edit: A couple of thoughts secondary to the above. Currently, the min, Q1, median, Q3, and max are returned. It feels very wrong to do so, but if the function is going to sort the data (as the current median implementation does), it may be prudent to let users (1) specify whether the data is already sorted, via a flag argument that defaults to False, and (2) specify whether they also want a reference to the sorted data returned. It is not uncommon, particularly in the context of box plots, for people to come up with their own definitions of what counts as an outlier. For example, data points greater than Q3 + 1.5 * (Q3 - Q1) or less than Q1 - 1.5 * (Q3 - Q1) are sometimes considered outliers, and these bounds are often used as the limits of the whiskers in box plots. With the sorted data in hand, users can apply the returned values to filter their data according to their own definition of an outlier. That said, this extra functionality involving additional flag arguments may be a little too specialized.
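
To make the outlier idea concrete, here is a rough sketch of how a caller might use returned quartiles together with the sorted data. The helper name split_outliers and its parameter k are hypothetical and not part of the proposal; the quartile values passed in are the ones the calculator-style method gives for this sample.

```python
def split_outliers(sorted_data, q1, q3, k=1.5):
    """Hypothetical helper: partition data using bounds at k * (Q3 - Q1)."""
    iqr = q3 - q1
    low_bound = q1 - k * iqr
    high_bound = q3 + k * iqr
    inside = [x for x in sorted_data if low_bound <= x <= high_bound]
    outside = [x for x in sorted_data if x < low_bound or x > high_bound]
    return inside, outside


data = sorted([1, 2, 3, 4, 5, 100])
# Assumed inputs: Q1 = 2 and Q3 = 5, so the bounds are -2.5 and 9.5.
inside, outside = split_outliers(data, q1=2, q3=5)
print(inside)   # [1, 2, 3, 4, 5]
print(outside)  # [100]
```

Under this definition the whiskers would end at inside[0] and inside[-1], and a caller who prefers a different multiplier can pass another k.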

rhettinger commented 2 years ago

FWIW, a person can already write min(d), quantiles(d), max(d) and have most of what they need. Other aspects of box-and-whisker plots aren't standard (dots for extreme values and whiskers at the first or second percentiles).
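
For illustration, that combination might look like the following minimal sketch, which relies on the defaults of statistics.quantiles() (n=4, method='exclusive'); the variable names are arbitrary.

```python
from statistics import quantiles

d = [1, 2, 3, 4, 5, 6]
q1, q2, q3 = quantiles(d)  # defaults: n=4, method='exclusive'
summary = (min(d), q1, q2, q3, max(d))
print(summary)  # (1, 1.75, 3.5, 5.25, 6)
```

Note that the default quartiles here (1.75 and 5.25) differ from the 2 and 5 shown for box([1, 2, 3, 4, 5, 6]) earlier in the thread, since the two approaches use different quartile definitions.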

@stevendaprano At some point it would be great if you were to write down some guidance on what is in scope for this module; otherwise, it will fill with clutter. Right now, the module is reasonably well focused on core everyday analytical or descriptive stats and does not venture into plotting.