mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

"Auto" Histogram bins for weighted histograms. #2710

Closed · Erotemic closed this 2 years ago

Erotemic commented 2 years ago

For univariate histogram data with weights, seaborn currently just sets the number of bins to 10 and warns the user that they will need to adjust it themselves.
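For context, here is a minimal way to reproduce the behavior described above (an editor's sketch, not from the original report; the exact warning text may vary by seaborn version):

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'x': rng.normal(size=200),
    'w': rng.randint(1, 10, size=200),
})

# With weights and the default bins, seaborn falls back to a fixed
# bin count (10) and warns that it should be tuned manually.
sns.histplot(data=df, x='x', weights='w')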

Based on the numpy automatic bin selection code, I wrote something similar that seems to work well for weighted histogram points. I'm not sure if there is some theoretical problem with computing bandwidth estimators for weighted data (which may be why numpy doesn't have it either), but the adaptations of the Sturges and Freedman-Diaconis bandwidth estimators I wrote seem to work well, and in any case they might be a better default than a hard-coded 10.
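For reference, numpy's unweighted "auto" rule (mirrored by the compromise in the code below) picks the smaller of the Freedman-Diaconis and Sturges bin widths, i.e. the larger bin count. Note that np.histogram_bin_edges does accept a weights argument, but its documentation states that no current bin estimator uses it:

import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=1000)

# Unweighted case: "auto" takes min(FD bin width, Sturges bin width)
edges = np.histogram_bin_edges(x, bins='auto')
print(len(edges) - 1)  # resulting number of bins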

The MWE of the proposed code is here:

import numpy as np


def _weighted_auto_bins(data, xvar, weightvar):
    """
    Generalized histogram bandwidth estimators for weighted univariate data.

    Example:
        >>> import pandas as pd
        >>> import numpy as np
        >>> n = 100
        >>> to_stack = []
        >>> rng = np.random.RandomState(432)
        >>> for group_idx in range(3):
        >>>     part_data = pd.DataFrame({
        >>>         'x': np.arange(n),
        >>>         'weights': rng.randint(0, 100, size=n),
        >>>         'hue': [f'group_{group_idx}'] * n,
        >>>     })
        >>>     to_stack.append(part_data)
        >>> data = pd.concat(to_stack).reset_index()
        >>> xvar = 'x'
        >>> weightvar = 'weights'
        >>> n_equal_bins = _weighted_auto_bins(data, xvar, weightvar)
        >>> # xdoctest: +REQUIRES(--show)
        >>> import seaborn as sns
        >>> sns.histplot(data=data, bins=n_equal_bins, x='x', weights='weights', hue='hue')
    """
    # Sort by value so the cumulative weights form a weighted ECDF
    sort_df = data.sort_values(xvar)
    values = sort_df[xvar]
    weights = sort_df[weightvar]
    minval = values.iloc[0]
    maxval = values.iloc[-1]

    # Treat the total weight as the effective sample size
    total = weights.sum()
    ptp = maxval - minval  # peak-to-peak range of the data

    # _hist_bin_sqrt = ptp / np.sqrt(total)
    _hist_bin_sturges = ptp / (np.log2(total) + 1.0)

    # Weighted IQR: read the 25th and 75th percentiles off the weighted ECDF
    cumtotal = weights.cumsum().values
    quantiles = cumtotal / cumtotal[-1]
    idx1, idx2 = np.searchsorted(quantiles, [0.25, 0.75])
    iqr = values.iloc[idx2] - values.iloc[idx1]
    _hist_bin_fd = 2.0 * iqr * total ** (-1.0 / 3.0)

    fd_bw = _hist_bin_fd  # Freedman-Diaconis
    sturges_bw = _hist_bin_sturges

    # Same compromise numpy's "auto" rule uses: prefer the smaller
    # bandwidth (i.e. the larger bin count) of FD and Sturges
    if fd_bw:
        bw_est = min(fd_bw, sturges_bw)
    else:
        # limited variance, so we return a len dependent bw estimator
        bw_est = sturges_bw

    # NOTE: these are private numpy helpers and may move between versions
    from numpy.lib.histograms import _get_outer_edges, _unsigned_subtract
    first_edge, last_edge = _get_outer_edges(values, None)
    if bw_est:
        n_equal_bins = int(np.ceil(_unsigned_subtract(last_edge, first_edge) / bw_est))
    else:
        # Width can be zero for some estimators, e.g. FD when
        # the IQR of the data is zero.
        n_equal_bins = 1

    # Never use more bins than there are data points
    n_equal_bins = min(n_equal_bins, len(values))
    return n_equal_bins

If there is interest in this, I can open a PR where something like this replaces the default in seaborn.distributions.plot_univariate_histogram.

If there is some theoretical issue I'm unaware of that makes this strategy invalid, please send me a pointer to more info.

mwaskom commented 2 years ago

My preference is to defer entirely to numpy for this. If you can get them to accept your diff, I'd be happy to update seaborn to use it.

mwaskom commented 2 years ago

I'm going to close this for now but if you do submit this to numpy please link back to this issue; would be great to have this work in seaborn.

Erotemic commented 2 years ago

I did submit the idea to the numpy mailing list, but the moderators haven't approved it yet. Numpy moves pretty slowly, which is why I came here first, but I do agree it would be better to have this in numpy itself, if it is a mathematically valid thing to do; if not, it might still be useful here as a heuristic, because it does seem better than the current method of just setting the bin count to 10.

EDIT: It was approved, and there was one comment, which I missed. I just responded to it; hopefully the discussion picks up.

For reference (and because there are useful links there), the one reply was:

To me, this feels like it might be a better fit for SciPy or possibly statsmodels (but maybe not, since neither has histogram functions anymore). The challenge with weighted estimators is how the weights should be interpreted. Stata covers the most important cases of weights (https://www.reed.edu/psychology/stata/gs/tutorials/weights.html). Would these be frequency weights? Stata supports only frequency weights (https://www.stata.com/manuals/u11.pdf#u11.1.6weight).
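Since the reply raises frequency weights, one sanity check (a sketch by the editor, not from the thread, using the _weighted_auto_bins function proposed above) is that integer frequency weights should behave like repeated observations, so the weighted estimate can be compared against numpy's unweighted "auto" rule on the expanded data:

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
values = rng.normal(size=50)
freq = rng.randint(1, 10, size=50)  # integer frequency weights

# Reference: expand each value by its frequency, then use numpy's rule
expanded = np.repeat(values, freq)
n_ref = len(np.histogram_bin_edges(expanded, bins='auto')) - 1

# The weighted estimator should land in the same ballpark
data = pd.DataFrame({'x': values, 'weights': freq})
n_weighted = _weighted_auto_bins(data, 'x', 'weights')
print(n_ref, n_weighted)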

Erotemic commented 2 years ago

I started a PR to numpy for this feature: https://github.com/numpy/numpy/pull/20655