gvwilson / sql-tutorial

The Querynomicon: An Introduction to SQL for Wary Data Scientists
https://gvwilson.github.io/sql-tutorial/

feat: example of calculating weighted quantiles #14

Open zaneselvans opened 8 months ago

zaneselvans commented 8 months ago

One query we (@catalyst-cooperative) have attempted to write a couple of times for use in data validation tests is the calculation of a weighted quantile. For example, suppose we have a table describing fuel deliveries to power plants, with one column containing the average heat content per unit of fuel and another containing the total number of units of fuel delivered, where the size of individual deliveries can vary by orders of magnitude. To make sure that nothing has gone haywire in our data processing, we'd like to be able to quickly calculate, say: "How many MMBTUs does a ton of coal contain at the 95th percentile, across all the coal deliveries recorded in the table?"

In Python, we do it like this:

import numpy as np
import pandas as pd

def weighted_quantile(data: pd.Series, weights: pd.Series, quantile: float) -> float:
    """Calculate the weighted quantile of a Series or DataFrame column.

    This function takes two columns from a :class:`pandas.DataFrame`: one containing
    an observed value (data), like heat content per unit of fuel, and another
    (weights) containing a quantity, like the number of units of fuel delivered,
    that scales the importance of each observed value in the overall distribution.
    It calculates the value that the weighted distribution takes at a given quantile.

    Args:
        data: A series containing numeric data.
        weights: Weights to use in scaling the data. Must have the same length as data.
        quantile: A number between 0 and 1, representing the quantile at which we want
            to find the value of the weighted data.

    Returns:
        The value in the weighted data corresponding to the given quantile. If there
        are no values in the data, return :data:`numpy.nan`.
    """
    if (quantile < 0) or (quantile > 1):
        raise ValueError("quantile must have a value between 0 and 1.")
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    df = (
        pd.DataFrame({"data": data, "weights": weights})
        .replace([np.inf, -np.inf], np.nan)
        .dropna()
        .sort_values(by="data")
    )
    Sn = df.weights.cumsum()  # noqa: N806
    # This conditional is necessary because sometimes new columns get
    # added to the EIA data, and so they won't show up in prior years.
    if len(Sn) > 0:
        Pn = (Sn - 0.5 * df.weights) / Sn.iloc[-1]  # noqa: N806
        return np.interp(quantile, Pn, df.data)

    return np.nan

This is useful because it lets us look at a weighted histogram of the variable, pick some guard-rails that are specific to that distribution, and save them to check at the end of our data processing.

It would also be useful to be able to apply this calculation to records grouped by fuel type or year, or for a number of different quantiles simultaneously (to avoid repeating the sorting and summing multiple times).
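
As a starting point, here is a rough, untested sketch of how a non-interpolating version of this calculation might look in SQL using window functions (supported in SQLite 3.25+). The fuel_deliveries table and the heat_content, units_delivered, and fuel_type columns are hypothetical stand-ins, and unlike np.interp this returns a step-function quantile: the smallest observed value whose cumulative weight reaches the target fraction.

-- Weighted 95th percentile of heat content, weighting each delivery
-- by the number of units delivered (hypothetical table/column names).
WITH ordered AS (
    SELECT
        heat_content,
        -- Running total of weight, in ascending order of the observed value.
        SUM(units_delivered) OVER (ORDER BY heat_content) AS cum_weight,
        -- Total weight across all rows, for normalization.
        SUM(units_delivered) OVER () AS total_weight
    FROM fuel_deliveries
    WHERE heat_content IS NOT NULL
      AND units_delivered IS NOT NULL
)
SELECT MIN(heat_content) AS heat_content_p95
FROM ordered
WHERE cum_weight >= 0.95 * total_weight;

-- Grouped variant: one weighted quantile per fuel type.
WITH ordered AS (
    SELECT
        fuel_type,
        heat_content,
        SUM(units_delivered) OVER (
            PARTITION BY fuel_type
            ORDER BY heat_content
        ) AS cum_weight,
        SUM(units_delivered) OVER (PARTITION BY fuel_type) AS total_weight
    FROM fuel_deliveries
    WHERE heat_content IS NOT NULL
      AND units_delivered IS NOT NULL
)
SELECT
    fuel_type,
    MIN(heat_content) AS heat_content_p95
FROM ordered
WHERE cum_weight >= 0.95 * total_weight
GROUP BY fuel_type;

Interpolating between adjacent values the way np.interp does would take an additional self-join or LEAD() window step to pair each row with its neighbor, which is part of why a worked example in the tutorial would be helpful.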