mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

[discussion] Adding ECDF to seaborn? #1536

Closed ericmjl closed 4 years ago

ericmjl commented 6 years ago

@mwaskom referencing this tweet re: ECDFs; I have a simple implementation ready to go which I have stored in textexpander, but I think it might be a useful contribution to seaborn users.

The simplest unit of visualization is a scatterplot, for which an API might be:

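import matplotlib.pyplot as plt
import numpy as np

def ecdf(df, column, ax=None, step=True):
    # Fall back to the current axes when none is supplied.
    if ax is None:
        ax = plt.gca()
    # Sorted values on x; cumulative fraction of observations on y.
    x, y = np.sort(df[column]), np.arange(1, len(df) + 1) / len(df)
    if step:
        ax.step(x, y)
    else:
        ax.scatter(x, y)
    return ax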

This plotting unit can easily be inserted into the pairplot as a replacement for the histogram on the diagonal (as an option for end-users, of course, not mandatory). I can also see extensions to other kinds of plots, for example, plotting multiple ECDFs on the same axes object.

As I understand it, distplot exists, and yes, granted, visualizing histograms is quite idiomatic for many users. That said, I do see some advantages of using ECDFs over histograms, the biggest one being that all data points are plotted, so the choice of bins cannot bias how the data look. I have more details in a blog post, but at a high level, the other major advantage I can see is that quantiles can be read off the plot easily. Also, compared to estimating a KDE, we make no assumptions regarding how the data are distributed (though yes, we can debate whether this is a good or bad thing).
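To make the quantile point concrete, here is a minimal sketch using only NumPy. The helper names `ecdf_values` and `read_quantile` are hypothetical and not part of seaborn; the sample data are made up for illustration.

```python
import numpy as np

def ecdf_values(samples):
    """Return sorted samples and the fraction of observations <= each one."""
    x = np.sort(np.asarray(samples, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def read_quantile(samples, q):
    """Smallest observed value whose ECDF height is at least q (0 < q <= 1)."""
    x, y = ecdf_values(samples)
    # searchsorted returns the first index where y >= q.
    idx = np.searchsorted(y, q, side="left")
    return x[idx]

# Illustrative data only.
data = [3.1, 0.4, 2.2, 5.0, 1.7, 4.4, 2.9, 3.8]
x, y = ecdf_values(data)
median = read_quantile(data, 0.5)
```

Every observation appears as one point on the curve, so no bin width or bandwidth has to be chosen before a quantile is looked up.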

If you're open to having ECDFs inside seaborn, I'm happy to work on a PR for this. I might need some guidance on whether there are special things I need to look out for in the codebase (admittedly, it'll be my first PR to seaborn). Please let me know; I'm also happy to discuss more here before taking any action.

deniederhut commented 6 years ago

I'd like to see this, and would be happy to help implement it.

dsaxton commented 6 years ago

I like this idea as well. Perhaps it would be best to plot it as a simple step function as in R's ecdf? (That is, horizontal lines with circles at the discontinuity points, open from below and closed from above.)
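A rough sketch of that style with plain matplotlib might look like the following. It follows the description above (horizontal segments, open marker at the lower level of each jump, closed marker at the upper level) and is not claimed to reproduce R's rendering exactly; the function name `ecdf_r_style` is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

def ecdf_r_style(samples, ax=None):
    """Draw an ECDF as horizontal segments with open/closed jump markers."""
    if ax is None:
        ax = plt.gca()
    # Collapse ties so each distinct value gets a single jump.
    values, counts = np.unique(np.asarray(samples, dtype=float), return_counts=True)
    heights = np.cumsum(counts) / counts.sum()
    lower = np.concatenate([[0.0], heights[:-1]])  # level just before each jump
    # Horizontal segment at each level, running to the next distinct value.
    ax.hlines(heights[:-1], values[:-1], values[1:])
    # Closed marker at the level attained, open marker at the level left behind.
    ax.scatter(values, heights, color="C0", zorder=3)
    ax.scatter(values, lower, facecolors="none", edgecolors="C0", zorder=3)
    ax.set_ylim(-0.05, 1.05)
    return ax
```

Usage would be along the lines of `fig, ax = plt.subplots(); ecdf_r_style([1, 2, 2, 3], ax=ax)`.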

ericmjl commented 6 years ago

@dsaxton great idea! I have updated the code sample above such that the step function is the default, while the scatter is the alternative. This is mostly a convenience thing, just to keep the implementation simple, but yes, if @mwaskom agrees to this PR, then I'm happy to work together with you on the more fancy version you're describing.

mwaskom commented 6 years ago

Sorry folks but I'm basically unavailable to review new features for at least the next few months.

ECDF plots are a nice idea in general, but the API would need a lot of thinking; the proposed example (which I understand is just a proof of concept) would not fit in with existing seaborn tools.

Given that the only manipulation of the data itself can be done in one line I'd strongly encourage you to try to get this into pandas or matplotlib itself.

BTW, you may be aware but you can use an arbitrary unidimensional plotter (including self-defined ones) on the diagonal of a JointGrid, it doesn't need to be a seaborn function to work.
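As a hedged illustration of that pointer, the sketch below passes a self-defined univariate plotter to the diagonal of a `PairGrid` via `map_diag`. Using `PairGrid` rather than `JointGrid` is an assumption on my part, since that is the grid with a diagonal; the function `ecdf_diag` and the random data are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def ecdf_diag(x, **kwargs):
    """Univariate plotter: step ECDF drawn on the currently active axes."""
    values = np.sort(np.asarray(x, dtype=float))
    heights = np.arange(1, values.size + 1) / values.size
    # The grid passes `color` and `label`; forward them to matplotlib.
    plt.gca().step(values, heights, where="post", **kwargs)

# Illustrative data only.
rng = np.random.default_rng(0)
frame = pd.DataFrame({"a": rng.normal(size=50), "b": rng.exponential(size=50)})

grid = sns.PairGrid(frame)
grid.map_diag(ecdf_diag)       # custom function on the diagonal
grid.map_offdiag(plt.scatter)  # ordinary scatterplots elsewhere
```

The grid makes the target axes current before calling the function, which is why `plt.gca()` is enough here and no `ax` argument is needed.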

ericmjl commented 6 years ago

@mwaskom totally understand, no worries! Thanks for taking the time to reply nonetheless :smile:. Also, thanks for the pointers!

mwaskom commented 6 years ago

Gonna leave this open because it would be good to revisit in the future (though if you consider that "add qqplots" is one of the oldest open issues, the timescale of "the future" may be unbounded!)

pgromano commented 5 years ago

For what it's worth, I modified @ericmjl 's original code since I found it useful. Happy to help on the development.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def ecdf(x=None, data=None, ax=None, step=True, palette=None, **kwargs):
    """ Empirical Cumulative Distribution Function

    Arguments
    -----------
    x : str or array-like, optional
        Inputs for plotting long-form data.
    data : DataFrame, array, or list of arrays, optional
        Dataset for plotting. If `x` and `y` are absent, this is interpreted as wide-form.
        Otherwise, data is expected to be long-form.
    ax : matplotlib.axes, optional
        Axes object to draw the plot onto, otherwise uses the current axes.
    step : bool, optional
        Whether or not to plot ECDF as horizontal steps
    palette : palette name, list, or dict, optional
        Colors to use for the different levels of the `hue` variable. Should be something that
        can be interpreted by `color_palette()` or a dictionary mapping hue levels to
        matplotlib colors.
    **kwargs : Other keyword arguments are passed through to `plt.step` or `plt.scatter` at draw time

    """

    # if no axes object create one
    if ax is None:
        fig, ax = plt.subplots()

    # set palette if passed
    if palette is not None:
        sns.set_palette(palette)

    # safety check on data
    if x is None and data is None:
        raise ValueError('No data passed')

    if isinstance(x, str) and data is None:
        raise ValueError('Unable to understand how to interpret data')

    if isinstance(x, str) and data is not None:
        xlabel = x
        x = data[x]
    elif isinstance(x, pd.Series):
        xlabel = x.name
    elif isinstance(x, np.ndarray):
        xlabel = 'X'
    ylabel = 'ECDF'

    # sort values and compute the cumulative fraction of observations
    x_val, cdf = np.sort(x), np.arange(1, len(x) + 1) / len(x)
    if step:
        ax.step(x_val, cdf, **kwargs)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
    else:
        ax.scatter(x_val, cdf, **kwargs)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
    return ax

grofte commented 4 years ago

> For what it's worth, I modified @ericmjl 's original code since I found it useful. Happy to help on the development.

This doesn't seem to work when x is a list. Error message:

UnboundLocalError                         Traceback (most recent call last)
<ipython-input-69-ea8913fd0868> in <module>
----> 1 ecdf(my_list)

<ipython-input-68-995e6d97b86c> in ecdf(x, data, ax, step, palette, **kwargs)
     51     if step:
     52         ax.step(x_val, cdf, **kwargs)
---> 53         plt.xlabel(xlabel)
     54         plt.ylabel(ylabel)
     55     else:

UnboundLocalError: local variable 'xlabel' referenced before assignment
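
The error arises because `xlabel` is only assigned for strings, Series and NumPy arrays, so a plain list falls through every branch. One possible way around it, sketched here with a hypothetical helper `resolve_x` that is not part of the posted code, is to resolve the values and the label together with a default:

```python
import numpy as np

def resolve_x(x, data=None):
    """Return (values, label) for column names, Series-like objects, arrays and lists."""
    if isinstance(x, str):
        if data is None:
            raise ValueError("A column name was given but no data was passed")
        return np.asarray(data[x]), x
    # pandas Series carry a `name`; plain lists and arrays do not.
    label = getattr(x, "name", None)
    return np.asarray(x), label if label is not None else "X"

values, label = resolve_x([3, 1, 2])
```

With this, the label always has a value before the plotting calls are reached, whichever input type is passed.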

venaturum commented 4 years ago

Here is an alternative implementation I use, which is based on lineplot and compatible with all lineplot features (except for confidence intervals, which don't make sense here).

    import numpy as np
    import seaborn

    def ecdf(x=None, hue=None, size=None, style=None, data=None,
             palette=None, hue_order=None, hue_norm=None,
             sizes=None, size_order=None, size_norm=None,
             dashes=True, markers=None, style_order=None,
             units=None, sort=True, err_style="band", err_kws=None,
             legend="brief", ax=None, **kwargs):

        percentile_column = '_ecdf_percentile'

        if data is None:
            # Vector input: sort the values and pair them with percentiles.
            x = np.sort(x)
            y = np.linspace(0, 100, x.size)

        else:

            def create_percentile(sub_data, col):
                # Sort into a new frame so the caller's DataFrame is not modified.
                sub_data = sub_data.sort_values(col)
                sub_data[percentile_column] = np.linspace(0, 100, sub_data.shape[0])
                return sub_data

            y = percentile_column
            # Keep only the grouping variables that were actually supplied.
            groups = [g for g in (hue, size, style) if g is not None]

            if groups == []:
                data = create_percentile(data, x)
            else:
                data = data.copy().groupby(groups).apply(create_percentile, x)

        return seaborn.lineplot(x=x, y=y, hue=hue, size=size, style=style, data=data,
            palette=palette, hue_order=hue_order, hue_norm=hue_norm,
            sizes=sizes, size_order=size_order, size_norm=size_norm,
            dashes=dashes, markers=markers, style_order=style_order,
            units=units, estimator=None, ci=None, n_boot=None,
            sort=sort, err_style=err_style, err_kws=err_kws, legend=legend,
            ax=ax, **kwargs,
        )
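
Note that this version assigns percentiles with `np.linspace(0, 100, n)`, whereas the earlier snippets in this thread use `np.arange(1, n + 1) / n`. The two conventions place the points at different heights, as this small comparison shows (the variable names are only for illustration):

```python
import numpy as np

n = 5
# Evenly spaced from 0 to 100: the smallest observation sits at 0.
linspace_pct = np.linspace(0, 100, n)
# Fraction of observations <= each value, in percent: the smallest sits at 100/n.
fraction_pct = 100 * np.arange(1, n + 1) / n
```

For five observations the first gives 0, 25, 50, 75, 100 and the second gives 20, 40, 60, 80, 100; the difference shrinks as the sample grows.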

mwaskom commented 4 years ago

This is almost done. Feel free to weigh in on #2141 if you have thoughts about features or implementation.

mwaskom commented 4 years ago

Closed by #2141