mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License
12.51k stars 1.92k forks

Feature request: Timeseries distribution plot #3101

Closed EwoutH closed 1 year ago

EwoutH commented 2 years ago

Timeseries distribution plot

Sometimes you have a large population dataset that you want to visualize over time. Being able to quickly view how population values change over time could be very useful in this case. Two example plots are shown below.

The goal is a plot that shows a separate band for each range of the distribution. For example, the median is a solid line, the middle 50% of values (25th to 75th percentile) is the darkest band, the middle 90% (5th to 95th percentile) is a lighter tint, and the middle 98% (1st to 99th percentile) is the lightest tint.

Naming and values

The function could be named seaborn.timeseries_distribution() and take a Pandas Series or a dictionary as input.

Using the values above, the default values could be:

seaborn.timeseries_distribution(data, band_percentiles=[50, 90, 98])
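To make the proposed band_percentiles argument concrete, here is a minimal sketch of how central band widths could map to lower and upper percentiles. The helper name band_bounds is hypothetical and not part of seaborn; it only illustrates the arithmetic behind the [50, 90, 98] defaults.

```python
def band_bounds(band_percentiles):
    """Map central band widths (in percent) to (lower, upper) percentile pairs.

    Hypothetical helper for illustration only; not part of seaborn.
    """
    bounds = []
    for width in band_percentiles:
        if not 0 < width < 100:
            raise ValueError(f"band width must be between 0 and 100, got {width}")
        tail = (100 - width) / 2  # share of values left out on each side
        bounds.append((tail, 100 - tail))
    return bounds


print(band_bounds([50, 90, 98]))
# [(25.0, 75.0), (5.0, 95.0), (1.0, 99.0)]
```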

Input data

The input should be data containing multiple values for each point in time. This could be in the form of a Pandas series:

Timestep Values
0 0.5
0 0.6
0 0.2
0 0.3
1 0.4
1 0.7
... ...

Here the timestep is the index and the values are the Series values.

A dict, with the timestep as key and the values in a list, could also be possible:

{0: [0.5, 0.6, 0.2, 0.3],
 1: [0.4, 0.7, 0.3, 0.5],
 ...}

A DataFrame might also be possible, where the index is still the timestep and the column you select contains the values.

Sample data

Here is some sample data in Series form, containing 200 values for each of the 100 timestamps: wealth.zip.

The Pickle file in the ZIP file can be read with:

import pandas as pd
wealth = pd.read_pickle("wealth.pickle")

And converted to a dict with:

wealth_dict = wealth.groupby(level=0).agg(list).to_dict()
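For readers without the ZIP file, the two input forms can be tried on the small dict shown earlier. This is a sketch under the assumption that the Series uses the timestep as a repeated index, as described above; the index and Series names are taken from the example table.

```python
import pandas as pd

# Values taken from the dict example earlier in this post: timestep -> list of values.
wealth_dict = {
    0: [0.5, 0.6, 0.2, 0.3],
    1: [0.4, 0.7, 0.3, 0.5],
}

# Dict -> Series: one row per value, with the timestep repeated in the index.
wealth = pd.Series(wealth_dict).explode().astype(float)
wealth.index.name = "Timestep"
wealth.name = "Values"

# Series -> dict: the same groupby conversion as above, applied to the small example.
round_trip = wealth.groupby(level=0).agg(list).to_dict()
print(round_trip == wealth_dict)
```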

Existing examples

As for what it could look like, here are two examples I found:

[Two example images: unnamed-chunk-18-1, unnamed-chunk-17-1]

Source: https://minimizeregret.com/post/2020/06/07/rediscovering-bayesian-structural-time-series/

Scope

Such plots could be very useful for simulation models or for population data gathered over time. I'm curious if it might be in the scope of Seaborn.

EwoutH commented 2 years ago

I managed to create this plot from the wealth dataset listed above using matplotlib:

[Image: matplotlib plot of the wealth dataset]

It required a lot of custom code however, so if seaborn could provide an interface for this that would be amazing!
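The custom code itself is not shown in this thread. As a rough idea of what a plain matplotlib version might involve, here is a minimal sketch that uses synthetic data in place of the wealth dataset; the data, band choices, and alpha value are all assumptions, not the code that produced the image above.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the wealth data: 100 timesteps x 200 values per timestep.
rng = np.random.default_rng(0)
values = rng.normal(size=(100, 200)).cumsum(axis=0)
timesteps = np.arange(values.shape[0])

fig, ax = plt.subplots()
# Draw the widest band first; overlapping translucent bands darken toward the median.
for lower, upper in [(1, 99), (5, 95), (25, 75)]:
    lo, hi = np.percentile(values, [lower, upper], axis=1)
    ax.fill_between(timesteps, lo, hi, color="C0", alpha=0.2, linewidth=0)
ax.plot(timesteps, np.median(values, axis=1), color="C0")
ax.set(xlabel="Timestep", ylabel="Value")
# Call plt.show() or fig.savefig(...) to view the result.
```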

mwaskom commented 2 years ago

Is this not something that could be accomplished with lineplot, e.g.:

import seaborn as sns

fmri = sns.load_dataset("fmri").query("region == 'parietal'")
# Each call redraws the median line and adds one percentile-interval band.
for interval in [50, 90, 95, 99]:
    sns.lineplot(
        fmri, x="timepoint", y="signal",
        estimator="median", errorbar=("pi", interval),
        color="C0",
    )

[Image: output of the lineplot example above]

There are a couple of downsides here; it requires some redundant computation of the median (maybe slow for large datasets) and lineplot isn't completely flexible about how you style the error bands (namely you can't change the color, just the alpha — though I'd be open to a PR allowing that).
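If the repeated computation turns out to matter, one option is to compute every band edge and the median in a single groupby call and plot from that table. The sketch below uses synthetic long-form data (an assumption for illustration) and only shows the computation; whether it is faster in practice has not been measured here.

```python
import numpy as np
import pandas as pd

# Synthetic long-form data (assumed for illustration): 20 values at each of 5 timepoints.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timepoint": np.repeat(np.arange(5), 20),
    "signal": rng.normal(size=100),
})

# A single groupby call yields every band edge plus the median (the 0.5 quantile).
quantiles = [0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]
bands = df.groupby("timepoint")["signal"].quantile(quantiles).unstack()
print(bands.round(2))
```

Each row of bands is one timepoint and each column is one quantile, so pairs such as the 0.25 and 0.75 columns can be passed directly to a band-drawing call.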

But it feels easier and more flexible with the objects interface:

import seaborn.objects as so

p = so.Plot(fmri, "timepoint", "signal")
for tail in [25, 10, 5, 1]:
    p = p.add(so.Band(), so.Perc([tail, 100 - tail]))
p.add(so.Line(), so.Agg("median"))

EwoutH commented 2 years ago

For my example dataset it works, and it's so extremely simple! I had no idea it could be so easy.

for interval in [50, 90, 98]:
    plot = sns.lineplot(wealth, estimator="median", errorbar=("pi", interval), color="C0")

[Image: wealth_plot_seaborn]

Seaborn truly is amazing!

Could you leave this issue open? I would like to document this in an example. Then we can close this issue.

EwoutH commented 1 year ago

So I want to add this one as an example, but I think a legend showing which band represents which percentile would be useful. Is that possible, and if so, how could I add one using the objects interface?

p = so.Plot(fmri, "timepoint", "signal")
for tail in [25, 10, 5, 1]:
    p = p.add(so.Band(), so.Perc([tail, 100 - tail]))
p.add(so.Line(), so.Agg("median"))

mwaskom commented 1 year ago

You're basically asking about https://github.com/mwaskom/seaborn/issues/3046, but if you're going to get differently-colored bands through alpha compositing, then a simple legend for each band isn't going to make much sense.
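The alpha compositing point can be illustrated with a little arithmetic. When several translucent bands of the same color overlap, the visible opacity depends on how many layers cover a region, not on any single band. The sketch below assumes each band is drawn with an alpha of 0.2; the function name is made up for illustration.

```python
def composite_alpha(alpha, layers):
    """Opacity of `layers` stacked patches of the same color, each drawn with `alpha`."""
    return 1 - (1 - alpha) ** layers


# Four nested bands, as in the objects-interface example; 0.2 per band is an assumed value.
for layers in range(1, 5):
    print(layers, round(composite_alpha(0.2, layers), 4))
# 1 0.2
# 2 0.36
# 3 0.488
# 4 0.5904
```

So the innermost region appears at roughly 59% opacity even though no individual band was drawn that way, which is why a legend entry per band would not match any color seen in the plot.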

Also I would prefer that the example gallery remain restricted to the function interface for now.