mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

Countplot has inappropriate response to xaxis being a date column #2696

Closed JMBurley closed 2 years ago

JMBurley commented 2 years ago

There is a bug in countplot when the x axis is a date column.

The raw seaborn plot shows apparently correct dates, but any interaction with the matplotlib axes object shows that the date information has been damaged and appears to be near the start of unix epoch time.

I believe this is a bug where, somehow, the year information in the original date column is destroyed while making the countplot.

Versions

Reproducible example

Code below generates lineplots and countplots.

The lineplots have correct dates on the x axis. The countplot displays dates in Jan 1970 (start of unix epoch time) when mdates formatting is applied.

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import datetime as dt

# -- Load and clean subset of taxis data
taxis = sns.load_dataset('taxis')
taxis['pickup'] = pd.to_datetime(taxis['pickup'])
taxis['pickup_date'] = taxis['pickup'].dt.date
date_mask = (taxis['pickup_date']>=dt.date(2019,3,1)) & (taxis['pickup_date']<=dt.date(2019,3,3))
taxis = taxis[date_mask].sort_values(by=['pickup_date'])

def format_time_axis_example(ax):
    """Toy example using mdates to tidy dates on x axis"""
    # setup formats
    days = mdates.DayLocator(bymonthday=range(1, 32))  # every day of the month (1..31)
    minor_fmt = mdates.DateFormatter("%d")  # eg. 05
    months = mdates.MonthLocator()  # every month
    major_fmt = mdates.DateFormatter("%b%n%Y")  # eg. Jan 2021

    # format the ticks
    ax.xaxis.set_major_locator(months)
    ax.xaxis.set_major_formatter(major_fmt)
    ax.xaxis.set_minor_locator(days)
    ax.xaxis.set_minor_formatter(minor_fmt)

    # ensure ticks exist and control appearance
    ax.tick_params(axis="x", which="both", bottom=True)  # major+minor displayed

# -- Lineplot with correct response to time axis formatting
print(f"Actual Date Range: {taxis['pickup_date'].min()} -- {taxis['pickup_date'].max()}")
ax = sns.lineplot(x='pickup_date', y='total',data=taxis)
plt.xticks(rotation=45)
plt.show()
ax = sns.lineplot(x='pickup_date', y='total',data=taxis)
format_time_axis_example(ax)
plt.show()

# -- Countplot with incorrect response to time axis formatting
print(f"Actual Date Range: {taxis['pickup_date'].min()} -- {taxis['pickup_date'].max()}")
ax = sns.countplot(x='pickup_date', data=taxis)
plt.show()
ax = sns.countplot(x='pickup_date', data=taxis)
format_time_axis_example(ax)
plt.show()

Lineplot working as expected: image

Countplot with damaged date information on the x-axis, rendering in mdates as 1970: image

mwaskom commented 2 years ago

Hi, this is a limitation, but not a bug. The behavior of the categorical plots is documented in a number of places; perhaps this FAQ entry will help explain what's happening.

mwaskom commented 2 years ago

In general, the way to work around this is to apply your date formatting before passing the data to the categorical function, so that it gets strings formatted the way you want them. But for countplot specifically, you can use histplot instead, which is able to leverage matplotlib's more recent unit functionality:

ax = sns.histplot(x='pickup_date', data=taxis, discrete=True, shrink=.8)
format_time_axis_example(ax)
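
For the general workaround (formatting dates as strings before they reach a categorical function), a minimal sketch follows. It uses only the standard library so it runs on its own; the helper name `date_labels` and the sample dates are hypothetical, and the commented-out seaborn call assumes `seaborn` is imported as `sns`.

```python
import datetime as dt
from collections import Counter


def date_labels(dates, fmt="%Y-%m-%d"):
    """Return one formatted string per date, in chronological order.

    Hypothetical helper: categorical plots draw categories in order of
    appearance, so the dates are sorted before being turned into strings.
    """
    return [d.strftime(fmt) for d in sorted(dates)]


# Made-up pickup dates standing in for taxis['pickup_date']
pickups = [
    dt.date(2019, 3, 2),
    dt.date(2019, 3, 1),
    dt.date(2019, 3, 3),
    dt.date(2019, 3, 1),
]

labels = date_labels(pickups)
counts = Counter(labels)  # the per-category counts a count plot would draw
print(counts)

# Assumed follow-up (not run here): the strings are already formatted,
# so no mdates locator or formatter is needed afterwards.
# ax = sns.countplot(x=labels)
```

With this approach the tick labels are plain strings, so `format_time_axis_example` should not be applied to the resulting axes.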


JMBurley commented 2 years ago

Thanks for the expert commentary, @mwaskom. That all makes sense.

I think that this issue will get its approximate resolution in #2429. Happy for this to be closed or kept open as suits the project.

I suspect this issue should be closed. Optionally, there could be a general issue tracking the idea that preserving date information in categorical plots, in a manner that interacts correctly with mdates formatting, would be a nice enhancement (although perhaps prohibitively annoying to code given how things are currently constructed).

PS. I'm somewhat fascinated that a [0:1:n] indexed list gets interpreted by mdates as days-from-1970, because that is a very very unusual way to record unix time. But that's not a topic directly relevant to seaborn.
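
The 1970 dates can be reproduced without any plotting library. The sketch below is a standard-library model of the "days since epoch" reading, not matplotlib's actual code; the `EPOCH` constant and the function name `position_to_datetime` are assumptions made for illustration. A categorical axis places its bars at positions 0, 1, 2, ..., and reading those numbers as days since 1970-01-01 gives the first days of January 1970.

```python
import datetime as dt

# Assumed epoch for this sketch: 1970-01-01
EPOCH = dt.datetime(1970, 1, 1)


def position_to_datetime(x):
    """Interpret a plain number as a count of days since EPOCH."""
    return EPOCH + dt.timedelta(days=x)


# Categorical plots place their bars at integer positions 0, 1, 2, ...
positions = [0, 1, 2]
as_dates = [position_to_datetime(p).date() for p in positions]
print(as_dates)
```

Three bars for 1-3 March 2019 therefore sit at x = 0, 1, 2 and are labelled by a date formatter as 1-3 January 1970.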

JMBurley commented 2 years ago

And to provide a full code shortcut for anyone stumbling across this issue, the fullest replication of countplot by histplot is to use multiple='dodge' as below, which should cover (the vast majority of?) countplot use cases.

ax = sns.histplot(x='pickup_date', hue='payment', data=taxis, discrete=True, shrink=.8, multiple='dodge', bins=taxis['pickup_date'].nunique())
format_time_axis_example(ax)


mwaskom commented 2 years ago

> I think that this issue will get its approximate resolution in #2429. Happy for this to be closed or kept open as suits the project.

Yes, exactly. That work is stalled on some deeper refactoring, but this will at some point work better in countplot.

> PS. I'm somewhat fascinated that a [0:1:n] indexed list gets interpreted by mdates as days-from-1970, because that is a very very unusual way to record unix time. But that's not a topic directly relevant to seaborn.

Yes this is just matplotlib's way of doing things. What would you have expected, seconds since unix epoch?

> ... which should cover (the vast majority of?) countplot use cases

Yeah histplot can just about replace countplot. Really, it's a lot more powerful, as it has features like stacking and normalization that countplot lacks. The one advantage of countplot is the order parameter, which histplot lacks because it's not a categorical function.

JMBurley commented 2 years ago

> What would you have expected, seconds since unix epoch?

Yep.

I've been around a lot of datetime datasets from lots of businesses / scientific instruments, and unix epoch as an integer count in days has never been the raw data type.

I wonder if mdates has some smart fallback where, when it sees a small integer range, it presumes days when it would otherwise have defaulted to seconds or microseconds.

mwaskom commented 2 years ago

> integer count in days has never been the raw data type

To be clear, it's a float not an integer; sub-day resolution is possible through fractional values. Actually, using days rather than seconds gives you microsecond precision over a much wider range of dates.
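
To make the fractional-day point concrete, here is a small standard-library sketch of a float "day number" round trip. It models the convention only and is not matplotlib's implementation; `EPOCH`, `to_day_number`, and `from_day_number` are hypothetical names, and the 1970-01-01 epoch is an assumption of the sketch.

```python
import datetime as dt

# Assumed epoch for this sketch: 1970-01-01
EPOCH = dt.datetime(1970, 1, 1)
ONE_DAY = dt.timedelta(days=1)


def to_day_number(moment):
    """Convert a datetime to a float count of days since EPOCH."""
    return (moment - EPOCH) / ONE_DAY


def from_day_number(x):
    """Convert a float count of days since EPOCH back to a datetime."""
    return EPOCH + dt.timedelta(days=x)


noon = dt.datetime(2019, 3, 1, 12, 0)
x = to_day_number(noon)
print(x)                   # 17956.5: the .5 encodes 12:00
print(from_day_number(x))  # 2019-03-01 12:00:00
```

The integer part selects the calendar day and the fraction selects the time within it, which is how a day-based float can carry sub-day resolution.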

> wonder if mdates has some smart fallback

Nope, it's pretty straightforward (though they did recently realign the epoch, which used to be 0000-12-31; basically it was never originally intended to be "unix time").