Closed JMBurley closed 2 years ago
Hi, this is a limitation, but not a bug. The behavior of the categorical plots is documented in a number of places; perhaps this FAQ entry will help explain what's happening.
In general, the way to work around this would be to apply your date formatting before passing the data to the categorical function, so it gets strings formatted the way you want them. But for countplot specifically, you can use histplot instead, which is able to leverage matplotlib's more recent unit functionality:
ax = sns.histplot(x='pickup_date', data=taxis, discrete=True, shrink=.8)
format_time_axis_example(ax)
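For the general workaround (formatting the dates as strings before the categorical function sees them), a minimal sketch might look like the following. The helper to_date_labels and the sample dates are hypothetical and use only the standard library; the idea is that the category order is derived from the sorted dates rather than the sorted strings, so non-ISO formats still come out chronologically.

```python
from datetime import date


def to_date_labels(dates, fmt="%Y-%m-%d"):
    """Format each date as a string and return (labels, chronological order).

    Hypothetical helper: works on any iterable of objects that are sortable,
    hashable, and have .strftime (datetime.date, datetime.datetime, ...).
    Missing values are not handled here.
    """
    dates = list(dates)
    labels = [d.strftime(fmt) for d in dates]
    # Sort the dates themselves, then format; dict.fromkeys removes duplicate
    # labels while keeping the chronological order.
    order = list(dict.fromkeys(d.strftime(fmt) for d in sorted(set(dates))))
    return labels, order


# Made-up sample data standing in for a date column.
pickups = [date(2019, 3, 2), date(2019, 3, 1), date(2019, 3, 2), date(2019, 2, 28)]
labels, order = to_date_labels(pickups, fmt="%d/%m/%Y")
print(order)  # ['28/02/2019', '01/03/2019', '02/03/2019']
```

The resulting strings can then be handed to the categorical function, for example something along the lines of sns.countplot(x=labels, order=order), so the tick labels are already final and no mdates formatter needs to be applied afterwards.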
Thanks for the expert commentary @mwaskom . That all makes sense.
I think that this issue will get its approximate resolution in #2429. Happy for this to be closed or kept open as suits the project.
I suspect this issue should be closed. Optionally, there could be a general issue tracking the idea that preserving date information in categorical plots, in a manner that interacts correctly with mdates formatting, would be a nice enhancement (although perhaps prohibitively annoying to code given how things are currently constructed).
PS. I'm somewhat fascinated that a [0:1:n] indexed list gets interpreted by mdates as days-from-1970, because that is a very, very unusual way to record unix time. But that's not a topic directly relevant to seaborn.
And to provide a full code shortcut for anyone stumbling across this issue: the fullest replication of countplot by histplot is to use multiple='dodge' as below, which should cover (the vast majority of?) countplot use cases.
ax = sns.histplot(x='pickup_date', hue='payment', data=taxis, discrete=True, shrink=.8, multiple='dodge', bins=taxis['pickup_date'].nunique())
format_time_axis_example(ax)
> I think that this issue will get its approximate resolution in #2429. Happy for this to be closed or kept open as suits the project.
Yes, exactly. That work is stalled on some deeper refactoring, but this will at some point work better in countplot.
> PS. I'm somewhat fascinated that a [0:1:n] indexed list gets interpreted by mdates as days-from-1970, because that is a very, very unusual way to record unix time. But that's not a topic directly relevant to seaborn.
Yes this is just matplotlib's way of doing things. What would you have expected, seconds since unix epoch?
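To make the "days from 1970" reading concrete, here is a minimal sketch of the arithmetic. It assumes matplotlib's default epoch of 1970-01-01 (the default since matplotlib 3.3) and that the categorical plot draws its bars at positions 0, 1, 2, and so on. The helper days_to_datetime is hypothetical and only mimics the conversion with the standard library; it does not call matplotlib.

```python
from datetime import datetime, timedelta, timezone

# Assumed epoch: matplotlib's default since version 3.3.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_to_datetime(days):
    """Interpret a number as days since EPOCH (hypothetical helper)."""
    return EPOCH + timedelta(days=days)


# Bar positions of a categorical plot are plain integers, not dates.
for position in range(3):
    print(position, days_to_datetime(position).date())
# 0 1970-01-01
# 1 1970-01-02
# 2 1970-01-03
```

Under these assumptions, a date formatter applied to a categorical axis never sees the original dates, only the integer positions, which is why every tick lands in early January 1970.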
> ... which should cover (the vast majority of?) countplot use cases
Yeah, histplot can just about replace countplot. Really, it's a lot more powerful, as it has features like stacking and normalization that countplot lacks. The one advantage of countplot is the order parameter, which histplot lacks because it's not a categorical function.
> What would you have expected, seconds since unix epoch?
Yep.
I've been around a lot of datetime datasets from lots of businesses / scientific instruments, and unix epoch as an integer count in days has never been the raw data type.
I wonder if mdates has some smart fallback where, when it sees a small integer range, it presumes days when it would otherwise have defaulted to seconds or microseconds.
> integer count in days has never been the raw data type
To be clear, it's a float not an integer; sub-day resolution is possible through fractional values. Actually, using days rather than seconds gives you microsecond precision over a much wider range of dates.
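As a rough illustration of fractional days, the sketch below converts a timestamp to a float day count and estimates how finely a 64-bit float can resolve time near that value. It assumes an epoch of 1970-01-01 UTC; datetime_to_days and resolution_in_microseconds are hypothetical helpers built on the standard library (math.ulp needs Python 3.9 or later), not matplotlib functions.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumed epoch for this sketch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_to_days(dt):
    """Return dt as a float number of days since EPOCH (hypothetical helper)."""
    return (dt - EPOCH) / timedelta(days=1)


def resolution_in_microseconds(days):
    """Gap to the next representable float at this day count, in microseconds."""
    return math.ulp(days) * 86400 * 1e6


noon = datetime(2021, 1, 1, 12, tzinfo=timezone.utc)
days = datetime_to_days(noon)
print(days)  # 18628.5 (the .5 is the half day up to noon)
print(resolution_in_microseconds(days) < 1.0)  # True
```

So for present-day dates, under these assumptions, a float day count distinguishes instants less than a microsecond apart even though the unit is days.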
> wonder if mdates has some smart fallback
Nope, it's a pretty straightforward conversion (though they did recently realign the epoch, which used to be 0000-12-31; basically it was never originally intended to be "unix time").
There is a bug in countplot when the x axis is a date column.
The raw seaborn plot shows apparently correct dates, but any interaction with the matplotlib axes object shows that the date information has been damaged and appears to be near the start of unix epoch time.
I believe this is a bug where, somehow, the year information in the original date column is destroyed while making the countplot.
Versions
Reproducible example
Code below generates lineplots and countplots.
The lineplots have correct dates on the x axis. The countplot displays dates in Jan 1970 (start of unix epoch time) when mdates formatting is applied.
Lineplot working as expected:
Countplot with damaged date information on the x-axis, rendering in mdates as 1970: