Open joooeey opened 1 year ago
Thanks @joooeey for the report
this actually works if you use matplotlib directly
This confirms that the issue's indeed on the pandas.plotting side
I suspect (but haven't checked) that somewhere in pandas.plotting, there's an assumption that pandas datetimes still all use nano-second resolution
Contributions would be welcome, this one might not be too tricky for newcomers
take
From my understanding so far, the issue is that matplotlib requires the number of days from the epoch (01-01-1970) for the datetime conversion, while pandas passes a float value of seconds elapsed since the epoch.
The missing piece is to understand where exactly this float value in seconds comes from in pandas. Possibly from TimeSeries_DateFormatter.set_locs()
but need to investigate further to better understand, which I'll continue to do in the next days.
thanks for the investigation @PrimeF ! sounds good - take your time, no hurry
@MarcoGorelli: I've looked into the issue further and the Period values in seconds actually originate from: https://github.com/pandas-dev/pandas/blob/103d3b2bb8912c87f3ec11c1048e7e0a19225fef/pandas/plotting/_matplotlib/converter.py#L252-L255
In order to get the date in elapsed days since epoch as required by the operations performed in the DateFormatter (see https://github.com/matplotlib/matplotlib/blob/2e2d2d5f574ad43ba87fc893098345db5eb1eacc/lib/matplotlib/dates.py#L357) one could instead leverage this branch https://github.com/pandas-dev/pandas/blob/103d3b2bb8912c87f3ec11c1048e7e0a19225fef/pandas/plotting/_matplotlib/converter.py#L256-L257 and pass freq='D'
.
However, this solutions seems to be highly "customised" to this issue, thus would require some specific conditions to avoid erroneously jumping in the branch and breaking other things down the line. I am actually not sure if this approach is the best one to follow, hence would appreciate any type of feedback.
thank for the investigation, I'll take a look
This bug has been discussed on the matplotlib side and there it sounded like being an issue with a different definition of the epoch. Maybe this helps. I'll also add a reference to this issue on the matplotlib issue.
I dug into this a bit since the PeriodConverter
logic looked very strange to me (as an end-user), and it seems like the bulk of the PeriodConverter
logic was written in 2012.
It seems to me possibly that PeriodConverter
had only ever been intended to be used with freq="D". Otherwise, PeriodConverter._convert_1d()
can return very different values depending on the frequency of the given periodic time series:
period = Period("2023-11-11 00:01.000", "T")
get_datevalue(period, "D") # 19672, convention that matplotlib uses e.g. matplotlib.dates.date2num(pd.Timestamp("2023-11-11 00:01.000"))
get_datevalue(period, "T") # 28327681
get_datevalue(period, "L") # 169947861900
So if a small time frequency like "T" (seconds) is given, at some point matplotlib will call something like num2date()
on view limits around 28327681
to format the x-labels, at which point an OverflowError
will be raised. Contrast with DatetimeConverter
which carefully calls mdates.date2num()
, and so will return values around 19672
.
and pass freq='D'.
Unfortunately changing the frequency from "T" to "D" here would lose resolution, e.g. all of the timestamps in the above example would get cast to 19672. It seems to me that the most reasonable thing to do would be to just defer to DatetimeConverter
and possibly remove PeriodConverter
entirely, although I could be missing some context where PeriodConverter
is actually useful.
Another frustrating issue that led me here is that the pandas plotting converters will try to auto-infer a periodic frequency, even if the index to be plotted is a non-periodic DatetimeIndex
:
import pandas as pd
import matplotlib.dates as mdates
ts = pd.Timestamp("2023-11-11 00:00:00", tz="UTC")
ts_plus_n = lambda x: ts + pd.Timedelta("1S") * x # add x seconds
s = pd.Series([0, 1, 2, 3], index=[ts, ts_plus_n(1), ts_plus_n(2), ts_plus_n(99)])
# assert s.index.inferred_freq is None and s.iloc[0:3].index.inferred_freq == "S"
In [2]: s.iloc[0:4].plot().xaxis.set_major_formatter(mdates.DateFormatter("%X")) # plots as expected
In [3]: s.iloc[0:3].plot().xaxis.set_major_formatter(mdates.DateFormatter("%X")) # OverflowError!
It seems to me that the most reasonable thing to do would be to just defer to DatetimeConverter and possibly remove PeriodConverter entirely, although I could be missing some context where PeriodConverter is actually useful.
I welcome removing anything Period-related, fancy trying this out and making a PR?
Seems like there is some history to this and its been debated a few times: #7670 #9053 #15071 #18768 #26253
PeriodConverter
is that some custom tick labelling can be done by pandas.jorisvandenbossche: The question is what the reason is that we convert DatetimeIndex to periods for plotting. The reasons I can think of: Performance. Currently, the regular plotting is faster (so for a regular series
ts.plot()
is faster asts.plot(x_compat=True)
). However, I think this could be solved as most of the time is spent in converting the datetimes to floats (which should be vectorizable). Nicer tick label locations and formatting. This is a clear plus, our (convoluted) ticklocators and formatters give much nicer results as the default matplotlib (IMO)
Wes even replied at one point
I am not especially attached to it -- if you can unify / have a single code path for plotting without significantly changing functionality, sounds good to me.
IMHO the user should explicitly request the period-based formatting instead of pandas
trying to automagically figure it out, but removing PeriodConverter
will likely cause the default tick formatting for periodic time series to change, so it might still be a bit controversial.
I think the following could be done as first steps:
DatetimeIndex
. It seems to drive more confusion than help: for example the issues linked above and a few stackoverflow qs like https://stackoverflow.com/questions/21189954/. It's sufficient to remove this branch (can emit a FutureWarning
if a softer transition is desired): https://github.com/pandas-dev/pandas/blob/157631d97840b7918eec4c8b40bd9c24b25771a7/pandas/plotting/_matplotlib/timeseries.py#L268pandas/plotting/_matplotlib/converter.py
, we can change PeriodConverter._convert_1d
https://github.com/pandas-dev/pandas/blob/157631d97840b7918eec4c8b40bd9c24b25771a7/pandas/plotting/_matplotlib/converter.py#L265 to call mdates.date2num()
(for example by calling DatetimeConverter
) so that all time series plots will be plotted with the x value units that matplotlib
expects. To soften the transition and allow the period tick formatting code in TimeSeries_Date{Locator,Formatter}
to still work, the reverse mapping lambda x: Period(mdates.date2num(x), freq=self.freq).ordinal
can be applied inside their methods.TimeSeries_*
can be renamed to PeriodSeries_*
to make the intent more clear.The pandas period tick formatter does seem a bit nicer than matplotlib's in some situations (although worse in others), so I'm more hesitant than before to try to remove the period-related code entirely.
I linked a few more directly related issues, and there's a few more that are linked to the issues in my comment above, so about double-digit issues on the tracker due to the same underlying issue here. Since pandas is changing the time unit when plotting, any other matplotlib feature involving x-coordinates (mdates.DateFormatter
, format_coord
, tooltips, axvspan
or any other artist plotted with datetimes, shared x-axes, etc) may break. I've checked most of these issues are resolved by the fix in the previous comment.
For tracking, there's a few other classes of issues related to Period plotting:
_daily_finder()
, which constructs a full date range with the GCD frequency that can result in a very large array. Caching can solve the first issue, the second issue maybe could be solved by downsampling the frequency.Sorry if this is off-topic, but I'm posting this for those who may encounter the same issue. I'm using matplotlib and encountered this error. I was feeding unix epoch as seconds to the x axis, and then tried to use matplotlib.time to format the data, but I got the overflow error. It turns out that _from_ordinalf
expects "days after epoch" instead of "seconds after epoch," so I tried dividing the timestamp by 24 * 60 * 60
, then everything worked fine.
Didn't get a chance to try completing this since the previous comments, but here's a proof-of-concept with using DatetimeConverter
instead of PeriodConverter
for PeriodIndex
: https://github.com/pandas-dev/pandas/compare/main...azhu-tower:pandas:period-converter An example with this commit is the plot in this comment: https://github.com/pandas-dev/pandas/issues/55110#issuecomment-1869499403. IIRC had profiled that the commit introduces some slowness likely due to calling date2num
on each tick instead of across all ticks.
Anyone have a PR? I'm still encountering this issue with the latest version.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
See also: #18348
When setting a
matplotlib.dates.DateFormatter
, I get the following exception:Full Traceback:
Expected Behavior
No exceptions are raised, and the major ticks selected by the AutoDateLocator are formatted in hh:mm format.
Installed Versions