pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.74k stars 17.95k forks source link

BUG: int Overflow with DateFormatter #52895

Open joooeey opened 1 year ago

joooeey commented 1 year ago

Pandas version checks

Reproducible Example

import datetime as dt
from matplotlib import dates
import pandas as pd

dates.set_epoch("2023-01-01")  # suggested workaround doesn't help
start = dt.datetime(2022, 4, 4, 7)
end = dt.datetime(2022, 4, 4, 16)
index = pd.date_range(start, end, freq="s")
df = pd.DataFrame(index=index)
df["col"] = 1
ax = df.plot()
ax.xaxis.set_major_formatter(dates.DateFormatter("%H:%M"))  # breaks

Issue Description

See also: #18348

When setting a matplotlib.dates.DateFormatter, I get the following exception:

OverflowError: int too big to convert

Full Traceback:

``` runfile('/home/lukas/.config/spyder-py3/temp.py', wdir='/home/lukas/.config/spyder-py3') [autoreload of pandas.core.arrays.timedeltas failed: Traceback (most recent call last): File "/home/lukas/mambaforge/envs/moma/lib/python3.11/site-packages/IPython/extensions/autoreload.py", line 273, in check superreload(m, reload, self.old_objects) File "/home/lukas/mambaforge/envs/moma/lib/python3.11/site-packages/IPython/extensions/autoreload.py", line 471, in superreload module = reload(module) ^^^^^^^^^^^^^^ File "/home/lukas/mambaforge/envs/moma/lib/python3.11/importlib/__init__.py", line 169, in reload _bootstrap._exec(spec, module) File "", line 621, in _exec File "", line 940, in exec_module File "", line 241, in _call_with_frames_removed File "/home/lukas/mambaforge/envs/moma/lib/python3.11/site-packages/pandas/core/arrays/timedeltas.py", line 34, in from pandas._libs.tslibs.fields import ( ImportError: cannot import name 'get_timedelta_days' from 'pandas._libs.tslibs.fields' (/home/lukas/mambaforge/envs/moma/lib/python3.11/site-packages/pandas/_libs/tslibs/fields.cpython-311-x86_64-linux-gnu.so) ] [autoreload of pandas._testing failed: Traceback (most recent call last): File "/home/lukas/mambaforge/envs/moma/lib/python3.11/site-packages/IPython/extensions/autoreload.py", line 273, in check superreload(m, reload, self.old_objects) File "/home/lukas/mambaforge/envs/moma/lib/python3.11/site-packages/IPython/extensions/autoreload.py", line 471, in superreload module = reload(module) ^^^^^^^^^^^^^^ File "/home/lukas/mambaforge/envs/moma/lib/python3.11/importlib/__init__.py", line 169, in reload _bootstrap._exec(spec, module) File "", line 621, in _exec File "", line 940, in exec_module File "", line 241, in _call_with_frames_removed File "/home/lukas/mambaforge/envs/moma/lib/python3.11/site-packages/pandas/_testing/__init__.py", line 914, in cython_table = pd.core.common._cython_table.items() ^^^^^^^^^^^^^^ AttributeError: module 'pandas.core' has no attribute 'common' ] Error in callback (for post_execute): Traceback (most recent call last): File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/pyplot.py:120 in _draw_all_if_interactive draw_all() File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/_pylab_helpers.py:132 in draw_all manager.canvas.draw_idle() File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/backend_bases.py:2082 in draw_idle self.draw(*args, **kwargs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/backends/backend_agg.py:400 in draw self.figure.draw(self.renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/artist.py:95 in draw_wrapper result = draw(artist, renderer, *args, **kwargs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/artist.py:72 in draw_wrapper return draw(artist, renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/figure.py:3140 in draw mimage._draw_list_compositing_images( File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/image.py:131 in _draw_list_compositing_images a.draw(renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/artist.py:72 in draw_wrapper return draw(artist, renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/axes/_base.py:3064 in draw mimage._draw_list_compositing_images( File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/image.py:131 in _draw_list_compositing_images a.draw(renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/artist.py:72 in draw_wrapper return draw(artist, renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/axis.py:1376 in draw ticks_to_draw = self._update_ticks() File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/axis.py:1263 in _update_ticks major_labels = self.major.formatter.format_ticks(major_locs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/ticker.py:218 in format_ticks return [self(value, i) for i, value in enumerate(values)] File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/ticker.py:218 in return [self(value, i) for i, value in enumerate(values)] File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/dates.py:651 in __call__ result = num2date(x, self.tz).strftime(self.fmt) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/dates.py:544 in num2date return _from_ordinalf_np_vectorized(x, tz).tolist() File ~/mambaforge/envs/moma/lib/python3.11/site-packages/numpy/lib/function_base.py:2329 in __call__ return self._vectorize_call(func=func, args=vargs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/numpy/lib/function_base.py:2412 in _vectorize_call outputs = ufunc(*inputs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/dates.py:359 in _from_ordinalf np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us')) OverflowError: int too big to convert Traceback (most recent call last): File ~/mambaforge/envs/moma/lib/python3.11/site-packages/IPython/core/formatters.py:340 in __call__ return printer(obj) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/IPython/core/pylabtools.py:152 in print_figure fig.canvas.print_figure(bytes_io, **kw) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/backend_bases.py:2342 in print_figure self.figure.draw(renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/artist.py:95 in draw_wrapper result = draw(artist, renderer, *args, **kwargs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/artist.py:72 in draw_wrapper return draw(artist, renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/figure.py:3140 in draw mimage._draw_list_compositing_images( File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/image.py:131 in _draw_list_compositing_images a.draw(renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/artist.py:72 in draw_wrapper return draw(artist, renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/axes/_base.py:3064 in draw mimage._draw_list_compositing_images( File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/image.py:131 in _draw_list_compositing_images a.draw(renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/artist.py:72 in draw_wrapper return draw(artist, renderer) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/axis.py:1376 in draw ticks_to_draw = self._update_ticks() File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/axis.py:1263 in _update_ticks major_labels = self.major.formatter.format_ticks(major_locs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/ticker.py:218 in format_ticks return [self(value, i) for i, value in enumerate(values)] File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/ticker.py:218 in return [self(value, i) for i, value in enumerate(values)] File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/dates.py:651 in __call__ result = num2date(x, self.tz).strftime(self.fmt) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/dates.py:544 in num2date return _from_ordinalf_np_vectorized(x, tz).tolist() File ~/mambaforge/envs/moma/lib/python3.11/site-packages/numpy/lib/function_base.py:2329 in __call__ return self._vectorize_call(func=func, args=vargs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/numpy/lib/function_base.py:2412 in _vectorize_call outputs = ufunc(*inputs) File ~/mambaforge/envs/moma/lib/python3.11/site-packages/matplotlib/dates.py:359 in _from_ordinalf np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us')) OverflowError: int too big to convert
```

Expected Behavior

No exceptions are raised, and the major ticks selected by the AutoDateLocator are formatted in hh:mm format.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.11.3.final.0 python-bits : 64 OS : Linux OS-release : 5.4.0-135-generic Version : #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.0.1 numpy : 1.24.2 pytz : 2023.3 dateutil : 2.8.2 setuptools : 67.6.1 pip : 23.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : 8.12.0 pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.7.1 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None /home/lukas/mambaforge/envs/moma/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.")
MarcoGorelli commented 1 year ago

Thanks @joooeey for the report

this actually works if you use matplotlib directly

image

This confirms that the issue's indeed on the pandas.plotting side

I suspect (but haven't checked) that somewhere in pandas.plotting, there's an assumption that pandas datetimes still all use nano-second resolution

Contributions would be welcome, this one might not be too tricky for newcomers

PrimeF commented 1 year ago

take

PrimeF commented 1 year ago

From my understanding so far, the issue is that matplotlib requires the number of days from the epoch (01-01-1970) for the datetime conversion, while pandas passes a float value of seconds elapsed since the epoch.

The missing piece is to understand where exactly this float value in seconds comes from in pandas. Possibly from TimeSeries_DateFormatter.set_locs() but need to investigate further to better understand, which I'll continue to do in the next days.

MarcoGorelli commented 1 year ago

thanks for the investigation @PrimeF ! sounds good - take your time, no hurry

PrimeF commented 1 year ago

@MarcoGorelli: I've looked into the issue further and the Period values in seconds actually originate from: https://github.com/pandas-dev/pandas/blob/103d3b2bb8912c87f3ec11c1048e7e0a19225fef/pandas/plotting/_matplotlib/converter.py#L252-L255

In order to get the date in elapsed days since epoch as required by the operations performed in the DateFormatter (see https://github.com/matplotlib/matplotlib/blob/2e2d2d5f574ad43ba87fc893098345db5eb1eacc/lib/matplotlib/dates.py#L357) one could instead leverage this branch https://github.com/pandas-dev/pandas/blob/103d3b2bb8912c87f3ec11c1048e7e0a19225fef/pandas/plotting/_matplotlib/converter.py#L256-L257 and pass freq='D'.

However, this solutions seems to be highly "customised" to this issue, thus would require some specific conditions to avoid erroneously jumping in the branch and breaking other things down the line. I am actually not sure if this approach is the best one to follow, hence would appreciate any type of feedback.

MarcoGorelli commented 1 year ago

thank for the investigation, I'll take a look

ford--prefect commented 1 year ago

This bug has been discussed on the matplotlib side and there it sounded like being an issue with a different definition of the epoch. Maybe this helps. I'll also add a reference to this issue on the matplotlib issue.

azjps commented 11 months ago

I dug into this a bit since the PeriodConverter logic looked very strange to me (as an end-user), and it seems like the bulk of the PeriodConverter logic was written in 2012.

It seems to me possibly that PeriodConverter had only ever been intended to be used with freq="D". Otherwise, PeriodConverter._convert_1d() can return very different values depending on the frequency of the given periodic time series:

period = Period("2023-11-11 00:01.000", "T")
get_datevalue(period, "D")  # 19672, convention that matplotlib uses e.g. matplotlib.dates.date2num(pd.Timestamp("2023-11-11 00:01.000"))
get_datevalue(period, "T")  # 28327681
get_datevalue(period, "L")  # 169947861900

So if a small time frequency like "T" (seconds) is given, at some point matplotlib will call something like num2date() on view limits around 28327681 to format the x-labels, at which point an OverflowError will be raised. Contrast with DatetimeConverter which carefully calls mdates.date2num(), and so will return values around 19672.

and pass freq='D'.

Unfortunately changing the frequency from "T" to "D" here would lose resolution, e.g. all of the timestamps in the above example would get cast to 19672. It seems to me that the most reasonable thing to do would be to just defer to DatetimeConverter and possibly remove PeriodConverter entirely, although I could be missing some context where PeriodConverter is actually useful.

Another frustrating issue that led me here is that the pandas plotting converters will try to auto-infer a periodic frequency, even if the index to be plotted is a non-periodic DatetimeIndex:

import pandas as pd
import matplotlib.dates as mdates
ts = pd.Timestamp("2023-11-11 00:00:00", tz="UTC")
ts_plus_n = lambda x: ts + pd.Timedelta("1S") * x  # add x seconds
s = pd.Series([0, 1, 2, 3], index=[ts, ts_plus_n(1), ts_plus_n(2), ts_plus_n(99)])
# assert s.index.inferred_freq is None and s.iloc[0:3].index.inferred_freq == "S"

In [2]: s.iloc[0:4].plot().xaxis.set_major_formatter(mdates.DateFormatter("%X"))  # plots as expected
In [3]: s.iloc[0:3].plot().xaxis.set_major_formatter(mdates.DateFormatter("%X"))  # OverflowError!
MarcoGorelli commented 11 months ago

It seems to me that the most reasonable thing to do would be to just defer to DatetimeConverter and possibly remove PeriodConverter entirely, although I could be missing some context where PeriodConverter is actually useful.

I welcome removing anything Period-related, fancy trying this out and making a PR?

azjps commented 11 months ago

Seems like there is some history to this and its been debated a few times: #7670 #9053 #15071 #18768 #26253

15071 provides the best summary, the main benefit (or drawback, depending on one's viewpoint) of PeriodConverter is that some custom tick labelling can be done by pandas.

jorisvandenbossche: The question is what the reason is that we convert DatetimeIndex to periods for plotting. The reasons I can think of: Performance. Currently, the regular plotting is faster (so for a regular series ts.plot() is faster as ts.plot(x_compat=True)). However, I think this could be solved as most of the time is spent in converting the datetimes to floats (which should be vectorizable). Nicer tick label locations and formatting. This is a clear plus, our (convoluted) ticklocators and formatters give much nicer results as the default matplotlib (IMO)

Wes even replied at one point

I am not especially attached to it -- if you can unify / have a single code path for plotting without significantly changing functionality, sounds good to me.

IMHO the user should explicitly request the period-based formatting instead of pandas trying to automagically figure it out, but removing PeriodConverter will likely cause the default tick formatting for periodic time series to change, so it might still be a bit controversial.

azjps commented 10 months ago

I think the following could be done as first steps:

  1. No longer infer periodic frequency when plotting DatetimeIndex. It seems to drive more confusion than help: for example the issues linked above and a few stackoverflow qs like https://stackoverflow.com/questions/21189954/. It's sufficient to remove this branch (can emit a FutureWarning if a softer transition is desired): https://github.com/pandas-dev/pandas/blob/157631d97840b7918eec4c8b40bd9c24b25771a7/pandas/plotting/_matplotlib/timeseries.py#L268
  2. In pandas/plotting/_matplotlib/converter.py, we can change PeriodConverter._convert_1d https://github.com/pandas-dev/pandas/blob/157631d97840b7918eec4c8b40bd9c24b25771a7/pandas/plotting/_matplotlib/converter.py#L265 to call mdates.date2num() (for example by calling DatetimeConverter) so that all time series plots will be plotted with the x value units that matplotlib expects. To soften the transition and allow the period tick formatting code in TimeSeries_Date{Locator,Formatter} to still work, the reverse mapping lambda x: Period(mdates.date2num(x), freq=self.freq).ordinal can be applied inside their methods.
  3. TimeSeries_* can be renamed to PeriodSeries_* to make the intent more clear.

The pandas period tick formatter does seem a bit nicer than matplotlib's in some situations (although worse in others), so I'm more hesitant than before to try to remove the period-related code entirely.

azjps commented 10 months ago

I linked a few more directly related issues, and there's a few more that are linked to the issues in my comment above, so about double-digit issues on the tracker due to the same underlying issue here. Since pandas is changing the time unit when plotting, any other matplotlib feature involving x-coordinates (mdates.DateFormatter, format_coord, tooltips, axvspan or any other artist plotted with datetimes, shared x-axes, etc) may break. I've checked most of these issues are resolved by the fix in the previous comment.

For tracking, there's a few other classes of issues related to Period plotting:

TonalidadeHidrica commented 5 months ago

Sorry if this is off-topic, but I'm posting this for those who may encounter the same issue. I'm using matplotlib and encountered this error. I was feeding unix epoch as seconds to the x axis, and then tried to use matplotlib.time to format the data, but I got the overflow error. It turns out that _from_ordinalf expects "days after epoch" instead of "seconds after epoch," so I tried dividing the timestamp by 24 * 60 * 60, then everything worked fine.

azhu-tower commented 5 months ago

Didn't get a chance to try completing this since the previous comments, but here's a proof-of-concept with using DatetimeConverter instead of PeriodConverter for PeriodIndex: https://github.com/pandas-dev/pandas/compare/main...azhu-tower:pandas:period-converter An example with this commit is the plot in this comment: https://github.com/pandas-dev/pandas/issues/55110#issuecomment-1869499403. IIRC had profiled that the commit introduces some slowness likely due to calling date2num on each tick instead of across all ticks.

ba05 commented 4 months ago

Anyone have a PR? I'm still encountering this issue with the latest version.