pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API/CLN: timeseries plotting #15071

Open jorisvandenbossche opened 7 years ago

jorisvandenbossche commented 7 years ago

Inspired by the timedelta plotting issue, I thought to look again at our timeseries plotting machinery. We know it is quite complex, and because of that several bugs, inconsistencies and unexpected behaviours exist (eg different results depending on the order in which several series are plotted, wrong results when combining different types of time series, among others https://github.com/pandas-dev/pandas/issues/9053, https://github.com/pandas-dev/pandas/issues/6608, https://github.com/pandas-dev/pandas/issues/14322, ..). There has been some discussion related to this on the tsplot refactor PR of @sinhrks https://github.com/pandas-dev/pandas/pull/7670 (not merged).

One of the reasons for the complexity is the distinction between 'irregular' and 'regular' time series (see eg https://github.com/pandas-dev/pandas/pull/7670#issuecomment-149235874):

So part of the problems and confusion comes from the differences between the two (eg different label formatting) and from combining them, which leads to the question:

Do we need both types of timeseries plotting?

The question is why we convert a DatetimeIndex to periods for plotting. The reasons I can think of:

Other reasons that I am missing?

But there are also clear drawbacks. Apart from the things mentioned above, you sometimes get clearly wrong behaviour: see eg the plot in https://github.com/pandas-dev/pandas/pull/7670#issuecomment-57410361. In this case, dates that fall somewhere within a month are snapped to the month edges when a regular series with monthly frequency is plotted first. Another example of 'wrong' plotting is a yearly series (but with freq 'A-dec', so end of year) plotted at the beginning of a year. See http://nbviewer.jupyter.org/gist/jorisvandenbossche/c0c68dce2fa02f1dfc4a8c343ec88cb6. But of course, in many cases, this behaviour can also be the desired behaviour.
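
To make the snapping concrete, here is a minimal sketch that uses only the public `to_period` / `to_timestamp` methods. It is not the plotting code itself, and the example dates are made up for illustration; it only shows the kind of information that is lost once mid-month timestamps are expressed as monthly periods.

```python
import pandas as pd

# Hypothetical irregular timestamps that fall somewhere inside a month.
irregular = pd.DatetimeIndex(["2014-01-10", "2014-02-20", "2014-03-15"])

# Expressing them as monthly periods drops the day-of-month information.
as_periods = irregular.to_period("M")

# Converting back gives the start of each month, not the original dates.
snapped = as_periods.to_timestamp()

print(as_periods)
print(snapped)
```

Whether the position inside the period is drawn at the start or the end of the month is a separate choice; the point is only that the original day can no longer be recovered.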

But do we need both? Would we want, if possible, to unify into one approach?

Can we unify both approaches?

Can we just use the matplotlib floats for timeseries plotting? Or always use the period-based machinery?
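
As a rough illustration of the two representations in question, the sketch below puts them side by side: matplotlib's float date numbers via `matplotlib.dates.date2num`, and the integer ordinals of a `PeriodIndex` via `asi8`. The example ranges are made up, and this is only an assumption about what the two x-axis encodings look like, not a description of the internal plotting code.

```python
import numpy as np
import pandas as pd
import matplotlib.dates as mdates

# Matplotlib's native date representation: floats counting days from an epoch.
daily = pd.date_range("2017-01-01", periods=4, freq="D")
as_floats = mdates.date2num(daily.to_pydatetime())

# Period representation: one integer ordinal per period of the given frequency.
monthly = pd.period_range("2017-01", periods=4, freq="M")
as_ordinals = monthly.asi8

print(np.diff(as_floats))    # spacing between consecutive days, in float days
print(np.diff(as_ordinals))  # spacing between consecutive months, in periods
```

With floats, any timestamp has an exact position on the axis; with ordinals, every timestamp inside the same period shares one position, which is where the snapping comes from.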

cc @pandas-dev/pandas-core (especially @TomAugspurger and @sinhrks, I think you have been most involved in the plotting code recently, or @wesm for a historical viewpoint) I know it's a long issue, but if you could give it a read and share your thoughts on this, that would be very welcome!

jreback commented 7 years ago

cc @tcaswell cc @mdboom

wesm commented 7 years ago

As I recall, the time series plotting with periods originated in scikits.timeseries. I am not especially attached to it -- if you can unify / have a single code path for plotting without significantly changing functionality, sounds good to me.