pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Misaligned X axis when plotting datetime indexed series with regular and irregular time index #29705

Open rhkarls opened 4 years ago

rhkarls commented 4 years ago

Code Sample, a copy-pastable example if possible

# Create sample data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

p1d = {'2016-08-10 10:00:00':     2.290438,
'2016-10-12 08:20:00':     1.314112,
'2016-11-15 12:45:00':     0.213702,
'2017-04-27 18:30:00':     0.256794,
'2017-05-30 11:10:00':     4.112614,
'2017-07-19 09:18:00':    10.600000}

p1 = pd.Series(p1d)
p1.index=pd.to_datetime(p1.index)

p2d = {'2016-08-09 09:15:00':    1.57970,
'2016-10-11 13:15:00':    0.73000,
'2017-04-27 12:30:00':    0.15900,
'2017-05-31 16:10:00':    1.65440,
'2018-05-24 12:00:00':    0.79260,
'2018-10-25 11:20:00':    0.34500}

p2 = pd.Series(p2d)
p2.index=pd.to_datetime(p2.index)

p3d = {'2016-11-15 09:00:00':    0.094900,
'2017-04-28 11:10:00':    0.055600,
'2017-05-30 16:00:00':    0.659600,
'2017-06-09 17:15:00':    0.300200,
'2018-05-24 16:45:00':    0.329800,
'2018-09-18 15:40:00':    0.200452}

p3 = pd.Series(p3d)
p3.index = pd.to_datetime(p3.index)

ts_index = pd.date_range('2016-01-01','2018-12-31',freq='H')

ts1 = pd.Series(index=ts_index, data=np.random.uniform(low=0.2,high=10,
                                                          size=ts_index.size))
ts2 = pd.Series(index=ts_index, data=np.random.uniform(low=0.1,high=3,
                                                          size=ts_index.size))
ts3 = pd.Series(index=ts_index, data=np.random.uniform(low=0.05,high=1,
                                                          size=ts_index.size))

# plot
fig_ts, axs_ts = plt.subplots(3,1,sharex=False)

ts1.plot(ax=axs_ts[0])
p1.plot(ax=axs_ts[0],style='o')

ts2.plot(ax=axs_ts[1])
p2.plot(ax=axs_ts[1],style='o')

ts3.plot(ax=axs_ts[2])
p3.plot(ax=axs_ts[2],style='o')

fig_ts.tight_layout()

[image: figure produced by the code above]

Problem description

Data on the second axis is not plotted at the correct X axis positions: the irregular point data is shifted relative to the regular data. This seems to be caused by the first entry of the irregular time series having a timestamp that does not occur in the regular time series.

The following changes cause the data to be plotted correctly, I assume because the first entries then have matching timestamps:

Changing the plot order for the second axis, so that the irregular time series is plotted before the regular one, also causes the data to be plotted at the correct place along the X axis:

p2.plot(ax=axs_ts[1],style='o', zorder=10)
ts2.plot(ax=axs_ts[1], zorder=1)

However, this causes other issues, such as different X axis labels, and it also fails when using sharex=True.
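A possible way around this, offered as a sketch and not confirmed anywhere in this thread: passing x_compat=True to Series.plot should make pandas skip its frequency-aware time series handling and pass plain datetimes to matplotlib, so both series share one date scale regardless of plot order. The shortened copy of p2, the Timedelta-based hourly frequency and the Agg backend are my own additions, only there so that the snippet runs headless on different pandas versions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption: no display available)
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Regular hourly series; the frequency is given as a Timedelta to avoid
# frequency-alias differences between pandas versions
ts_index = pd.date_range("2016-01-01", "2018-12-31", freq=pd.Timedelta(hours=1))
ts2 = pd.Series(np.random.uniform(low=0.1, high=3, size=ts_index.size),
                index=ts_index)

# Shortened copy of the irregular series p2 from the report
p2 = pd.Series(
    [1.5797, 0.73, 0.159],
    index=pd.to_datetime(["2016-08-09 09:15:00",
                          "2016-10-11 13:15:00",
                          "2017-04-27 12:30:00"]),
)

fig, ax = plt.subplots()
# x_compat=True asks pandas not to use its period-based x axis
ts2.plot(ax=ax, x_compat=True)
p2.plot(ax=ax, style="o", x_compat=True)

# The axis limits should now be in matplotlib date units and span ts2
lo, hi = ax.get_xlim()
first = mdates.date2num(ts_index[0].to_pydatetime())
last = mdates.date2num(ts_index[-1].to_pydatetime())
print(lo <= first, hi >= last)
```

The trade-off is that pandas' own date tick formatting for regular time series is not used in this mode, so the tick labels will look like those of a plain matplotlib date axis.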

Possibly related issues:

#11574 - Misaligned x axis using shared axis (one series plotted per axis), not when plotted on the same axis as here.

#18571 - Misaligned x axis using twinx(), possibly same issue as here?

Expected Output

Plotting the data at the correct x axis coordinates.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None
pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.4
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6

kurtforrester commented 4 years ago

I think I too have encountered the same/similar issue.

I have irregular transactional data with a date timestamp. I group and aggregate the data to produce mean balances at uniform intervals. When series with different timestamps are plotted on the same axes, some of the data is drawn at an offset from its true position.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

tr = pd.date_range("2014-04-01", "2019-12-31")

n = 300
ts = np.random.choice(tr, n)
tx = np.random.sample(size=n) * np.random.choice(
    (10, -15, -1, -0.5), n, p=[0.1, 0.025, 0.8, 0.075]
)

df = pd.DataFrame(data={"date": ts, "tx": tx})
df = df.sort_values(by=["date"]).reset_index()
df["balance"] = df["tx"].cumsum()

fig, axs = plt.subplots(2, 2)

mbalance = df.groupby(pd.Grouper(key="date", freq="M"))[["balance"]].mean()
print("monthly")
print(mbalance)
mbalance.plot(linestyle="none", marker="o", title="month", ax=axs[0, 0])

# company year ending is March
qbalance = df.groupby(pd.Grouper(key="date", freq="Q-MAR"))[["balance"]].mean()
print("quarterly")
print(qbalance)
qbalance.plot(linestyle="none", marker="s", title="quarter", ax=axs[0, 1])

abalance = df.groupby(pd.Grouper(key="date", freq="A-MAR"))[["balance"]].mean()
print("annually")
print(abalance)
abalance.plot(linestyle="none", marker="d", title="annum", ax=axs[1, 0])

mbalance.plot(linestyle="none", marker="o", ax=axs[1, 1])
qbalance.plot(linestyle="none", marker="s", ax=axs[1, 1])
abalance.plot(linestyle="none", marker="d", ax=axs[1, 1])

plt.legend()

When plotted onto individual subplot axes the data is rendered correctly (each is as expected). When the datasets are overlaid the quarterly and annual data are offset from their true position.
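One possible explanation for the offset, which is my assumption about the plotting internals and not something confirmed in this thread: if the axis already holds monthly data, a quarterly index plotted afterwards may be converted to the axis's monthly frequency anchored at the start of each period. The sketch below only shows what that conversion does to a single quarter-end date; the date 2014-06-30 is an arbitrary illustration.

```python
import pandas as pd

# A quarter-end timestamp for a fiscal year ending in March
stamp = pd.Timestamp("2014-06-30")
quarter = pd.Period(stamp, freq="Q-MAR")

# The same quarter expressed at monthly resolution, anchored at either edge
as_month_start = quarter.asfreq("M", how="S")
as_month_end = quarter.asfreq("M", how="E")

# Number of months between the two anchors
shift_in_months = as_month_end.ordinal - as_month_start.ordinal
print(as_month_start, as_month_end, shift_in_months)
```

If the start anchor is what ends up on the axis, a quarterly marker would sit two months before its quarter-end date, and by the same reasoning an annual marker would sit eleven months early, which would be consistent with only the quarterly and annual data looking shifted.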

kurtforrester commented 4 years ago

A further example to present the issue. When the original data is plotted first there is no issue with alignment (all display correctly).

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

tr = pd.date_range("2014-04-01", "2019-12-31")

n = 300
ts = np.random.choice(tr, n)
tx = np.random.sample(size=n) * np.random.choice(
    (10, -15, -1, -0.5), n, p=[0.1, 0.025, 0.8, 0.075]
)

df = pd.DataFrame(data={"date": ts, "tx": tx})
df = df.sort_values(by=["date"]).reset_index()
df["balance"] = df["tx"].cumsum()

fig, axs = plt.subplots(1, 2)

mbalance = df.groupby(pd.Grouper(key="date", freq="M"))[["balance"]].mean()
print("monthly")
print(mbalance)

qbalance = df.groupby(pd.Grouper(key="date", freq="Q-MAR"))[["balance"]].mean()
print("quarterly")
print(qbalance)

abalance = df.groupby(pd.Grouper(key="date", freq="A-MAR"))[["balance"]].mean()
print("annually")
print(abalance)

df.plot(x="date", y="balance", marker="+", ax=axs[0])
mbalance.plot(linestyle="none", marker="o", ax=axs[0])
qbalance.plot(linestyle="none", marker="s", ax=axs[0])
abalance.plot(linestyle="none", marker="d", ax=axs[0])

plt.legend()

mbalance.plot(linestyle="none", marker="o", ax=axs[1])
qbalance.plot(linestyle="none", marker="s", ax=axs[1])
abalance.plot(linestyle="none", marker="d", ax=axs[1])

plt.legend()
rhkarls commented 4 years ago

Since the order of plotting clearly matters, could the cause be how pandas decides to represent dates on a numeric scale?

Here's another example showing the problem and how the order matters. This time it uses two regularly spaced series, so the title I originally gave this issue might not be accurate. It seems to happen when pandas deals with known frequencies, either defined on the index or inferred by pandas. Why does it convert the dates to different numeric values depending on the order of plotting?

import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import pandas as pd

N = 1000
y = np.linspace(0, 10, N)

base_dt = dt.datetime(2000,1,1)
dt_range = [base_dt + dt.timedelta(hours=x) for x in range(N)]
s = pd.Series(index=dt_range, data=y*2)

# Alignment is good when plotting with pyplot first
plt.figure()
plt.plot(dt_range,y, color='b')
s.plot(color='r')

# Alignment is bad when plotting with pandas.plot() first
plt.figure()
s.plot(color='r')
plt.plot(dt_range,y, color='b') # ends up in the 2050's

The xlim values differ greatly depending on the plotting order:

plt.figure()
s.plot(color='r')
plt.xlim() # (262968.0, 263967.0)

plt.figure()
plt.plot(dt_range,y, color='b')
s.plot(color='r')
plt.xlim() # (730120.0, 730161.625)
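The two left-hand xlim values above can be reproduced with plain datetime arithmetic, which suggests (as an interpretation, not a statement about the pandas source) that the pandas-first axis counts hours since 1970-01-01 while the pyplot-first axis counts days in matplotlib's date units as of the matplotlib 3.1.1 used here. The variable names are only for illustration.

```python
import datetime as dt

base = dt.datetime(2000, 1, 1)   # first timestamp in the example above
epoch = dt.datetime(1970, 1, 1)

# Whole hours elapsed since 1970-01-01:
# equals the left xlim when s.plot() is called first
hours_since_1970 = int((base - epoch).total_seconds() // 3600)

# Proleptic Gregorian ordinal (day count with 0001-01-01 as day 1):
# equals the left xlim when plt.plot() is called first
days_ordinal = base.toordinal()

# Reading the day-based number as if it were hours since 1970
misplaced = epoch + dt.timedelta(hours=days_ordinal)

print(hours_since_1970, days_ordinal, misplaced.year)
# 262968 730120 2053
```

Under that reading, the blue line in the pandas-first figure is drawn at day-based coordinates on an hour-based axis, which would put it in 2053 and match the "2050's" comment in the code.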
rhkarls commented 4 years ago

I came across the three-year-old issue #15071, which seems to be related to this one.