Open rhkarls opened 4 years ago
I think I too have encountered the same/similar issue.
I have irregular transactional data with a date timestamp. I group and aggregate the data to produce mean balances at uniform intervals (monthly, quarterly, annual). When series with different frequencies are plotted on the same axes, some of them are drawn offset from their true dates.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
tr = pd.date_range("2014-04-01", "2019-12-31")
n = 300
ts = np.random.choice(tr, n)
tx = np.random.sample(size=n) * np.random.choice(
    (10, -15, -1, -0.5), n, p=[0.1, 0.025, 0.8, 0.075]
)
df = pd.DataFrame(data={"date": ts, "tx": tx})
df = df.sort_values(by=["date"]).reset_index()
df["balance"] = df["tx"].cumsum()
fig, axs = plt.subplots(2, 2)
mbalance = df.groupby(pd.Grouper(key="date", freq="M"))[["balance"]].mean()
print("monthly")
print(mbalance)
mbalance.plot(linestyle="none", marker="o", title="month", ax=axs[0, 0])
# company year ending is March
qbalance = df.groupby(pd.Grouper(key="date", freq="Q-MAR"))[["balance"]].mean()
print("quarterly")
print(qbalance)
qbalance.plot(linestyle="none", marker="s", title="quarter", ax=axs[0, 1])
abalance = df.groupby(pd.Grouper(key="date", freq="A-MAR"))[["balance"]].mean()
print("annually")
print(abalance)
abalance.plot(linestyle="none", marker="d", title="annum", ax=axs[1, 0])
mbalance.plot(linestyle="none", marker="o", ax=axs[1, 1])
qbalance.plot(linestyle="none", marker="s", ax=axs[1, 1])
abalance.plot(linestyle="none", marker="d", ax=axs[1, 1])
plt.legend()
When each dataset is plotted onto its own subplot axes, it is rendered correctly (each is as expected). When the datasets are overlaid on one axes, the quarterly and annual data are offset from their true positions.
Here is a further example of the issue. When the original data is plotted first, there is no issue with alignment (all series display correctly).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
tr = pd.date_range("2014-04-01", "2019-12-31")
n = 300
ts = np.random.choice(tr, n)
tx = np.random.sample(size=n) * np.random.choice(
    (10, -15, -1, -0.5), n, p=[0.1, 0.025, 0.8, 0.075]
)
df = pd.DataFrame(data={"date": ts, "tx": tx})
df = df.sort_values(by=["date"]).reset_index()
df["balance"] = df["tx"].cumsum()
fig, axs = plt.subplots(1, 2)
mbalance = df.groupby(pd.Grouper(key="date", freq="M"))[["balance"]].mean()
print("monthly")
print(mbalance)
qbalance = df.groupby(pd.Grouper(key="date", freq="Q-MAR"))[["balance"]].mean()
print("quarterly")
print(qbalance)
abalance = df.groupby(pd.Grouper(key="date", freq="A-MAR"))[["balance"]].mean()
print("annually")
print(abalance)
df.plot(x="date", y="balance", marker="+", ax=axs[0])
mbalance.plot(linestyle="none", marker="o", ax=axs[0])
qbalance.plot(linestyle="none", marker="s", ax=axs[0])
abalance.plot(linestyle="none", marker="d", ax=axs[0])
plt.legend()
mbalance.plot(linestyle="none", marker="o", ax=axs[1])
qbalance.plot(linestyle="none", marker="s", ax=axs[1])
abalance.plot(linestyle="none", marker="d", ax=axs[1])
plt.legend()
Since the order of plotting clearly matters, could the cause be how pandas decides to represent dates on a numeric scale?
Here's another example showing the problem and how the order matters. This time it uses two regularly spaced series, so the initial title I used might not be so accurate. It seems to happen when pandas deals with known frequencies, either defined on the index or when it's able to infer the frequency. Why does it convert the dates to different numeric values depending on the order of plotting?
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import pandas as pd
N = 1000
y = np.linspace(0, 10, N)
base_dt = dt.datetime(2000,1,1)
dt_range = [base_dt + dt.timedelta(hours=x) for x in range(N)]
s = pd.Series(index=dt_range, data=y*2)
# Alignment is good when plotting with pyplot first
plt.figure()
plt.plot(dt_range,y, color='b')
s.plot(color='r')
# Alignment is bad when plotting with pandas.plot() first
plt.figure()
s.plot(color='r')
plt.plot(dt_range,y, color='b') # ends up in the 2050's
Looking at the xlim values for the two plotting orders shows very different numbers:
plt.figure()
s.plot(color='r')
plt.xlim() # (262968.0, 263967.0)
plt.figure()
plt.plot(dt_range,y, color='b')
s.plot(color='r')
plt.xlim() # (730120.0, 730161.625)
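One possible reading of these two ranges, offered as a sketch rather than a confirmed diagnosis: the first pair matches whole hours since 1970-01-01 (what an hourly period ordinal would be), and the second matches days counted from 0001-01-01 as day 1 (the date numbering that matplotlib releases of that era used). Both interpretations are assumptions; the arithmetic below only checks that they reproduce the numbers quoted above, using nothing but the standard library.

```python
from datetime import datetime, timedelta

first = datetime(2000, 1, 1)           # first timestamp in the example above
last = first + timedelta(hours=999)    # last of the N = 1000 hourly samples

# Assumption: pandas places a regular hourly series on period ordinals,
# i.e. whole hours elapsed since 1970-01-01.
epoch = datetime(1970, 1, 1)
hours_first = (first - epoch) // timedelta(hours=1)
hours_last = (last - epoch) // timedelta(hours=1)
print(hours_first, hours_last)         # 262968 263967

# Assumption: matplotlib counted days with 0001-01-01 as day 1,
# which is the same numbering as datetime.toordinal().
days_first = first.toordinal()
days_last = days_first + (last - first) / timedelta(days=1)
print(days_first, days_last)           # 730120 730161.625

# If a day-based number is drawn on an hour-based axis, it is read as
# "730120 hours after 1970-01-01", which lands in the 2050s.
misplaced = epoch + timedelta(hours=days_first)
print(misplaced.year)                  # 2053
```

If that reading is right, whichever call touches the axes first fixes the numeric scale, and the second call's data is then drawn in the other unit without being converted.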
I came across the 3-year-old issue #15071, which seems to be related to this one.
Code Sample, a copy-pastable example if possible
Problem description
Data on the second axis is not plotted correctly on the X axis. The irregular point data is shifted relative to the other data. It seems this is caused by the first entry of the irregular timeseries and the first entry of the regular timeseries having different timestamps.
The following change causes the data to be plotted correctly, I assume because the first entries then have matching timestamps:
Changing the plot order for the second axis, so that the irregular timeseries is plotted before the regular one, also places the data at the correct position along the X axis:
p2.plot(ax=axs_ts[1],style='o', zorder=10)
ts2.plot(ax=axs_ts[1], zorder=1)
This does, however, cause other issues such as different X axis labels, and the method also fails when sharex=True is used.
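Another workaround that may be worth trying is the `x_compat=True` plot option, which pandas documents as a way to suppress its automatic time-series axis handling for alignment purposes. The sketch below is not the code from this report: the `regular` and `irregular` series are hypothetical stand-ins, and whether the option helps in a particular pandas/matplotlib combination is an assumption to verify.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data: a regular hourly series ...
base = dt.datetime(2000, 1, 1)
regular_index = [base + dt.timedelta(hours=h) for h in range(1000)]
regular = pd.Series(np.linspace(0, 10, 1000), index=regular_index)

# ... and a few irregular points whose timestamps do not coincide with it.
irregular_index = [base + dt.timedelta(hours=h, minutes=17) for h in (3, 250, 611, 980)]
irregular = pd.Series([1.0, 4.0, 6.0, 9.0], index=irregular_index)

fig, ax = plt.subplots()
# x_compat=True asks pandas to use plain matplotlib date handling for both
# plots, so neither call switches the axis to a frequency-based scale.
regular.plot(ax=ax, x_compat=True, zorder=1)
irregular.plot(ax=ax, style="o", x_compat=True, zorder=10)

# Sanity check: both series should fall inside one date-based x range.
lo, hi = ax.get_xlim()
starts = mdates.date2num([regular_index[0], irregular_index[0]])
ends = mdates.date2num([regular_index[-1], irregular_index[-1]])
aligned = bool(lo <= starts.min() and ends.max() <= hi and (hi - lo) < 60)
print(aligned)
```

The trade-off is that pandas' own tick formatting for time series is not applied, so the X axis labels will look like ordinary matplotlib date ticks.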
Possibly related issues:
#11574 - Misaligned x axis when using a shared axis (one series plotted per axis), not when plotted on the same axis as here.
#18571 - Misaligned x axis when using twinx(), possibly the same issue as here?
Expected Output
Plotting the data at the correct x axis coordinates.
Output of pd.show_versions()