pandas-dev / pandas


bug: plot two lines, unordered xlabels with type str #18687

Open boeddeker opened 6 years ago

boeddeker commented 6 years ago

Code Sample, a copy-pastable example if possible

%matplotlib inline

import pandas as pd
df1 = pd.DataFrame([{'x': '1', 'y': 1}, {'x': '2', 'y': 2}])
df2 = pd.DataFrame([{'x': '2', 'y': 3}, {'x': '1', 'y': 4}])

print(df1)
#    x  y
# 0  1  1
# 1  2  2
print(df2)
#    x  y
# 0  2  3
# 1  1  4

# Wrong plot
ax = None
ax = df1.plot('x', 'y', ax=ax)
ax = df2.plot('x', 'y', ax=ax)

# Correct plot
ax = None
ax = df1.sort_values(by=['x']).plot('x', 'y', ax=ax)
ax = df2.sort_values(by=['x']).plot('x', 'y', ax=ax)

Problem description

I want to plot multiple DataFrames in one graph. The x values are strings, and their order differs between the two DataFrames.

The first plot draws the line

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8
pandas: 0.21.0
pytest: 3.3.0
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger commented 6 years ago

Strange, I'm not sure what's going on. You're welcome to take a look in pandas/plotting/_core.py if you're interested :)

boeddeker commented 6 years ago

Thanks for pointing me to the file. I already took a look with PyCharm, but I couldn't locate the bug.

Licht-T commented 6 years ago

I am working on this. The current implementation ignores the order of a string Index. https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L578

E.g., this produces the same result:

%matplotlib inline
import pandas as pd

df1 = pd.DataFrame([{'x': 'a', 'y': 1}, {'x': 'b', 'y': 2}])
df2 = pd.DataFrame([{'x': 'b', 'y': 3}, {'x': 'a', 'y': 4}])

ax = None
ax = df1.plot('x', 'y', ax=ax)
ax = df2.plot('x', 'y', ax=ax)

Licht-T commented 6 years ago

I see only one solution: converting a string Index into a numeric one where possible. I don't know whether pandas should support such a conversion internally.
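
A minimal sketch of the conversion described above, done on the user side rather than inside pandas. The helper name `to_numeric_x` is hypothetical and not part of pandas; it relies only on `pd.to_numeric`, and it assumes the x column holds either number-like strings or arbitrary labels.

```python
import pandas as pd


def to_numeric_x(df, x):
    """Return a copy of df whose column x is numeric, if it can be parsed.

    Hypothetical helper for illustration; not a pandas API.
    """
    out = df.copy()
    try:
        out[x] = pd.to_numeric(out[x])
    except (ValueError, TypeError):
        # Labels such as 'a' or 'b' cannot be parsed; leave them untouched.
        pass
    return out


df2 = pd.DataFrame([{'x': '2', 'y': 3}, {'x': '1', 'y': 4}])
converted = to_numeric_x(df2, 'x')
unchanged = to_numeric_x(pd.DataFrame({'x': ['b', 'a'], 'y': [3, 4]}), 'x')
```

With a numeric x column, `converted.plot('x', 'y', ax=ax)` should place each point at its numeric position instead of its row position. Labels that are not number-like pass through unchanged, so this does not address the general string case.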

boeddeker commented 6 years ago

In the case where I hit the problem, I had strings that are not convertible to floats. Your example highlights the error better.

Converting strings to floats would reduce how often this bug occurs, but handling strings in general may need another solution.

An idea: the strings (labels) can be stored in xticklabels. If the labels are strings, _get_xticks reads the existing xticklabels, appends any missing labels, and calculates the xticks from the combined list. This would require access to the ax object in _get_xticks.

boeddeker commented 6 years ago

I now have example code that demonstrates my idea.

%matplotlib inline
import pandas as pd

df1 = pd.DataFrame([{'x': 'a', 'y': 1}, {'x': 'b', 'y': 2}])
df2 = pd.DataFrame([{'x': 'b', 'y': 3}, {'x': 'a', 'y': 4}])

def df_xstr_plot(df, x=None, y=None, ax=None):
    df = df.copy()

    if ax is not None:
        # Reuse the labels that are already present on the axes
        tick_labels = [label.get_text() for label in ax.get_xticklabels()]
    else:
        tick_labels = []

    for new_tick_label in df[x]:
        if new_tick_label not in tick_labels:
            tick_labels.append(new_tick_label)

    # Map each string label to its integer position on the x axis
    mapping = {tick_label: i for i, tick_label in enumerate(tick_labels)}
    df[x] = df[x].map(mapping)

    ax = df.plot(x, y, ax=ax)

    # Assign the correct xticklabels
    ax.set_xticks(list(range(len(tick_labels))))
    ax.set_xticklabels(tick_labels)

    return ax

ax = None
ax = df_xstr_plot(df1, 'x', 'y', ax=ax)
ax = df_xstr_plot(df2, 'x', 'y', ax=ax)  # correct

ax = None
ax = df1.plot('x', 'y', ax=ax)
ax = df2.plot('x', 'y', ax=ax)  # wrong

TomAugspurger commented 6 years ago

This seems a bit complex. I don't think pandas should be doing anything special here; we should rely on matplotlib to handle all the string <-> position logic.

boeddeker commented 6 years ago

You are right, I forgot to test whether matplotlib can handle strings. So the solution would be to add a further branch to _get_xticks for strings that does not convert the strings to int.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame([{'x': 'a', 'y': 1}, {'x': 'b', 'y': 2}])
df2 = pd.DataFrame([{'x': 'b', 'y': 3}, {'x': 'a', 'y': 4}])

def df_xstr_plot(df, x=None, y=None, ax=None):
    if ax is None:
        figure, ax = plt.subplots(1, 1)

    ax.plot(df[x], df[y])
    return ax

ax = None
ax = df_xstr_plot(df1, 'x', 'y', ax=ax)
ax = df_xstr_plot(df2, 'x', 'y', ax=ax)  # correct

ax = None
ax = df1.plot('x', 'y', ax=ax)
ax = df2.plot('x', 'y', ax=ax)  # wrong