mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License
12.18k stars, 1.89k forks

Relplot refline error when using DataFrames with duplicate indices #3690

Open zacharygibbs opened 1 month ago

zacharygibbs commented 1 month ago

This error occurs when calling relplot and then adding a refline. From my investigation, it appears that relplot is duplicating the data. In this example, I concatenated my 3 dataframes without 'ignore_index', so the input data contains duplicate index values.
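A minimal sketch of how the duplicate index labels arise, using small stand-in frames rather than the data from this report (the names `frames` and `stacked` are illustrative only). Each frame gets a default 0..n-1 index, and `pd.concat` keeps those labels unless `ignore_index=True` is passed:

```python
import numpy as np
import pandas as pd

# Three small frames, each with the default RangeIndex 0..n-1.
n = 4
frames = [pd.DataFrame({"x": np.arange(n), "origin": k}) for k in (1, 2, 3)]

# Without ignore_index, concat keeps each frame's own labels,
# so every label 0..n-1 appears once per frame.
stacked = pd.concat(frames)

print(stacked.index.is_unique)                 # False
print(int(stacked.index.duplicated().sum()))   # 8 (12 rows, 4 distinct labels)
```

Checking `df.index.is_unique` before plotting is a quick way to spot this condition.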

The problem goes away when I use ignore_index or pass the data in as df.reset_index(). However, the error message was not useful in discovering this! After tracking down the relplot source code, it appears the problem is related to the grid_data merge at the end of the function. I was able to solve this by skipping the merge if all of the columns are already present. I have submitted pull request #3692.
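The two workarounds mentioned above can be sketched as follows, on toy frames rather than the original data (the names `a`, `b`, `fixed_concat`, and `fixed_reset` are illustrative). Note that `reset_index()` without `drop=True` keeps the old labels as an extra "index" column, so `drop=True` is used here as an assumption about what is wanted:

```python
import pandas as pd

a = pd.DataFrame({"x": [0.1, 0.2]})
b = pd.DataFrame({"x": [0.3, 0.4]})

# Option 1: renumber the rows while concatenating.
fixed_concat = pd.concat(
    [a.assign(origin=1), b.assign(origin=2)], ignore_index=True
)

# Option 2: concatenate first, then renumber.
# drop=True discards the old labels instead of adding an "index" column.
dup = pd.concat([a.assign(origin=1), b.assign(origin=2)])
fixed_reset = dup.reset_index(drop=True)

print(fixed_concat.index.is_unique, fixed_reset.index.is_unique)
```

Either result has a unique 0..N-1 index and can be passed to relplot in place of the original frame.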

The error was: ValueError: operands could not be broadcast together with shapes (45000,) (15000,) (the input data was 15000 rows long, with 3 different "hue" values).
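The 45000 vs. 15000 shapes are consistent with an index-on-index merge of frames whose labels each appear 3 times: every label then matches 3 x 3 = 9 row pairs, and 5000 labels x 9 = 45000. The sketch below shows that arithmetic with plain pandas at a smaller scale. It is an assumption about the mechanism, not seaborn's actual code, and `left` and `right` are made-up frames:

```python
import pandas as pd

n_items = 5    # stand-in for the 5000 rows per frame in the report
n_frames = 3

# Each label 0..n_items-1 appears n_frames times, as after a concat
# without ignore_index.
left = pd.concat([pd.DataFrame({"x": range(n_items)}) for _ in range(n_frames)])
right = pd.concat([pd.DataFrame({"extra": range(n_items)}) for _ in range(n_frames)])

# Merging on the index pairs every left row with every right row
# that shares its label.
merged = pd.merge(left, right, left_index=True, right_index=True)

print(len(left))    # 15
print(len(merged))  # 45 = n_items * n_frames**2
```

With a unique index the same merge would return exactly `len(left)` rows, which is why renumbering the input avoids the shape mismatch.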

Reproducible example:

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

n_items = 5000
n_floats = 5
n_categorical = 3

df1 = pd.DataFrame(
    np.random.random((n_items, n_floats)),
    columns=[f'float{i}' for i in range(n_floats)]
)
df1 = df1.assign(**{f'categorical{i}': np.random.randint(0, 15, n_items) for i in range(n_categorical)})

df2 = pd.DataFrame(
    np.random.random((n_items, n_floats)),
    columns=[f'float{i}' for i in range(n_floats)]
)
df2 = df2.assign(**{f'categorical{i}': np.random.randint(0, 15, n_items) for i in range(n_categorical)})

df3 = pd.DataFrame(
    np.random.random((n_items, n_floats)),
    columns=[f'float{i}' for i in range(n_floats)]
)
df3 = df3.assign(**{f'categorical{i}': np.random.randint(0, 15, n_items) for i in range(n_categorical)})

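# Concatenating without ignore_index keeps each frame's 0..n_items-1 labels,
# so every index label appears three times in df.
df = pd.concat([df1.assign(origin=1), df2.assign(origin=2), df3.assign(origin=3)])
print(df)

fg = sns.relplot(data=df, x='float1', y='float2', hue='origin', row='categorical1')
print('main', fg.data.shape)
# Adding the reference line is the step that triggers the ValueError.
fg.refline(y=0.5)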

plt.show()