mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

BUGFIX: Relplot error adding refline when duplicate indices present #3692

Open zacharygibbs opened 6 months ago

zacharygibbs commented 6 months ago

This is related to issue #3690

When the input data had duplicate indices, relplot was duplicating rows in its data when it didn't need to, which caused the subsequent refline call to fail.

I was able to solve this by modifying the grid_data merge step at the end of the relplot function to only merge when there's actually something to merge!
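
To illustrate the mechanism, here is a minimal pandas-only sketch, assuming the merge is an index-on-index `pd.merge` as the shapes below suggest. It is not the actual seaborn source: `merge_if_needed` is a hypothetical helper name, and the tiny frames stand in for the real data. Merging on an index whose labels each appear three times pairs every left row with every matching right row, so the row count triples even when the right-hand side contributes no columns.

```python
import pandas as pd

# Three small frames, each with the default index 0..1. Concatenating them
# without ignore_index=True leaves every index label duplicated three times.
parts = [pd.DataFrame({"x": [0.1, 0.2], "origin": k}) for k in (1, 2, 3)]
df = pd.concat(parts)  # index is 0, 1, 0, 1, 0, 1

# Stand-in for the plot data: same index, no columns beyond those in df.
plot_data = df[["x", "origin"]]
extra_cols = plot_data.columns.difference(df.columns)  # empty Index

# Unconditional index merge: each label matches 3 x 3 = 9 times.
inflated = pd.merge(df, plot_data[extra_cols], left_index=True, right_index=True)
print(df.shape, inflated.shape)  # (6, 2) (18, 2)


def merge_if_needed(original, plot_data):
    """Hypothetical guard: merge only when there are extra columns to bring in."""
    extra = plot_data.columns.difference(original.columns)
    if extra.empty:
        return plot_data
    return pd.merge(original, plot_data[extra], left_index=True, right_index=True)


guarded = merge_if_needed(df, plot_data)
print(guarded.shape)  # (6, 2)
```

Note that a guard like this only skips the merge when there is nothing to add. If extra columns do exist and the index is duplicated, an index-based merge would still inflate the row count.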

Print debugging before the fix (shape of the self.data DataFrame):

relplot (15000, 4)
relplot_before_grid_data (15000, 4)
relplot_after_grid_data (45000, 9)
main (45000, 9)
---------------------------------------------------------------------------
ValueError: operands could not be broadcast together with shapes (45000,) (15000,) 
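
The error message itself can be reproduced with NumPy alone. The sketch below assumes, without confirming it from the seaborn source, that refline ends up combining a boolean mask sized for the original 15000 rows with one sized for the inflated 45000 rows. The names `stale_mask` and `row_mask` are illustrative only.

```python
import numpy as np

# Mask computed from the original data (15000 rows).
stale_mask = np.ones(15000, dtype=bool)
# Mask computed from the inflated data (45000 rows).
row_mask = np.ones(45000, dtype=bool)

message = None
try:
    combined = row_mask & stale_mask  # lengths differ and neither is 1
except ValueError as err:
    message = str(err)

print(message)
```

Two one-dimensional arrays broadcast only when their lengths match or one of them has length 1, which is why any mismatch in row count surfaces as this ValueError.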

Print debugging after the fix (shape of the self.data DataFrame):

relplot (15000, 4)
relplot_before_grid_data (15000, 4)
relplot_after_grid_data (15000, 4)
main (15000, 4)

Reproducible example (from the original issue, for reference):

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

n_items = 5000
n_floats = 5
n_categorical = 3


def make_frame():
    # Random float columns plus integer-coded categorical columns.
    frame = pd.DataFrame(
        np.random.random((n_items, n_floats)),
        columns=[f'float{i}' for i in range(n_floats)],
    )
    return frame.assign(
        **{f'categorical{i}': np.random.randint(0, 15, n_items) for i in range(n_categorical)}
    )


# Each frame has the default index 0..n_items-1, so concatenating without
# ignore_index=True leaves every index label duplicated three times.
df = pd.concat([make_frame().assign(origin=k) for k in (1, 2, 3)])
print(df)

fg = sns.relplot(data=df, x='float1', y='float2', hue='origin', row='categorical1')
print('main', fg.data.shape)
fg.refline(y=0.5)  # raises the ValueError shown above before the fix

plt.show()

zacharygibbs commented 5 months ago

Any word on this pull request? Seems like a simple fix, what's the hold up?