vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] rename when the new name is already a column has unexpected results #2343

Open Ben-Epstein opened 1 year ago

Ben-Epstein commented 1 year ago

So this is a tricky little bug.

Because we were renaming but not dropping the original columns, sometimes vaex wouldn't overwrite correctly (I'll make an issue in the vaex github).

You can run these to understand the issue fully:

import vaex
import numpy as np
df = vaex.example()[["x","y"]]
df["data_x"] = np.random.rand(len(df))
df["data_y"] = np.random.rand(len(df))

df.rename("data_x", "x")
df.rename("data_y", "y")

This will work as expected. The dataframe will show two columns, x and y, and their values will match those of data_x and data_y.

This will fail:

df = vaex.from_arrays(
    data_x = np.random.rand(1000),
    data_y = np.random.rand(1000),
    x = np.random.rand(1000),
    y = np.random.rand(1000)
)
df.rename("data_x", "x")
df.rename("data_y", "y")

The reason has to do with the state. Call df.state_get() on either dataframe.

You'll see something like this:

{'virtual_columns': {},
 'column_names': ['x', 'y', 'x', 'y'],
 'renamed_columns': [('data_x', 'x'), ('data_y', 'y')],
...
}

You can see that the columns are ["x", "y", "x", "y"]. The issue is that whichever x and y come second are the ones used. So if data_x and data_y were "first" in the dataframe when we rename them, the rename won't work as expected.
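To check whether a dataframe has ended up in this state, one option is to look for repeated entries in the state's column_names list. The snippet below is only a sketch: duplicated_column_names is a hypothetical helper (not part of vaex), and it assumes the state is a dict with a 'column_names' key, as in the output shown above.

```python
from collections import Counter


def duplicated_column_names(state):
    """Return the column names that occur more than once in a state dict.

    Hypothetical helper, not part of vaex. `state` is assumed to look like
    the result of df.state_get(), i.e. a dict with a 'column_names' list.
    """
    counts = Counter(state["column_names"])
    return sorted(name for name, n in counts.items() if n > 1)


# State fragment copied from the report above (other keys omitted).
state = {
    "virtual_columns": {},
    "column_names": ["x", "y", "x", "y"],
    "renamed_columns": [("data_x", "x"), ("data_y", "y")],
}

print(duplicated_column_names(state))  # ['x', 'y']
```

With a real dataframe you would pass df.state_get() instead of the literal dict. An empty result means no name is listed twice.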

What should happen?

Ideally, if the column already exists, it should be renamed to a hidden _column_ and the new one should take over.

But at a minimum, vaex should throw an error saying that you cannot rename to a column that already exists. Either of these would work, but ideally the first.
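Until something like that exists in vaex itself, the "minimum" behaviour can be approximated on the user side with a small wrapper. This is a rough sketch, not a proposed patch: rename_strict is a hypothetical name, and the only vaex calls it relies on are df.get_column_names(hidden=True) and df.rename(old, new).

```python
def rename_strict(df, old_name, new_name):
    """Rename a column, but refuse to rename onto an existing column.

    Hypothetical user-side helper, not part of vaex. `df` is expected to be
    a vaex DataFrame (anything with get_column_names() and rename() works).
    """
    # hidden=True so that hidden columns are also treated as taken names.
    existing = df.get_column_names(hidden=True)
    if old_name not in existing:
        raise KeyError(f"no column named {old_name!r}")
    if new_name in existing:
        raise ValueError(
            f"cannot rename {old_name!r} to {new_name!r}: "
            f"a column named {new_name!r} already exists"
        )
    df.rename(old_name, new_name)
```

With the failing example above, rename_strict(df, "data_x", "x") would raise a ValueError instead of silently producing two columns called x. The caller can then decide what to do with the existing column before retrying.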

Software information