vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] join(): rprefix not systematically taken into account #1590

Open yohplala opened 2 years ago

yohplala commented 2 years ago

Description

When running join operations in a loop with the rprefix parameter, I noticed that the parameter is not applied consistently: it is always ignored on the 2nd iteration of the loop. Why is that?

Minimal Reproducible Example

import numpy as np
import vaex as vx

n_cols = 3
n_rows = 3
n_df = 3

# Generate test data: a list of vaex DataFrames, for joining in a 2nd step.
index = np.arange(n_rows)
x = np.ones(n_rows)
vdfs = []
for _ in range(n_df):
    vdf = vx.from_arrays(**{f'x_{i}': x*i+1 for i in range(n_cols)},
                         timestamp=index)
    vdfs.append(vdf)

# Join with vaex.
left = None
for i, vdf in enumerate(vdfs):
    rprefix = f'vdf_{i}_'
    if i < 3:
        print(rprefix)
    try:
        left = left.join(vdf, on='timestamp', rprefix=rprefix, how='inner')\
                   .drop(rprefix+'timestamp')
        if i < 3:
            print(left.get_column_names())
            print('')
    except AttributeError:
        left = vdf
        for col in left.get_column_names(regex='^(?!timestamp)'):
            left.rename(col, rprefix+col)
        print(left.get_column_names())
        print('')

The print statements display rprefix and the resulting column names as the vaex DataFrame left gains new columns, for the first 3 iterations only. The output is:

vdf_0_
['vdf_0_x_0', 'vdf_0_x_1', 'vdf_0_x_2', 'timestamp']

vdf_1_
['vdf_0_x_0', 'vdf_0_x_1', 'vdf_0_x_2', 'timestamp', 'x_0', 'x_1', 'x_2']

vdf_2_
['vdf_0_x_0', 'vdf_0_x_1', 'vdf_0_x_2', 'timestamp', 'x_0', 'x_1', 'x_2', 'vdf_2_x_0', 'vdf_2_x_1', 'vdf_2_x_2']

The problem is in the 2nd block (2nd iteration). The columns 'x_0', 'x_1', 'x_2' appear without a prefix, which shows that rprefix was not applied on that iteration. The expected column names are 'vdf_1_x_0', 'vdf_1_x_1', 'vdf_1_x_2'.

On the 3rd iteration, rprefix is applied correctly (and likewise on the following iterations).

Please, is this a bug?

Software information

maartenbreddels commented 2 years ago

Hi,

thanks for the report! @JovanVeljanoski, I think this is expected behaviour, but I personally hate it. I think the prefix/suffix should always be applied, but as it stands it is only applied when column names collide. My guess is that this comes from pandas compatibility, but I'm happy to change it if @JovanVeljanoski agrees and someone wants to write a test for this.

cheers,

Maarten
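
To make the collision-only rule described above concrete, here is a minimal plain-Python sketch of the naming logic. It is a toy model, not vaex's actual implementation: the helper joined_column_names is hypothetical, and it assumes the only rule in play is "prefix a right-hand column when its name already exists on the left". Under that assumption it reproduces the column names reported for the 2nd and 3rd iterations.

```python
def joined_column_names(left_cols, right_cols, rprefix):
    """Toy model (hypothetical helper, not vaex code): a right-hand column
    gets rprefix only if its name collides with a left-hand column."""
    taken = set(left_cols)
    renamed = [rprefix + c if c in taken else c for c in right_cols]
    return list(left_cols) + renamed


# Columns of each right-hand DataFrame in the report.
right = ['x_0', 'x_1', 'x_2', 'timestamp']
# State of `left` after the first iteration (columns renamed by hand).
left = ['vdf_0_x_0', 'vdf_0_x_1', 'vdf_0_x_2', 'timestamp']

for i in (1, 2):
    rprefix = f'vdf_{i}_'
    left = joined_column_names(left, right, rprefix)
    # Mirrors the .drop(rprefix + 'timestamp') call in the report.
    left.remove(rprefix + 'timestamp')
    print(rprefix, left)
```

On the 2nd iteration only 'timestamp' collides, so 'x_0', 'x_1', 'x_2' pass through unprefixed. On the 3rd iteration those unprefixed names are already on the left, so the new columns collide and receive the prefix.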