vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Joining filtered datasets throws an error #2171

Status: Closed (maxharlow closed 2 years ago)

maxharlow commented 2 years ago

I'm attempting to combine the columns from two filtered one-row dataframes. For example:

import vaex

df1 = vaex.from_arrays(numbers=['one', 'two', 'three'])
df2 = vaex.from_arrays(letters=['aaa', 'bbb', 'ccc'])

df1_filtered = df1[df1.numbers == 'two']

print('df1_filtered')
print(df1_filtered)

df2_filtered = df2[df2.letters == 'aaa']

print('df2_filtered')
print(df2_filtered)

joined = df1_filtered.join(df2_filtered)

print('joined')
print(joined)

This produces the following output:

df1_filtered
  #  numbers
  0  two
df2_filtered
  #  letters
  0  aaa
Traceback (most recent call last):
  File "/Users/maxharlow/Desktop/example.py", line 16, in <module>
    joined = df1_filtered.join(df2_filtered)
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/dataframe.py", line 6686, in join
    return vaex.join.join(**kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/join.py", line 287, in join
    dataset = left.dataset.merged(right_dataset)
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/dataset.py", line 1372, in merged
    return DatasetMerged(self, rhs)
  File "/opt/homebrew/lib/python3.9/site-packages/vaex/dataset.py", line 1194, in __init__
    raise ValueError(f'Merging datasets with unequal row counts ({self.left.row_count} != {self.right.row_count})')
ValueError: Merging datasets with unequal row counts (3 != 1)

I had expected the output to be a dataframe like this:

┌─────────┬─────────┐
│ numbers │ letters │
├─────────┼─────────┤
│ two     │ aaa     │
└─────────┴─────────┘

I'm using Python 3.9.13, and Vaex version: {'vaex': '4.11.1', 'vaex-core': '4.11.1', 'vaex-viz': '0.5.2', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}

JovanVeljanoski commented 2 years ago

Yeah, you need to do:

df1_filtered = df1_filtered.extract()
df2_filtered = df2_filtered.extract()

just before you do the join operation.

The explanation is basically the same as this one.

I hope this helps!

maxharlow commented 2 years ago

Ah! Thank you, I didn't realise that. Possibly worth adding something to the error message?

JovanVeljanoski commented 2 years ago

I think the message is clear enough if you are familiar with how vaex works. I would like a more explicit message, but it is hard to know the intent of the user.
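For readers less familiar with how vaex works, the mismatch in the error message can be illustrated with a toy model in plain Python. This is not vaex code: `ToyFrame`, `where`, `storage_rows`, and the `join` function are hypothetical names invented for illustration. The sketch assumes what the traceback suggests, namely that a filter only masks rows while the underlying data keeps all three, and that the join compares the left side's underlying row count (3) with the single row coming from the right side.

```python
class ToyFrame:
    """Hypothetical stand-in for a lazily filtered dataframe (not vaex code)."""

    def __init__(self, data, mask=None):
        self.data = data  # column name -> list holding every stored row
        n = len(next(iter(data.values())))
        self.mask = [True] * n if mask is None else mask

    def storage_rows(self):
        # Rows held in the underlying data, ignoring any filter.
        return len(self.mask)

    def __len__(self):
        # Rows that pass the filter.
        return sum(self.mask)

    def where(self, column, value):
        # Lazy filter: same underlying data, only the mask changes.
        mask = [m and v == value for m, v in zip(self.mask, self.data[column])]
        return ToyFrame(self.data, mask)

    def extract(self):
        # Materialize only the visible rows into a new, unfiltered frame.
        data = {name: [v for v, m in zip(values, self.mask) if m]
                for name, values in self.data.items()}
        return ToyFrame(data)

    def visible(self):
        return self.extract().data


def join(left, right):
    # Assumption for this toy: the right side's visible rows are attached
    # to the left side's underlying data, so those two counts must agree.
    if left.storage_rows() != len(right):
        raise ValueError(
            f'Merging datasets with unequal row counts '
            f'({left.storage_rows()} != {len(right)})')
    merged = dict(left.data)
    merged.update(right.extract().data)
    return ToyFrame(merged, left.mask)


df1 = ToyFrame({'numbers': ['one', 'two', 'three']}).where('numbers', 'two')
df2 = ToyFrame({'letters': ['aaa', 'bbb', 'ccc']}).where('letters', 'aaa')

try:
    join(df1, df2)
except ValueError as error:
    print(error)

print(join(df1.extract(), df2.extract()).visible())
```

In this model the filtered left frame shows one row but still stores three, which is why the counts disagree until `extract()` is called.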