vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.31k stars 591 forks

[BUG-REPORT] `get_groups` only returns the first group #1611

Open Ben-Epstein opened 3 years ago

Ben-Epstein commented 3 years ago

Description: The `get_group` method only returns the first group when a list of several group values is passed in.

import numpy as np
import vaex

n = 1000
label = np.random.randint(0, 20, size=n)
pred = np.random.randint(0, 20, size=n)

data = {'label': label, 'pred': pred}
df = vaex.from_arrays(**data)

# The exact counts vary between runs because the data is random, but the pattern holds.
print(df.groupby("label").get_group([0]).count())  # 32 (size of group 0)
print(df.groupby("label").get_group([1]).count())  # 52 (size of group 1)
print(df.groupby("label").get_group([0, 1]).count())  # 32: only group 0 is returned
print(df.groupby("label").get_group([1, 0]).count())  # 52: only group 1 is returned
print(df.groupby("label").get_group([0, 1])["label"].unique())  # [0]

Software information

maartenbreddels commented 3 years ago

Hi Ben,

What should the expected behavior be? I don't fully understand what the issue is.

Regards,

Maarten

Ben-Epstein commented 3 years ago

@maartenbreddels I'd expect that when passing two or more values into `get_group`, you'd get the rows for all of those groups. What we're seeing is that you only get the group for the first value in the list.
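
To make the expected semantics concrete, here is a minimal NumPy-only sketch of the result described above: the rows selected for `[0, 1]` should be the union of group 0 and group 1, regardless of the order of the values. It does not use or describe Vaex's API; `rows_in_groups` is a hypothetical helper introduced only for illustration, and the fixed seed is an assumption added to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) so the run is repeatable
label = rng.integers(0, 20, size=1000)


def rows_in_groups(values, wanted):
    """Hypothetical helper: boolean mask of rows whose value is any of `wanted`."""
    return np.isin(values, wanted)


count_0 = int((label == 0).sum())  # size of group 0
count_1 = int((label == 1).sum())  # size of group 1
mask = rows_in_groups(label, [0, 1])

# Expected: both groups are returned, so the count is the sum of the two sizes.
assert int(mask.sum()) == count_0 + count_1
# Expected: the order of the requested values does not change the selection.
assert np.array_equal(mask, rows_in_groups(label, [1, 0]))

print(count_0, count_1, int(mask.sum()))
print(np.unique(label[mask]))
```

Under these semantics, the third and fourth `print` calls in the report would show the same count, equal to the sum of the first two.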