vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[FEATURE-REQUEST] vaex.agg.cpercent #2259

Closed: bls-lehoai closed this issue 1 year ago

bls-lehoai commented 1 year ago

Hi there, I need to calculate the percentage each group represents after a "group by" (group count / total count * 100). It would be very convenient to have a "vaex.agg.cpercent" aggregator for this. Right now I have to count each group and compute the percentages myself.

JovanVeljanoski commented 1 year ago

Hi,

That is already possible within the aggregation, for example:


import vaex

df = vaex.example()

# Option 1:
df_grouped = df.groupby('id').agg({'count_percentage': vaex.agg.count() / df.shape[0] * 100 })
print(df_grouped)

# Option 2 (probably what you are doing?)
df_grouped = df.groupby('id').agg({'count': vaex.agg.count()})
df_grouped['count_percentage'] = df_grouped['count'] / df.shape[0] * 100
print(df_grouped)

I think this is enough. Adding a specific aggregator to do the above would be possible, but I feel it would bloat the API, since it does not really add new functionality (it is just an arithmetic combination of existing aggregators and constants).
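To make the arithmetic concrete, here is a minimal plain-Python sketch of what both options above compute, with no vaex involved. The function name `count_percentage` and the sample `ids` list are made up for illustration; it can serve as a reference to check the grouped result against on a small sample.

```python
from collections import Counter


def count_percentage(ids):
    """Map each group key to its share of all rows, in percent."""
    total = len(ids)
    if total == 0:
        # No rows: nothing to group, and it avoids dividing by zero.
        return {}
    counts = Counter(ids)  # group key -> number of rows in that group
    return {group: count / total * 100 for group, count in counts.items()}


# Made-up stand-in for the 'id' column of a DataFrame.
ids = [0, 0, 1, 2, 2, 2, 2, 3]
percentages = count_percentage(ids)
print(percentages)
```

Since every row belongs to exactly one group, the percentages should add up to 100, which is a quick sanity check on the grouped output as well.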

If you just want a shortcut because you use this a lot in your project, you can probably write an extension yourself, following this part of the tutorial.

Also, arithmetic combinations of existing aggregators are allowed, for example:

# following the example above
df.groupby('id').agg({'mean_over_std': vaex.agg.mean('x') / vaex.agg.std('y') })

Is this what you mean? It is possible I've misunderstood you completely.
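For the mean-over-std example, the idea is that each aggregate is computed per group on its own and the results are then divided group by group. Below is a rough plain-Python sketch of that idea, not of how vaex implements it. The helper name `mean_over_std` and the sample rows are hypothetical, and it assumes the population standard deviation (ddof=0); check the `ddof` argument of `vaex.agg.std` for the convention vaex applies.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def mean_over_std(rows):
    """rows: iterable of (group, x, y). Returns group -> mean(x) / std(y)."""
    xs = defaultdict(list)
    ys = defaultdict(list)
    for group, x, y in rows:
        xs[group].append(x)
        ys[group].append(y)
    # Each aggregate is evaluated per group, then combined by division.
    # Assumption: population std (ddof=0). A group whose y values are all
    # equal has std 0 and would raise ZeroDivisionError here.
    return {group: fmean(xs[group]) / pstdev(ys[group]) for group in xs}


# Made-up rows: (id, x, y)
rows = [("a", 2, 1), ("a", 4, 3), ("b", 10, 0), ("b", 20, 4)]
ratios = mean_over_std(rows)
print(ratios)
```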

bls-lehoai commented 1 year ago

@JovanVeljanoski Thank you so much!