vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] nunique Aggregation differs in result with the datatype of the column. #2197

Closed vignesh-bungee closed 1 year ago

vignesh-bungee commented 2 years ago

Hi Vaex Team, I have a column source_sku_upc (dtype: string) and another column source_sku_upc_id (dtype: int64). source_sku_upc_id is simply an ID column for source_sku_upc. Doing nunique on source_sku_upc returns the correct result, but doing nunique on the source_sku_upc_id column returns a number that is always the correct result + 1 for each group. On inspection, it seems that 0 is also being included from somewhere. It looks like a bug; can you please confirm?
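For reference, the expected per-group counts can be cross-checked without Vaex. The following is only a minimal sketch in plain Python: the helper `nunique_per_group`, the grouping column `store`, and the sample rows are hypothetical and are not taken from the attached data. It shows that the string column and its ID column should yield identical distinct counts per group.

```python
from collections import defaultdict


def nunique_per_group(rows, group_key, value_key):
    """Count the distinct values of value_key within each group_key."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[group_key]].add(row[value_key])
    return {group: len(values) for group, values in seen.items()}


# Hypothetical rows; "store" is an assumed grouping column for illustration.
rows = [
    {"store": "A", "source_sku_upc": "0001", "source_sku_upc_id": 1},
    {"store": "A", "source_sku_upc": "0002", "source_sku_upc_id": 2},
    {"store": "A", "source_sku_upc": "0001", "source_sku_upc_id": 1},
    {"store": "B", "source_sku_upc": "0003", "source_sku_upc_id": 3},
]

by_string = nunique_per_group(rows, "store", "source_sku_upc")
by_id = nunique_per_group(rows, "store", "source_sku_upc_id")

# Because the ID column maps one-to-one onto the string column,
# both counts should agree for every group.
print(by_string)
print(by_id)
```

If the int64 column reports one more than this reference count in every group, an extra value (such as the 0 mentioned above) is being counted that is not present in the data.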

Software information

Additional information: Jupyter notebook and sample data attached.

vignesh-bungee commented 2 years ago

ErrorSampleData.csv

JovanVeljanoski commented 2 years ago

Hi!

Thanks for reporting this. It looks like a bug to me. Let me see if I can create an elegant test for this so we can hopefully fix it quickly.

Cheers, J.