vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] value_counts() on string column returns wrong values #2146

Closed vonglod closed 2 years ago

vonglod commented 2 years ago

Description

I have a string column with a large number of unique values (~20% of the number of rows). When I call value_counts() on this column, it returns a wrong result: some values are counted only once.

import numpy as np
import vaex

rng = np.random.default_rng(42)
df = vaex.from_arrays(
    x=[str(x) for x in rng.integers(low=0, high=10_000, size=100_000)]
)

vc1 = df["x"].value_counts()
print(vc1.sum())

This prints 95790 (or another number less than 100 000) instead of the expected 100 000.

It happens only with string columns, and only if the number of unique values is large enough (the code above returns 100 000 if we change high to 1_000).

Software information

Additional information

groupby with agg='count' works fine.
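
To make the comparison concrete, here is a minimal sketch that checks both code paths against a pure-Python reference built with collections.Counter. The helper names (make_values, reference_counts, compare) are hypothetical and only for illustration. The vaex part is skipped if vaex is not installed, and it assumes that groupby with agg='count' names its result column "count".

```python
from collections import Counter

import numpy as np

try:
    import vaex  # optional: the vaex comparison is skipped if it is not installed
except ImportError:
    vaex = None


def make_values(high, size=100_000, seed=42):
    """Generate the same kind of string data as in the report."""
    rng = np.random.default_rng(seed)
    return [str(x) for x in rng.integers(low=0, high=high, size=size)]


def reference_counts(values):
    """Count occurrences in pure Python, independent of vaex."""
    return Counter(values)


def compare(high):
    """Hypothetical helper: total row counts from the reference and, if available, vaex."""
    values = make_values(high)
    totals = {"reference": sum(reference_counts(values).values())}
    if vaex is not None:
        df = vaex.from_arrays(x=values)
        # value_counts() is the call reported as returning wrong totals.
        totals["value_counts"] = int(df["x"].value_counts().sum())
        # Assumption: the aggregated column is named "count".
        totals["groupby"] = int(df.groupby("x", agg="count")["count"].sum())
    return totals


for high in (1_000, 10_000):
    print(high, compare(high))
```

Every total should equal the number of rows (100 000). A smaller "value_counts" total for high=10_000 would reproduce the behaviour described above.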

JovanVeljanoski commented 2 years ago

Hi @adolganov

Thank you for reporting this, and for the clean example! Much appreciated. Will try to fix this soon!