I have a string column with a large number of unique values (~20% of the number of rows). When I call value_counts() on this column, it returns a wrong result: some values are counted only once.
import numpy as np
import vaex
rng = np.random.default_rng(42)
df = vaex.from_arrays(
    x=[str(x) for x in rng.integers(low=0, high=10_000, size=100_000)]
)
vc1 = df["x"].value_counts()
print(vc1.sum())
This prints 95790 (or another number less than 100 000), although the counts should sum to the number of rows, 100 000.
It happens only with string columns, and only if the number of unique values is large enough (the code above returns 100 000 if we change high to 1_000).
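To see which values are miscounted, one option is to build reference counts in plain Python and compare them against what value_counts() returns. This is only a sketch: `find_mismatches` is a hypothetical helper written for this report, not part of vaex, and it assumes the observed counts can be passed in as an ordinary mapping from value to count.

```python
from collections import Counter

import numpy as np

# Same data as in the reproduction above.
rng = np.random.default_rng(42)
values = [str(x) for x in rng.integers(low=0, high=10_000, size=100_000)]

# Reference counts computed in plain Python, independent of vaex.
expected = Counter(values)
total = sum(expected.values())
print(total)  # every row is counted exactly once, so this equals len(values)


def find_mismatches(expected, observed):
    """Return {value: (expected_count, observed_count)} for every disagreement.

    `observed` is any mapping from value to count (hypothetical helper,
    not part of vaex).
    """
    keys = set(expected) | set(observed)
    return {
        k: (expected.get(k, 0), observed.get(k, 0))
        for k in keys
        if expected.get(k, 0) != observed.get(k, 0)
    }
```

Assuming vc1 from the snippet above is a pandas Series indexed by value, `find_mismatches(expected, vc1.to_dict())` should list the affected values along with the expected and the reported counts.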
Software information
Vaex version (import vaex; vaex.__version__): {'vaex': '4.11.1', 'vaex-core': '4.11.1', 'vaex-viz': '0.5.2', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}
Additional information
Using groupby with agg='count' instead works fine.