Open · jacek-pliszka opened 3 weeks ago
Thanks for the report. On that size of data, you're just measuring overhead.
import pandas as pd

size = 100_000
df = pd.DataFrame({"M": ["M1", "M2"] * size, "P": ["P1", "P2"] * size, "V": [1.0, 2.0] * size})
i = df.set_index(['M','P']).index
%timeit i.unique("M")
# 466 μs ± 3.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit i.get_level_values('M').drop_duplicates()
# 3.43 ms ± 12.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
.unique is about 7 times faster here.
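A minimal sketch (my own, not from the thread) that scans a few sizes with the standard-library timeit illustrates the crossover: at tiny inputs the fixed per-call overhead dominates, while at larger inputs unique pulls ahead. The size and repetition choices below are arbitrary.

import timeit

import pandas as pd

# Scan a few sizes: at tiny sizes the fixed per-call overhead dominates;
# at larger sizes unique() is clearly faster.
for size in (1, 1_000, 100_000):
    df = pd.DataFrame({
        "M": ["M1", "M2"] * size,
        "P": ["P1", "P2"] * size,
        "V": [1.0, 2.0] * size,
    })
    i = df.set_index(["M", "P"]).index
    n = 200  # repetitions per measurement (arbitrary choice)
    t_unique = timeit.timeit(lambda: i.unique("M"), number=n)
    t_drop = timeit.timeit(
        lambda: i.get_level_values("M").drop_duplicates(), number=n
    )
    print(f"size={size:>7}: unique={t_unique / n * 1e6:8.1f} µs  "
          f"drop_duplicates={t_drop / n * 1e6:8.1f} µs")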
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
It is not very important, but still quite surprising: unique should be the method to use, and the faster one, yet here it is about twice as slow.
import pandas as pd

df = pd.DataFrame({"M": ["M1", "M2"], "P": ["P1", "P2"], "V": [1.0, 2.0]})
i = df.set_index(['M', 'P']).index
In [6]: %timeit i.unique("M")
30.9 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit i.get_level_values('M').drop_duplicates()
16.1 µs ± 84 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
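As a quick sanity check (my own sketch, not part of the original report), the two expressions return the same values, so the timing comparison is apples-to-apples:

import pandas as pd

df = pd.DataFrame({"M": ["M1", "M2"], "P": ["P1", "P2"], "V": [1.0, 2.0]})
i = df.set_index(["M", "P"]).index

# Both produce the unique values of level "M" in order of appearance.
assert i.unique("M").equals(i.get_level_values("M").drop_duplicates())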
Installed Versions
Prior Performance
No response