Some benchmarking for a specific optimization for the common case of setting everything to a value which is already in the index:
```python
In [31]: import pandas as pd
    ...:
    ...: mi = pd.MultiIndex.from_frame(pd.DataFrame({"eins": [str(x) for x in range(5000)], "zwei": [str(-x) for x in range(5000)]}))
    ...:
    ...: def set_name_via_df(mi, name, value):
    ...:     df = mi.to_frame()
    ...:     df[name] = value
    ...:     mi = pd.MultiIndex.from_frame(df.astype("category"))
    ...:     return mi
    ...:
    ...: def set_name_via_codes(mi, name, value):
    ...:     level_i = mi.names.index(name)
    ...:     value_i = mi.levels[level_i].to_list().index(value)
    ...:     return mi.set_codes([value_i]*len(mi), level=name)
    ...:
    ...: %timeit set_name_via_df(mi, "zwei", "-300")
    ...:
    ...: %timeit set_name_via_codes(mi, "zwei", "-300")
    ...:
2.34 ms ± 93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
307 µs ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
So, this gets us a 10x speedup in this benchmark. Unfortunately, it is a bit limited in scope, but it would probably be easy to extend to values not yet in the index, as long as it is only a single value. For multiple values, it gets a bit more interesting (a sketch for that case follows the benchmarks below).
Slightly slower, but works for a value not in the current index:
```python
    ...: def set_name_via_codes_levels(mi, name, value):
    ...:     return mi.set_codes([0]*len(mi), level=name).set_levels([value], level=name)
    ...:
    ...: %timeit set_name_via_codes_levels(mi, "zwei", "300")
    ...:
404 µs ± 3.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
An obvious enhancement (using a numpy array for the codes) makes it almost as fast as the variant restricted to existing values:
```python
    ...: import numpy as np
    ...:
    ...: def set_name_via_codes_levels(mi, name, value):
    ...:     return mi.set_codes(np.zeros(len(mi), dtype=int), level=name).set_levels([value], level=name)
    ...:
    ...: %timeit set_name_via_codes_levels(mi, "zwei", "300")
322 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
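
For the multiple-values case mentioned above, one conceivable approach (just a sketch, not benchmarked; the helper name `set_name_via_factorize` is made up) is to factorize the new values once and rebuild only the affected level, which still avoids the `to_frame`/`from_frame` round trip:

```python
import numpy as np
import pandas as pd

def set_name_via_factorize(mi, name, values):
    # hypothetical helper: replace one level of a MultiIndex with an
    # arbitrary array of values (one per row) without going through
    # to_frame/from_frame
    level_i = mi.names.index(name)
    codes, uniques = pd.factorize(np.asarray(values))
    new_levels = list(mi.levels)
    new_codes = list(mi.codes)
    new_levels[level_i] = pd.Index(uniques)
    new_codes[level_i] = codes
    return pd.MultiIndex(levels=new_levels, codes=new_codes, names=mi.names)
```

Building the index directly via the `pd.MultiIndex` constructor keeps the new codes and the new level consistent, so there is no intermediate state in which new codes would be validated against the old level.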
**Is your feature request related to a problem? Please describe.**
In a fairly sizeable data processing pipeline, almost 1/3 of the time is spent in `ScmRun.__setitem__`, which seems excessive. Most of that time is spent in `pd.MultiIndex.to_frame` and `pd.MultiIndex.from_frame`, see https://github.com/openscm/scmdata/blob/4993a1a274a1759061136689badf9606c4a45705/src/scmdata/run.py#L670. The conversion to categorical dtypes also seems to be a major time sink: in the processing pipeline, `pd.Categorical.__init__` is called half a million times and consumes 17 % of the total time.

**Describe the solution you'd like**
I hope that, given that the MultiIndex is already categorical, some alternative formulation can avoid most of the conversions.
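
As an illustration of what I mean (just a sketch, not the actual scmdata internals; the function name is hypothetical), the common case could take a fast path when the column already exists as a level, and fall back to the current `to_frame`/`from_frame` route otherwise:

```python
import numpy as np
import pandas as pd

def fast_set_level(mi, name, value):
    # sketch: rewrite codes/levels directly when the column is already a
    # level of the MultiIndex (assumes value is a scalar), otherwise fall
    # back to the frame round trip used today
    if name in mi.names:
        return mi.set_codes(
            np.zeros(len(mi), dtype=int), level=name
        ).set_levels([value], level=name)
    df = mi.to_frame()
    df[name] = value
    return pd.MultiIndex.from_frame(df.astype("category"))
```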
**Describe alternatives you've considered**
Another solution would be to offer some higher-level `set_meta`-like function which works in place and can update multiple columns in one step, so that a user can batch metadata operations and avoid some of the overhead incurred by converting the MultiIndex to a DataFrame and back.
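
To make that concrete, here is a very rough sketch of such a batched update (the function and variable names are hypothetical, not an existing scmdata API; in `ScmRun` the result would be assigned back to the run's metadata in place). The point is that the `to_frame`/`from_frame` and categorical conversion cost is paid once per batch instead of once per column:

```python
import pandas as pd

def set_meta_batch(mi, updates):
    # hypothetical batched metadata update: 'updates' maps column names to
    # new values; the MultiIndex is rebuilt only once for the whole batch
    df = mi.to_frame(index=False)
    for name, value in updates.items():
        df[name] = value
    return pd.MultiIndex.from_frame(df.astype("category"))

# usage sketch (names purely illustrative):
# new_index = set_meta_batch(run_index, {"model": "MAGICC7", "scenario": "ssp119"})
```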