openscm / scmdata

Handling of Simple Climate Model data
https://scmdata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

`ScmRun.__setitem__` is slow #246

Closed mikapfl closed 1 year ago

mikapfl commented 1 year ago

Is your feature request related to a problem? Please describe.

In a fairly sizeable data processing pipeline, almost 1/3 of the time is spent in `ScmRun.__setitem__`, which seems excessive. Most of that time goes to `pd.MultiIndex.to_frame` and `pd.MultiIndex.from_frame` (see https://github.com/openscm/scmdata/blob/4993a1a274a1759061136689badf9606c4a45705/src/scmdata/run.py#L670 ). The conversion to categorical dtypes also seems to be a major time sink: in the processing pipeline, `pd.Categorical.__init__` is called half a million times and consumes 17 % of the total time.

Describe the solution you'd like

Given that the MultiIndex is already categorical, I hope some alternative formulation can avoid most of the conversions.

Describe alternatives you've considered

Another solution would be to offer a higher-level, `set_meta`-like function that works in place and can update multiple columns in one step, so that users can batch metadata operations and avoid some of the overhead incurred by converting the MultiIndex to a frame and back.
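To make the batching idea concrete, here is a minimal sketch, assuming a bare `pd.MultiIndex` rather than a full `ScmRun`. The name `set_meta_batch` and its `updates` argument are hypothetical and not part of scmdata. It pays for the frame round trip once, no matter how many columns are updated. Since a `MultiIndex` is immutable, it returns a new index rather than working in place.

```python
import pandas as pd


def set_meta_batch(mi: pd.MultiIndex, updates: dict) -> pd.MultiIndex:
    """Hypothetical helper: set several metadata levels in one round trip.

    Each value in ``updates`` is either a scalar (broadcast to all rows)
    or a sequence with one entry per row.
    """
    # One conversion to a frame, regardless of the number of updates.
    df = mi.to_frame(index=False)
    for name, value in updates.items():
        df[name] = value
    # One conversion back, mirroring the categorical conversion
    # shown in the issue's linked code path.
    return pd.MultiIndex.from_frame(df.astype("category"))


mi = pd.MultiIndex.from_frame(
    pd.DataFrame({"eins": ["a", "b", "c"], "zwei": ["x", "y", "z"]})
)
# Overwrite "zwei" with a scalar and add a new level "drei" in one call.
new_mi = set_meta_batch(mi, {"zwei": "q", "drei": ["1", "2", "3"]})
```

With this shape, setting `n` columns costs one `to_frame`/`from_frame` pair instead of `n` pairs, which is the overhead the batching proposal is trying to avoid.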

mikapfl commented 1 year ago

Some benchmarking of a specific optimization for the common case of setting a whole level to a single value which is already in the index:

In [31]: import pandas as pd
    ...: 
    ...: mi = pd.MultiIndex.from_frame(pd.DataFrame({"eins": [str(x) for x in range(5000)], "zwei": [str(-x) for x in range(5000)]}))
    ...: 
    ...: def set_name_via_df(mi, name, value):
    ...:     df = mi.to_frame()
    ...:     df[name] = value
    ...:     mi = pd.MultiIndex.from_frame(df.astype("category"))
    ...:     return mi
    ...: 
    ...: def set_name_via_codes(mi, name, value):
    ...:     level_i = mi.names.index(name)
    ...:     value_i = mi.levels[level_i].to_list().index(value)
    ...:     return mi.set_codes([value_i]*len(mi), level=name)
    ...: 
    ...: %timeit set_name_via_df(mi, "zwei", "-300")
    ...: 
    ...: %timeit set_name_via_codes(mi, "zwei", "-300")
    ...: 
2.34 ms ± 93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
307 µs ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So, this gets us a roughly 7.6x speedup in this benchmark (2.34 ms vs. 307 µs). Unfortunately, it is a bit limited in scope, but it is probably easy to extend to values not yet in the index, as long as there is only one value. For multiple values, it gets a bit more interesting.
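For the multiple-values case, one possible approach (a sketch, not something benchmarked in this thread) is to factorize the new per-row values and rebuild the index from the existing codes and levels, replacing only the affected level. The helper name `set_level_values` is hypothetical. It builds a new `MultiIndex` directly instead of chaining `set_codes` and `set_levels`, because either order of chaining can fail the integrity check when the number of distinct values changes.

```python
import pandas as pd


def set_level_values(mi: pd.MultiIndex, name: str, values) -> pd.MultiIndex:
    """Hypothetical helper: replace one level with per-row values."""
    if len(values) != len(mi):
        raise ValueError("need exactly one value per row of the index")
    # codes[i] is the position of values[i] in uniques (order of appearance).
    codes, uniques = pd.Index(values).factorize()
    level_i = mi.names.index(name)
    # Reuse codes and levels of all other levels untouched.
    new_levels = list(mi.levels)
    new_codes = list(mi.codes)
    new_levels[level_i] = uniques
    new_codes[level_i] = codes
    return pd.MultiIndex(levels=new_levels, codes=new_codes, names=mi.names)


mi = pd.MultiIndex.from_frame(
    pd.DataFrame({"eins": ["a", "b", "c", "d"], "zwei": ["w", "x", "y", "z"]})
)
new_mi = set_level_values(mi, "zwei", ["p", "q", "p", "new"])
```

Only the modified level is factorized here; whether this beats the frame round trip for realistic index sizes would need to be timed in the same way as the single-value variants.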

mikapfl commented 1 year ago

Slightly slower, but works for a value not in the current index:

    ...: def set_name_via_codes_levels(mi, name, value):
    ...:     return mi.set_codes([0]*len(mi), level=name).set_levels([value], level=name)
    ...: 
    ...: %timeit set_name_via_codes_levels(mi, "zwei", "300")
    ...: 
404 µs ± 3.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

mikapfl commented 1 year ago

An obvious enhancement, which makes it almost as fast as `set_name_via_codes`, is to pass the codes as a NumPy array instead of a Python list:

    ...: import numpy as np
    ...: 
    ...: def set_name_via_codes_levels(mi, name, value):
    ...:     return mi.set_codes(np.zeros(len(mi), dtype=int), level=name).set_levels([value], level=name)
    ...: 
    ...: %timeit set_name_via_codes_levels(mi, "zwei", "300")
322 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

mikapfl commented 1 year ago

See https://github.com/openscm/scmdata/pull/247