Closed colinfang closed 5 years ago
It seems like a lot of time is spent in Categorical.__init__(Series[Categorical])
, when that should essentially just be
if isinstance(data, (ABCSeries, ABCIndexClass)):
data = data._values
if isinstance(data, type(self)):
dtype = data.dtype
codes = data.codes
We would need to handle not-default args for categories
, ordered
, and codes
.
@colinfang can you post your pandas version?
I am using v0.23.4. The performance for v0.20.3 is good.
Can you see if my suggestion would fix it? I think it may need to happen even before https://github.com/pandas-dev/pandas/blob/029d57cb828b053d112ed4e23f446ee07d147935/pandas/core/arrays/categorical.py#L331, because we need to know whether the user passed None for all the optional parameters.
On Tue, Nov 20, 2018 at 6:30 AM colinfang notifications@github.com wrote:
I am using v0.23.4. The performance for v0.20.3 is good.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/23814#issuecomment-440256954, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHInyzbvgbkKtMKetkzNMTK_Uvb3AMks5uw_X2gaJpZM4Yq9mh .
@TomAugspurger / @colinfang , I tried inserting code similar to Tom's suggestion, near the beginning of Categorical.init, and it appears it does indeed fix the perf issue. I also noticed there is a "fastpath" in that method also, not sure if that is applicable in this case.
I'd be interested in working on a fix for this issue, unless you were wanting to @colinfang. I understand there would need to be some more investigation about where exactly to do this short-circuiting, and handling non-default args. Also, are there perf tests that can be used for this type of thing (those are probably tricky since it may be machine dependent, not sure if there is a good solution for that)?
if isinstance(values, (ABCSeries, ABCIndexClass)):
values = values._values
if isinstance(values, type(self)):
self._dtype = values.dtype
self._codes = values.codes
return
In [1]: import pandas as pd
In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')
In [3]: %timeit s == 'a'
3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'd say go ahead and work on this.
You'd need to add an ASV in asv_bench/benchmarks/categoricals.py for timing Categorical(categorical).
On Tue, Nov 20, 2018 at 5:09 PM Erik notifications@github.com wrote:
@TomAugspurger https://github.com/TomAugspurger / @colinfang https://github.com/colinfang , I tried inserting code similar to Tom's suggestion, near the beginning of Categorical.init, and it appears it does indeed fix the perf issue. I also noticed there is a "fastpath" in that method also, not sure if that is applicable in this case.
I'd be interested in working on a fix for this issue, unless you were wanting to @colinfang https://github.com/colinfang. I understand there would need to be some more investigation about where exactly to do this short-circuiting, and handling non-default args. Also, are there perf tests that can be used for this type of thing (those are probably tricky since it may be machine dependent, not sure if there is a good solution for that)?
if isinstance(values, (ABCSeries, ABCIndexClass)): values = values._valuesif isinstance(values, type(self)): self._dtype = values.dtype self._codes = values.codes return
In [1]: import pandas as pd
In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')
In [3]: %timeit s == 'a'3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/23814#issuecomment-440461928, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIofWioeIOoIYkS8aCOjaEAU0tKVKks5uxIu9gaJpZM4Yq9mh .
One option I saw may be to short-circuit within _recode_for_categories() (not sure if this would help in any other scenarios also). It may be safer, since all of the other logic for the other parameters runs as it did before? In _recode_for_categories(), I tried adding the following code:
if new_categories.equals(old_categories):
# Same categories, so no need to actually recode
return codes.copy()
Which yielded:
In [3]: %timeit s == 'a'
7.29 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.14 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If I instead add the following code basically at the beginning of init() (the approach first discussed):
if categories is None and ordered is None and dtype is None:
if isinstance(values, (ABCSeries, ABCIndexClass)):
values = values._values
if isinstance(values, type(self)):
self._dtype = values.dtype
self._codes = values.codes.copy()
return
This yielded:
In [3]: %timeit s == 'a'
5.38 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, not a huge difference between the two approaches. Also, I'm not sure whether I need to copy() the codes as I have in the above snippets. @TomAugspurger any thoughts? I can just send a PR and we can continue to discuss there if you prefer..
I think the change to Categorical.__init__
is clearly good.
The downside to adding a .equals
check to _recode_for_categories
is that you always have to pay the cost to check .equals
, even if they aren't actually equal. I suspect that in many cases where you'd end up taking the shortcut in _recode_for_categories
, you'll already know whether on not the categories are equal or not. Still, there may be some cases where you can't know that, so it may be OK to just check that.
Sounds good, I will go with the init change.
Are the following 2 ways to compare a series to a scalar equivalent (ignore missing values)? I have to write the hard way in order to take advantage of the category properties.