pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.76k stars 17.96k forks source link

equality comparison with a scalar is slow for category (performance regression) #23814

Closed colinfang closed 5 years ago

colinfang commented 5 years ago

Are the following 2 ways to compare a series to a scalar equivalent (ignore missing values)? I have to write the hard way in order to take advantage of the category properties.

```python
x = pd.Series(list('abcd') * 1000000).astype('category')
%timeit x == 'a'
# 10 loops, best of 3: 25.2 ms per loop
%timeit x.cat.codes == x.cat.categories.get_loc('a')
# 1000 loops, best of 3: 750 µs per loop
```
TomAugspurger commented 5 years ago

It seems like a lot of time is spent in Categorical.__init__(Series[Categorical]), when that should essentially just be

if isinstance(data, (ABCSeries, ABCIndexClass)):
    data = data._values
if isinstance(data, type(self)):
    dtype = data.dtype
    codes = data.codes

We would need to handle not-default args for categories, ordered, and codes.

``` Timer unit: 1e-06 s Total time: 0.02405 s File: /Users/taugspurger/sandbox/pandas/pandas/core/arrays/categorical.py Function: __init__ at line 329 Line # Hits Time Per Hit % Time Line Contents ============================================================== 329 def __init__(self, values, categories=None, ordered=None, dtype=None, 330 fastpath=False): 331 332 # Ways of specifying the dtype (prioritized ordered) 333 # 1. dtype is a CategoricalDtype 334 # a.) with known categories, use dtype.categories 335 # b.) else with Categorical values, use values.dtype 336 # c.) else, infer from values 337 # d.) specifying dtype=CategoricalDtype and categories is an error 338 # 2. dtype is a string 'category' 339 # a.) use categories, ordered 340 # b.) use values.dtype 341 # c.) infer from values 342 # 3. dtype is None 343 # a.) use categories, ordered 344 # b.) use values.dtype 345 # c.) infer from values 346 1 2.0 2.0 0.0 if dtype is not None: 347 # The dtype argument takes precedence over values.dtype (if any) 348 if isinstance(dtype, compat.string_types): 349 if dtype == 'category': 350 dtype = CategoricalDtype(categories, ordered) 351 else: 352 msg = "Unknown `dtype` {dtype}" 353 raise ValueError(msg.format(dtype=dtype)) 354 elif categories is not None or ordered is not None: 355 raise ValueError("Cannot specify both `dtype` and `categories`" 356 " or `ordered`.") 357 358 categories = dtype.categories 359 360 1 13.0 13.0 0.1 elif is_categorical(values): 361 # If no "dtype" was passed, use the one from "values", but honor 362 # the "ordered" and "categories" arguments 363 1 5.0 5.0 0.0 dtype = values.dtype._from_categorical_dtype(values.dtype, 364 1 5.0 5.0 0.0 categories, ordered) 365 else: 366 # If dtype=None and values is not categorical, create a new dtype 367 dtype = CategoricalDtype(categories, ordered) 368 369 # At this point, dtype is always a CategoricalDtype 370 # if dtype.categories is None, we are inferring 371 372 1 1.0 1.0 0.0 if fastpath: 373 self._codes = coerce_indexer_dtype(values, categories) 374 self._dtype = self._dtype.update_dtype(dtype) 375 return 376 377 # null_mask indicates missing values we want to exclude from inference. 378 # This means: only missing values in list-likes (not arrays/ndframes). 379 1 10.0 10.0 0.0 null_mask = np.array(False) 380 381 # sanitize input 382 1 8.0 8.0 0.0 if is_categorical_dtype(values): 383 1 5.0 5.0 0.0 if dtype.categories is None: 384 dtype = CategoricalDtype(values.categories, dtype.ordered) 385 386 elif not isinstance(values, (ABCIndexClass, ABCSeries)): 387 # _sanitize_array coerces np.nan to a string under certain versions 388 # of numpy 389 values = maybe_infer_to_datetimelike(values, convert_dates=True) 390 if not isinstance(values, np.ndarray): 391 values = _convert_to_list_like(values) 392 from pandas.core.series import _sanitize_array 393 # By convention, empty lists result in object dtype: 394 if len(values) == 0: 395 sanitize_dtype = 'object' 396 else: 397 sanitize_dtype = None 398 null_mask = isna(values) 399 if null_mask.any(): 400 values = [values[idx] for idx in np.where(~null_mask)[0]] 401 values = _sanitize_array(values, None, dtype=sanitize_dtype) 402 403 1 2.0 2.0 0.0 if dtype.categories is None: 404 try: 405 codes, categories = factorize(values, sort=True) 406 except TypeError: 407 codes, categories = factorize(values, sort=False) 408 if dtype.ordered: 409 # raise, as we don't have a sortable data structure and so 410 # the user should give us one by specifying categories 411 raise TypeError("'values' is not ordered, please " 412 "explicitly specify the categories order " 413 "by passing in a categories argument.") 414 except ValueError: 415 416 # FIXME 417 raise NotImplementedError("> 1 ndim Categorical are not " 418 "supported at this time") 419 420 # we're inferring from values 421 dtype = CategoricalDtype(categories, dtype.ordered) 422 423 1 7.0 7.0 0.0 elif is_categorical_dtype(values): 424 1 394.0 394.0 1.6 old_codes = (values.cat.codes if isinstance(values, ABCSeries) 425 else values.codes) 426 1 5.0 5.0 0.0 codes = _recode_for_categories(old_codes, values.dtype.categories, 427 1 23530.0 23530.0 97.8 dtype.categories) 428 429 else: 430 codes = _get_codes_for_values(values, dtype.categories) 431 432 1 27.0 27.0 0.1 if null_mask.any(): 433 # Reinsert -1 placeholders for previously removed missing values 434 full_codes = - np.ones(null_mask.shape, dtype=codes.dtype) 435 full_codes[~null_mask] = codes 436 codes = full_codes 437 438 1 27.0 27.0 0.1 self._dtype = self._dtype.update_dtype(dtype) 439 1 9.0 9.0 0.0 self._codes = coerce_indexer_dtype(codes, dtype.categories) ```

@colinfang can you post your pandas version?

colinfang commented 5 years ago

I am using v0.23.4. The performance for v0.20.3 is good.

TomAugspurger commented 5 years ago

Can you see if my suggestion would fix it? I think it may need to happen even before https://github.com/pandas-dev/pandas/blob/029d57cb828b053d112ed4e23f446ee07d147935/pandas/core/arrays/categorical.py#L331, because we need to know whether the user passed None for all the optional parameters.

On Tue, Nov 20, 2018 at 6:30 AM colinfang notifications@github.com wrote:

I am using v0.23.4. The performance for v0.20.3 is good.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/23814#issuecomment-440256954, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHInyzbvgbkKtMKetkzNMTK_Uvb3AMks5uw_X2gaJpZM4Yq9mh .

eoveson commented 5 years ago

@TomAugspurger / @colinfang , I tried inserting code similar to Tom's suggestion, near the beginning of Categorical.init, and it appears it does indeed fix the perf issue. I also noticed there is a "fastpath" in that method also, not sure if that is applicable in this case.

I'd be interested in working on a fix for this issue, unless you were wanting to @colinfang. I understand there would need to be some more investigation about where exactly to do this short-circuiting, and handling non-default args. Also, are there perf tests that can be used for this type of thing (those are probably tricky since it may be machine dependent, not sure if there is a good solution for that)?

if isinstance(values, (ABCSeries, ABCIndexClass)):
    values = values._values
if isinstance(values, type(self)):
    self._dtype = values.dtype
    self._codes = values.codes
    return
In [1]: import pandas as pd

In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')

In [3]: %timeit s == 'a'
3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
TomAugspurger commented 5 years ago

I'd say go ahead and work on this.

You'd need to add an ASV in asv_bench/benchmarks/categoricals.py for timing Categorical(categorical).

On Tue, Nov 20, 2018 at 5:09 PM Erik notifications@github.com wrote:

@TomAugspurger https://github.com/TomAugspurger / @colinfang https://github.com/colinfang , I tried inserting code similar to Tom's suggestion, near the beginning of Categorical.init, and it appears it does indeed fix the perf issue. I also noticed there is a "fastpath" in that method also, not sure if that is applicable in this case.

I'd be interested in working on a fix for this issue, unless you were wanting to @colinfang https://github.com/colinfang. I understand there would need to be some more investigation about where exactly to do this short-circuiting, and handling non-default args. Also, are there perf tests that can be used for this type of thing (those are probably tricky since it may be machine dependent, not sure if there is a good solution for that)?

if isinstance(values, (ABCSeries, ABCIndexClass)): values = values._valuesif isinstance(values, type(self)): self._dtype = values.dtype self._codes = values.codes return

In [1]: import pandas as pd

In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')

In [3]: %timeit s == 'a'3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/23814#issuecomment-440461928, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIofWioeIOoIYkS8aCOjaEAU0tKVKks5uxIu9gaJpZM4Yq9mh .

eoveson commented 5 years ago

One option I saw may be to short-circuit within _recode_for_categories() (not sure if this would help in any other scenarios also). It may be safer, since all of the other logic for the other parameters runs as it did before? In _recode_for_categories(), I tried adding the following code:

if new_categories.equals(old_categories):
    # Same categories, so no need to actually recode
    return codes.copy()

Which yielded:

In [3]: %timeit s == 'a'
7.29 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.14 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If I instead add the following code basically at the beginning of init() (the approach first discussed):

if categories is None and ordered is None and dtype is None:
    if isinstance(values, (ABCSeries, ABCIndexClass)):
        values = values._values
    if isinstance(values, type(self)):
        self._dtype = values.dtype
        self._codes = values.codes.copy()
        return

This yielded:

In [3]: %timeit s == 'a'
5.38 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, not a huge difference between the two approaches. Also, I'm not sure whether I need to copy() the codes as I have in the above snippets. @TomAugspurger any thoughts? I can just send a PR and we can continue to discuss there if you prefer..

TomAugspurger commented 5 years ago

I think the change to Categorical.__init__ is clearly good.

The downside to adding a .equals check to _recode_for_categories is that you always have to pay the cost to check .equals, even if they aren't actually equal. I suspect that in many cases where you'd end up taking the shortcut in _recode_for_categories, you'll already know whether on not the categories are equal or not. Still, there may be some cases where you can't know that, so it may be OK to just check that.

eoveson commented 5 years ago

Sounds good, I will go with the init change.