rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Cache the outputs of type utilities #12494

Open vyasr opened 1 year ago

vyasr commented 1 year ago

Is your feature request related to a problem? Please describe.

The use of type utilities in cudf.api.types is widespread throughout cudf. Some of this usage is problematic and should be removed: in many places we would be better off leveraging the columnar type hierarchy in cudf, implementing additional functionality as methods on the appropriate subclass of ColumnBase. However, there are also many places where these utility functions are unavoidable, especially in functions like astype or build_column that must dispatch based on an input dtype.

The problem is that these methods are slow, especially compared to the equivalent isinstance checks for primitive types. For instance, pd.api.types.is_numeric_dtype takes ~1.2 µs, compared to ~50-80 ns for isinstance(val, int). Although these absolute numbers are small, the factor of 10-20 is significant because many common cudf functions (consider e.g. merges or groupbys) can call these utilities dozens of times. We would benefit from reducing this overhead as much as possible.

Describe the solution you'd like

These functions should all wrap their logic in a cache to bypass repeated calls on commonly reused types. Here is a simple example:

In [48]: from functools import lru_cache

In [49]: import pandas as pd

In [50]: @lru_cache
    ...: def is_numeric_type_cached(dtype):
    ...:     return pd.api.types.is_numeric_dtype(dtype)
    ...: 

In [51]: def is_numeric_type(arr_or_dtype):
    ...:     if dt := getattr(arr_or_dtype, "dtype", None):
    ...:         arr_or_dtype = dt
    ...:     return is_numeric_type_cached(arr_or_dtype)
    ...: 

In [52]: s = pd.Series([1])

In [53]: %timeit pd.api.types.is_numeric_dtype(s)
1.88 µs ± 7.23 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [54]: %timeit pd.api.types.is_numeric_dtype(s.dtype)
1.14 µs ± 7.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [55]: %timeit is_numeric_type(s.dtype)
471 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [56]: %timeit is_numeric_type(s)
466 ns ± 1.98 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Note the importance of the wrapper function: it lets these utilities be called with any object that exposes a dtype attribute, such as a Series or a numpy/cupy array (we obviously don't want to cache those objects directly). We may need additional such logic in certain specific cases (perhaps categoricals or nested types). Nevertheless, as shown above, using a cache could reduce the cost of these APIs by a factor of ~3 on average.
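Putting the pieces above together, one possible shape for the wrapper is sketched below. The fallback path for unhashable inputs and the maxsize of 128 are assumptions for illustration, not measured requirements; real pandas/numpy dtype objects are hashable, so the fallback only matters for unusual inputs.

```python
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=128)  # bound is an assumption; tune to the dtypes seen in practice
def _is_numeric_dtype_cached(dtype):
    return pd.api.types.is_numeric_dtype(dtype)


def is_numeric_dtype(arr_or_dtype):
    # Accept anything exposing a .dtype attribute (Series, numpy/cupy arrays)
    # and cache only the extracted dtype, never the container object itself.
    if (dt := getattr(arr_or_dtype, "dtype", None)) is not None:
        arr_or_dtype = dt
    try:
        return _is_numeric_dtype_cached(arr_or_dtype)
    except TypeError:
        # Hypothetical edge case: an unhashable dtype-like object
        # simply bypasses the cache rather than raising.
        return pd.api.types.is_numeric_dtype(arr_or_dtype)
```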

Describe alternatives you've considered

A more natural alternative that I considered was leveraging functools.singledispatch. Unfortunately, we are largely foiled by the fact that the various possible "dtype-like" objects do not fit into simple class delineations, which is how dispatch is performed. In the worst case, all dtypes are specified with strings like "int" or "float", all of which end up dispatched to the same type.
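A small sketch of the dispatch failure (the describe_dtype function is hypothetical, purely to demonstrate the mechanism):

```python
from functools import singledispatch

import numpy as np


@singledispatch
def describe_dtype(dtype):
    # Default implementation: report only the runtime class.
    return f"fallback: {type(dtype).__name__}"


@describe_dtype.register
def _(dtype: np.dtype):
    return "numpy dtype"


# Class-based dispatch works for actual dtype instances...
describe_dtype(np.dtype("int64"))  # dispatches to the np.dtype overload

# ...but every string alias dispatches to the same str implementation,
# so "int", "float", "category", etc. are indistinguishable at dispatch time.
describe_dtype("int")
describe_dtype("float")
```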

Additional context

In theory this could be abused to produce some ridiculous caching, e.g. if a user calls these methods with arbitrary objects that aren't valid dtype-like objects and end up being stored in the cache. To prevent the worst excesses we should limit the cache size.
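The maxsize argument to functools.lru_cache already gives us this bound: once the cache is full, the least recently used entry is evicted, so memory stays capped no matter how many junk keys arrive. A toy illustration (the check function and the bound of 2 are purely illustrative):

```python
from functools import lru_cache


@lru_cache(maxsize=2)
def check(dtype):
    return str(dtype)


# Three distinct keys against a cache capped at two entries:
check("int")
check("float")
check("object")  # evicts the least recently used entry ("int")

# cache_info() confirms the cache never grows past maxsize.
info = check.cache_info()
```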

shwina commented 1 year ago

I think there may be places that we call these utilities on things like Series objects. We could replace those instances with calling on the .dtype attribute instead.

Edit: apologies, I jumped the gun on responding :)

vyasr commented 6 months ago

Related: #5695