Open vyasr opened 1 year ago
I think there may be places that we call these utilities on things like Series objects. We could replace those instances with calling on the .dtype
attribute instead.
Edit: apologies, I jumped the gun on responding :)
Related: #5695
Is your feature request related to a problem? Please describe. The use of type utilities in
cudf.api.types
is widespread throughout cudf. Some of this usage is problematic and should be removed -- in many places, we would be better off leveraging the columnar type hierarchy in cudf to implement additional functionality as methods of the appropriate subclass ofColumnBase
. However, there are also many places where the usage of these utility functions is unavoidable, especially in functions likeastype
orbuild_column
that need to dispatch based on an input dtype. The problem is that these methods are slow, especially when compared to the equivalentisinstance
checks for primitive types. For instance,pd.api.types.is_numeric_dtype
takes ~1.2 us as compared to ~50-80 ns forisinstance(val, int)
. Although these absolute numbers are small, that factor of 10-20 is significant because many common cudf functions can call these functions dozens of times (consider e.g. merges or groupbys). We would benefit from reducing this overhead as much as possible.Describe the solution you'd like These functions should all wrap their logic in a cache to bypass the calls on commonly reused types. Here is a simple example:
Note the importance of the wrapper function to support calling these utilities with any object that exposes a
dtype
attribute such as a Series or a numpy/cupy array (we obviously don't want to cache those objects directly). We may need additional such logic in certain specific cases (perhaps categoricals or nested types). Nevertheless, as shown above using a cache could reduce the cost of the APIs of a factor of ~3 on average.Describe alternatives you've considered A more natural alternative that I considered was leveraging
functools.singledispatch
. Unfortunately, we are largely foiled in this attempt by the fact that the different possible "dtype-like" objects do not fit into simple class delinations, which is how dispatch is performed. In the worst case, all dtypes are specified with strings like "int" or "float", all of which end up dispatched to the same type.Additional context In theory this could be abused to result in some ridiculous caching, e.g. if a user calls these methods with some arbitrary object that isn't a valid dtype-like object and ends up being stored in the cache. To prevent the worst excesses we should limit the cache size.