ml31415 / numpy-groupies

Optimised tools for group-indexing operations: aggregated sum and more
BSD 2-Clause "Simplified" License

(Potentially) faster-than-Pandas pd.factorize, demo code for discussion #37

Closed · ianozsvald closed this issue 2 years ago

ianozsvald commented 3 years ago

Hello. Thanks for your library; it has been on my radar for a while. I just spoke about it at the NDR.ai and DevDays conferences in my "Faster Pandas" talk. I'm the co-author of O'Reilly's High Performance Python book (the 2nd edition contains Pandas tips) and I teach a public course on this topic. Getting numeric Python code to run faster is a big deal for me.

I was playing with your aggregate function using Numba and it works nicely in my limited testing. I realised that the need to make categories up-front is a bit of a blocker, so I've done some digging. NumPy's unique function sorts the data before building the key set, so it is slow (I have no idea why it needs the sort). Pandas' factorize method doesn't sort and is very fast. With a bit of messing around I came up with a faster solution for some data distributions which, possibly, could be added to your library (or maybe just left here).
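To make that sorting difference concrete, here is a tiny illustration (a minimal example of my own, separate from the benchmark below):

import numpy as np
import pandas as pd

arr = np.array([3, 1, 3, 2])
np.unique(arr, return_inverse=True)  # (array([1, 2, 3]), array([2, 0, 2, 1])) -- keys come back sorted
pd.factorize(arr)                    # (array([0, 1, 0, 2]), array([3, 1, 2])) -- keys in first-seen order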

I'm including a code sample for discussion, if you're interested.

ser = pd.Series(np.random.randint(low=0, high=2, size=10_000_000)) # e.g. 1, 0, 0, 1, 1, 1, 0, ...

# first try Pandas' factorize
%%time
pd.factorize(ser)
Wall time: 1.43 s
(array([0, 0, 1, ..., 1, 0, 1]), Int64Index([1, 0], dtype='int64'))

# next try my SLOW first (easy) implementation
%%time
print(ser.shape)
make_categories_v1(ser.to_numpy())
Wall time: 4.6 s
(array([0, 0, 1, ..., 1, 0, 1]), DictType[int64,int64]<iv=None>({1: 0, 0: 1}))

# fast (but not as fast as pandas) alternative for this random dataset
%%time
print(ser.shape)
make_categories_v3(ser.to_numpy())
Wall time: 1.67 s
(array([0, 0, 1, ..., 1, 0, 1], dtype=int8),
 DictType[int8,int8]<iv=None>({1: 0, 0: 1}))

# Using straight runs of the same data
ser = pd.Series(np.repeat([0, 1], 100_000_000 // 2)) # e.g. 0, 0, ..., 1, 1, ..., 1

# My numba'd solution is faster under certain data conditions, i.e. when there are long-enough runs of the same value
%%time
print(ser.shape)
make_categories_v3(ser.to_numpy())
Wall time: 208 ms
(array([0, 0, 0, ..., 1, 1, 1], dtype=int8),
 DictType[int8,int8]<iv=None>({0: 0, 1: 1}))

%%time
print(ser.shape)
pd.factorize(ser)
Wall time: 1.44 s
(array([0, 0, 0, ..., 1, 1, 1]), Int64Index([0, 1], dtype='int64'))

Code:

import numpy as np
import pandas as pd

from numba import njit
from numba.typed import Dict
from numba.core import types

@njit(nogil=True)
def make_categories_v1(arr):
    """Simple implementation, category_map of codes to arr values"""
    top_code = 0
    #uniques[key:code]
    # NOTE hardcoded to int64 keys and values
    # as I can't figure out how to use arr.dtype
    category_map = Dict.empty(key_type=types.int64, value_type=types.int64)
    inverse = np.empty(arr.shape, dtype=np.int64) # uninitialised block of memory
    for idx, val in enumerate(arr):
        if val not in category_map:
            code = top_code
            category_map[val] = top_code
            top_code += 1
        else:
            code = category_map[val]
        inverse[idx] = code
    return inverse, category_map

@njit(nogil=True)
def make_categories_v3(arr):
    """Keep using last value to avoid lookups if possible - fastest so far"""
    arr_type = types.int8
    code_type = types.int8
    category_map = Dict.empty(key_type=arr_type, value_type=code_type)
    inverse = np.empty(arr.shape, dtype=code_type) 
    # remember the first (and running-previous) value seen
    # so we can skip the Dict lookup entirely while we
    # keep seeing the same value; seed both with the 0th item
    last_code = 0
    last_val = arr[0]
    category_map[last_val] = last_code
    inverse[0] = last_code
    # codes for subsequent new values count up from 1
    top_code = 1
    for idx, val in enumerate(arr[1:]):
        if val == last_val:
            code = last_code
        else:
            try:
                code = category_map[val]
            except:
                # note naked except required by numba!
                # ideally we'd only except a KeyError
                # https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#pysupported-exception-handling
                code = top_code
                category_map[val] = top_code
                top_code += 1
            last_code = code
            last_val = val
        inverse[idx+1] = code
    return inverse, category_map
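
For context, here is how the codes could feed straight into numpy_groupies (a hypothetical wiring-up, assuming the usual npg.aggregate signature; the toy data is int8 because v3 hardcodes int8 keys):

import numpy_groupies as npg

arr = np.repeat(np.array([0, 1], dtype=np.int8), 5)
codes, category_map = make_categories_v3(arr)
npg.aggregate(codes, np.ones(arr.shape), func="sum")  # array([5., 5.])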

Thoughts?

I think, at the least, advice on the README telling Pandas users to prefer pd.factorize, or, if they've already used a category encoding, to use ser.cat.codes (as the codes have been pre-computed), would be useful.

If you haven't looked behind a Pandas category then this slide shows the .cat attribute: https://speakerdeck.com/ianozsvald/ndr-2021?slide=14
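
As a sketch of that ser.cat.codes advice (my own toy example, assuming the usual npg.aggregate signature):

import numpy as np
import pandas as pd
import numpy_groupies as npg

ser = pd.Series(["old", "new", "new", "old"]).astype("category")
group_idx = ser.cat.codes.to_numpy()  # codes were already computed by the category dtype
values = np.array([10.0, 1.0, 2.0, 20.0])
npg.aggregate(group_idx, values, func="sum")  # array([ 3., 30.]) i.e. the "new" total, then the "old" total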

The dataset for that talk is the UK Government House Price dataset, 25M rows of house sales since 1995, and the distribution of "is new vs is old" is very clumpy, so my v3 algorithm above outperforms factorize. Maybe that helps other folk with similarly clumpy data.

ml31415 commented 3 years ago

Hi Ian! Thanks for taking an interest in our little library. I just had a look at your presentation. I'd like to add that in all the cases where the keys are already not-excessively-large positive integers, e.g. ASCII chars, you can usually skip the factorization completely. Instead, just cast them to int and use them directly as the group index. The group index doesn't need to contain every number from 0 to x.
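
A small sketch of that shortcut (my illustration, assuming npg.aggregate's usual signature):

import numpy as np
import numpy_groupies as npg

keys = np.frombuffer(b"abcaab", dtype=np.uint8)  # ascii codes: 97, 98, 99, 97, 97, 98
values = np.array([1, 2, 3, 4, 5, 6])
sums = npg.aggregate(keys, values, func="sum")
# sums[97] == 10, sums[98] == 8, sums[99] == 3; the unused
# slots below 97 just hold the fill value (0 for sum)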