matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Support the hashing trick as an encoding strategy for categorical features #164

Closed by rishi-kulkarni 6 months ago

rishi-kulkarni commented 7 months ago

For applications in sequential learning, it would be nice to have a stateful transform that doesn't presume every level of the categoricals is seen in the first chunk of training data. sklearn's approach to this is its FeatureHasher transformer, which I think would make sense here, too.
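
For concreteness, here is a minimal sketch of the failure mode (made-up column names; exactly how the unseen level fails, by raising or by mis-encoding, depends on the formulaic version):

import pandas as pd

from formulaic import model_matrix

chunk1 = pd.DataFrame({"x": ["a", "b", "a"]})
chunk2 = pd.DataFrame({"x": ["a", "c"]})  # "c" was never seen during fitting

# The spec fitted on chunk1 only knows the levels "a" and "b", so no column
# exists that can represent "c" when the spec is reused on later chunks.
spec = model_matrix("C(x)", chunk1).model_spec
mm2 = spec.get_model_matrix(chunk2)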

The hashing trick would fit cleanly into the existing C() API, potentially taking arguments for a hash seed and a total_features count, or similar. I'd be happy to contribute a PR for this if you think it makes sense to include in formulaic.

matthewwardrop commented 7 months ago

Hi @rishi-kulkarni . Thanks for reaching out.

This does sound like a potentially useful feature. If I understand correctly, the only change is that instead of dummy encoding the observed levels, we would generate one column per hash bucket and then pass those buckets through the standard contrast codings. We could accomplish that today using: C(hash(x) % 10, levels=range(10)) or something similar.
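
For example (a rough sketch with made-up data; note that Python's built-in hash() salts str hashes per process, so these buckets would not be reproducible across runs):

import pandas as pd

df = pd.DataFrame({"x": ["apple", "banana", "cherry", "apple"]})

# Each level lands in one of 10 buckets; the bucket ids form a fixed,
# fully known level set that C() can then contrast-code as usual.
buckets = df["x"].map(hash) % 10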

I think my preferred implementation approach is to create a new transform (maybe something like hashed(x, contr.treatment, levels=10, seed=0)) that wraps the C() call above, rather than making the C() transform itself more complex. This also means that hashed() doesn't need to be stateful.

I am willing to review any PRs to this end, and to provide any helpful guidance. I'm also happy to just implement it directly, if that would be more helpful to you.

rishi-kulkarni commented 7 months ago

I can take a crack at it this weekend if you don't have time; that approach makes sense. I presume you don't want to take on any dependencies for this, but Python's built-in hash functions are not very performant for this use case (sklearn's hashing trick implementation, for example, has its own Cython extension using MurmurHash). Do you think it would be reasonable to provide an optional hash_func argument so users can bring their own?
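
For instance, a sketch of what a user-supplied function could look like, assuming the proposed hash_func argument (murmurhash3_32 is sklearn's MurmurHash3 binding):

from sklearn.utils import murmurhash3_32

def murmur_to_int(value) -> int:
    # positive=True keeps the result non-negative, so `% levels` buckets cleanly.
    return murmurhash3_32(str(value), seed=0, positive=True)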

matthewwardrop commented 7 months ago

I'm not going to get to it before this weekend :). And yes, I think it is more than reasonable to allow users to specify a hash function.

rishi-kulkarni commented 7 months ago

Just wanted to check before I get too far into it: this minimal implementation seems to be working well. Anything obviously wrong?


from hashlib import md5
from typing import Any, Callable, Dict, Hashable, List

import numpy as np

from formulaic.materializers.types import FactorValues

from .contrasts import encode_contrasts

def md5_to_int(s: str) -> int:
    # Unlike the built-in hash(), md5 is stable across interpreter sessions
    # (str hashing is salted per process via PYTHONHASHSEED).
    return int(md5(s.encode()).hexdigest(), 16)

def hashed(
    data: Any,
    *,
    hash_func: Callable[[Hashable], int] = md5_to_int,
    levels: int = 2**20,
    spans_intercept: bool = False,
):
    def encoder(
        values: Any,
        reduced_rank: bool,
        drop_rows: List[int],
        encoder_state: Dict[str, Any],
        model_spec,
    ):
        # Bucket each value with the user-supplied hash function.
        values = np.array([hash_func(v) % levels for v in values])
        return encode_contrasts(
            values,
            contrasts=None,
            levels=np.arange(levels),
            reduced_rank=reduced_rank,
            _state=encoder_state,
            _spec=model_spec,
        )

    return FactorValues(
        data,
        kind="categorical",
        spans_intercept=spans_intercept,
        encoder=encoder,
    )
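
And a hypothetical usage sketch (made-up data; this assumes hashed is visible to the formula evaluator, e.g. via model_matrix's context argument picking up local names):

import pandas as pd

from formulaic import model_matrix

df = pd.DataFrame({"city": ["london", "paris", "tokyo"], "y": [1.0, 2.0, 3.0]})
mm = model_matrix("y ~ hashed(city, levels=16)", df, context=0)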

matthewwardrop commented 6 months ago

Hi @rishi-kulkarni!

This looks great. Here are a few thoughts:

Will review any such PR when it is ready :).

Thanks again!

rishi-kulkarni commented 6 months ago

Sure, I opened a WIP PR here: https://github.com/matthewwardrop/formulaic/pull/168