Closed: rishi-kulkarni closed this 6 months ago
Hi @rishi-kulkarni . Thanks for reaching out.
This does sound like a potentially useful feature. If I understand correctly, the only change would be that instead of dummy encoding we generate one column per hash bucket, and then pass it through the standard contrast codings. We could accomplish that today using `C(hash(x) % 10, levels=range(10))` or something similar.
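To make the semantics of that workaround concrete, here is a plain-numpy sketch of what `C(hash(x) % 10, levels=range(10))` computes: each value is hashed, bucketed into one of 10 levels, and the bucket index is then encoded like any other categorical (a full-rank dummy encoding is used here for simplicity; the variable names are illustrative, not formulaic's API).

```python
import numpy as np

levels = 10
x = ["red", "green", "blue", "green"]

# NOTE: Python's builtin hash() is salted per process for strings,
# so these bucket assignments are only stable within a single run.
buckets = [hash(v) % levels for v in x]

# One dummy column per bucket; repeated values land in the same column.
dummies = np.zeros((len(x), levels), dtype=int)
dummies[np.arange(len(x)), buckets] = 1
```

Note that collisions are expected by design: distinct categories can share a bucket, which is the usual trade-off of the hashing trick.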
I think my preferred implementation approach is to create a new transform (maybe something like `hashed(x, contr.treatment, levels=10, seed=0)`) that wraps the `C()` call above, rather than making the `C()` transform itself more complex. This also means that `hashed()` doesn't need to be stateful.
I am willing to review any PRs to this end, and to provide any helpful guidance. I'm also happy to just implement it directly, if that is more helpful to you.
I can take a crack at it this weekend if you don't have time. That approach makes sense. I presume you don't want to take on any dependencies for this, but the built-in hash functions in Python are not super performant for this use case (for example, sklearn's hashing-trick implementation has its own Cython extension using MurmurHash). Do you think it'd be reasonable to provide an optional `hash_func` argument so users can bring their own?
I'm not going to get to it before this weekend :). And yes, I think it is more than reasonable to allow users to specify a hash.
Just wanted to check before I got too into it - this minimal implementation seems to be working well. Anything obviously wrong?
```python
from hashlib import md5
from typing import Any, Callable, Dict, Hashable, List

import numpy as np

from formulaic.materializers.types import FactorValues

from .contrasts import encode_contrasts


def md5_to_int(s: str) -> int:
    # Deterministic across processes, unlike the builtin hash().
    return int(md5(s.encode()).hexdigest(), 16)


def hashed(
    data: Any,
    *,
    hash_func: Callable[[Hashable], int] = md5_to_int,
    levels: int = 2**20,
    spans_intercept: bool = False,
):
    def encoder(
        values: Any,
        reduced_rank: bool,
        drop_rows: List[int],
        encoder_state: Dict[str, Any],
        model_spec,
    ):
        # Bucket each value with the user-supplied hash, then defer to
        # the standard categorical contrast encoding.
        values = np.array([hash_func(v) % levels for v in values])
        return encode_contrasts(
            values,
            contrasts=None,
            levels=np.arange(levels),
            reduced_rank=reduced_rank,
            _state=encoder_state,
            _spec=model_spec,
        )

    return FactorValues(
        data,
        kind="categorical",
        spans_intercept=spans_intercept,
        encoder=encoder,
    )
```
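A quick standalone check of why a fixed hash like `md5_to_int` (from the snippet above) matters here: Python's builtin `hash()` is salted per process for strings, so only a deterministic hash keeps bucket assignments stable across training runs. This is illustrative only and doesn't touch formulaic itself.

```python
from hashlib import md5


def md5_to_int(s: str) -> int:
    # Same digest in every process, so the same value always lands
    # in the same bucket across runs.
    return int(md5(s.encode()).hexdigest(), 16)


levels = 2**20
bucket = md5_to_int("category_a") % levels  # stable across processes
```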
Hi @rishi-kulkarni!

This looks great. Here's a few thoughts:

- Consider accepting `contrasts` too. Even though the columns are arbitrary and less meaningful, it doesn't seem crazy that you might want to customise the contrasts. Could be off the wall here.
- Consider making `hashed` a stateful transform too, and pass the encoder state and spec through to the `encode_contrasts` method.
- `md5` is not the fastest hashing algorithm, but I did some quick tests and the performance was more than sufficient for even millions of rows. AND it doesn't introduce external deps; so let's roll with that.

Will review any such PR when it is ready :).
Thanks again!
Sure, I opened a WIP PR here: https://github.com/matthewwardrop/formulaic/pull/168
For applications in sequential learning, it would be nice to have a stateful transform that doesn't presume every level of a categorical is seen in the first chunk of training data. One approach used in sklearn is a FeatureHasher transformer, which I think would make sense here, too.
The hashing trick would fit in cleanly with the existing `C()` API - potentially taking arguments for a hash seed and `total_features` or something. I'd be happy to contribute a PR for this if you think it'd make sense to include in formulaic.
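To illustrate the sequential-learning point: with hashing, the feature space is fixed up front, so a category first seen in a later chunk still maps into the same columns with no refitting. A minimal sketch (names like `bucket` and `encode_chunk` are hypothetical helpers, not formulaic or sklearn API):

```python
from hashlib import md5

import numpy as np


def bucket(value: str, n_features: int = 8) -> int:
    # Deterministic bucket index for a string value.
    return int(md5(value.encode()).hexdigest(), 16) % n_features


def encode_chunk(values, n_features: int = 8) -> np.ndarray:
    # One-hot encode into a fixed-width column space; unseen levels
    # need no vocabulary update.
    out = np.zeros((len(values), n_features), dtype=int)
    for i, v in enumerate(values):
        out[i, bucket(v, n_features)] = 1
    return out


chunk1 = encode_chunk(["cat", "dog"])
chunk2 = encode_chunk(["ferret"])  # unseen level, same 8-column space
```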