matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
313 stars 21 forks source link

[FEAT] Adds a transform to perform the hashing trick #168

Closed rishi-kulkarni closed 6 months ago

rishi-kulkarni commented 6 months ago

I went with the approach of not making hashed a stateful transform and just passing an empty dict to encode_contrasts. Is there any issue with that approach?

What sort of tests would you want to see here? Given the very pluggable nature of formulaic here, this feature ended up being very little code.

Closes #164

matthewwardrop commented 6 months ago

What sort of tests would you want to see here? Given the very pluggable nature of formulaic here, this feature ended up being very little code.

Aye... I'd be content with 100% coverage here. There are pretty few moving parts, and so long as every line is tested, and we also add a test for series of non-string types, I'm happy with this as is.

rishi-kulkarni commented 6 months ago

@matthewwardrop Thanks for the comments, they should now be addressed. I went with casting to numpy.str_ since that seemed to have some performance benefits for very large numbers of rows.

matthewwardrop commented 6 months ago

I made one tiny further change to the docstring that I left in the review, but somehow it didn't propagate through. I'll merge this in now. Thank you so much for your contribution!

rishi-kulkarni commented 6 months ago

Hmm, looks like the usedforsecurity flag wasn't added until 3.9. I assume something like:

def md5_to_int(s: str) -> int:
    # Check the Python version
    if sys.version_info >= (3, 9):
        return int(md5(s.encode(), usedforsecurity=False).hexdigest(), 16)
    else:
        return int(md5(s.encode()).hexdigest(), 16)

should work

matthewwardrop commented 6 months ago

Forgot to run the tests! Anyway, I've patched these and some other issues on main. Thanks again for you contribution.