modelop / hadrian

Implementations of the Portable Format for Analytics (PFA)
Apache License 2.0
130 stars 49 forks source link

Bitwise operations, hashing #22

Closed nrpeterson closed 7 years ago

nrpeterson commented 7 years ago

Hey folks,

A large number of machine learning applications require hashing methods. As far as I can tell, there are no hashing algorithms built into PFA; but, I'm more than happy to handle that on my end, as a user-defined function.

Hashing methods, however, rely heavily on bit-based operations. The PFA spec already includes bitwise logic operations (&, |, ^, ~); but, it doesn't contain shift operators.

So, I guess my question is: 1) Are there any plans to add primitives for common hashing algorithms? 2) If not, are there any plans to add left and right bit shift operators?

Thanks!

jpivarski commented 7 years ago

Considering that PFA has explicitly sized integers, shift operators in PFA would be more reasonable than having them in Python (which it does). I think the reason that PFA doesn't have the shifts is because they're equivalent to multiplication and division by powers of 2, so it's just a performance thing. Performance things are supposed to be discovered and implemented by the PFA execution engine.

It would be more in line with PFA's focus to add common hashing algorithms as functions, rather than shift operators, so that work-alikes can be optimized in different vendors' implementations. This is the level of abstraction where PFA is supposed to operate.

New functions get decided upon in the DMG PFA working group; do you want to make a proposal? I think the membership is open, so you could join us and make your suggestion in person, or if you want to give it to me, I'll bring it up. I'd need a specification, though: the set of hashing functions you think PFA should have.

nrpeterson commented 7 years ago

Thanks, @jpivarski ! I'd be happy to discuss the possibility of contributing; it would just depend on the level of obligation, of course.

As for the issue at hand: the big one that I see a need for is MurmurHash3, as this is used both by scikit-learn and by the Spark ML libraries. I can do a little digging to see if there are any other commonly used ones.

MLnick commented 7 years ago

Hi there - I would like to see MurmurHash3 added as the first built-in hash function if possible. @jpivarski let me know what the procedure is to do that via the Working Group?

jpivarski commented 7 years ago

Attend the next PFA working group (ask Walt Wells, walt@dmg.org for the time) and make the proposal there. The approval process will go faster if you include an implementation that everyone can view.

Thanks, Jim

On Jun 7, 2017 4:43 AM, "Nick Pentreath" notifications@github.com wrote:

Hi there - I would like to see MurmurHash3 added as the first built-in hash function if possible. @jpivarski https://github.com/jpivarski let me know what the procedure is to do that via the Working Group?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/opendatagroup/hadrian/issues/22#issuecomment-306745340, or mute the thread https://github.com/notifications/unsubscribe-auth/ABxEHwDrgV62S0tbUAvJmsTiO9L4pVGSks5sBnC3gaJpZM4KFH6j .

MLnick commented 7 years ago

Ok - by implementation do you just mean a reference impl? The Scala std lib has one (and most other languages have a standard impl).

jpivarski commented 7 years ago

The implementation can involve dependencies, but it has to be well-described and straightforward enough to reimplement on any platform, any language. This is why Hadrian and Titus are using external libraries to implement POSIX regular expressions, rather than Java regular expressions in Java and Python regular expressions in Python. The hash algorithm has to be exactly reproducible, and it might be easier to do that with a reimplementation (so that the code can be ported) or not.

Jim

On Jun 7, 2017 5:16 AM, "Nick Pentreath" notifications@github.com wrote:

Ok - by implementation do you just mean a reference impl? The Scala std lib has one (and most other languages have a standard impl).

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/opendatagroup/hadrian/issues/22#issuecomment-306752784, or mute the thread https://github.com/notifications/unsubscribe-auth/ABxEH6aX--5xM_cUZbUWeGhcNo4PSrYyks5sBnhkgaJpZM4KFH6j .

MLnick commented 7 years ago

@nrpeterson did you ever implement a version in PFA (if so could you share it perhaps?)