kite-sdk / kite

Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0
394 stars 263 forks source link

Morphlines hashDigest command #443

Closed tmgstevens closed 8 years ago

tmgstevens commented 8 years ago

Initial commit for hashCommand implementation and tests

tmgstevens commented 8 years ago

Link to https://issues.cloudera.org/browse/KITE-1122

whoschek commented 8 years ago

What's the probability of hash collision (two different messages returning the same id) with the MessageDigests vs. GenerateUUID impl? How much less likely is a collision with SHA-256 than with that we have already today?

tmgstevens commented 8 years ago

Regarding comparison vs GenerateUUID - the latter is guaranteed to be unique regardless of the input document - i.e. running the same document through a billion times will give a billion UUIDs.

This new function will always return the same result for the same input - hence its desirability.

Whether a collision is likely depends on the algorithm chosen and the number of inputs, but casually speaking is close to zero :-)

tmgstevens commented 8 years ago

Thanks for reviewing @whoschek - I've updated based on your feedback.

whoschek commented 8 years ago

Also, let's rename the name of the test case class to HashDigestTest in order to reflect the "hashDigest" name.

whoschek commented 8 years ago

Thanks again for the great contrib! I think we're almost there. After the final minor outstanding changes please squash the multiple commits into a single big commit, and then we're ready to merge.

whoschek commented 8 years ago

Also would be good to add corresponding docs in kite-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence but that can be done in a separate commit if you like.

whoschek commented 8 years ago

To read the html output that is generated from the confluence docs, or for debugging: cd kite; mvn post-site -pl kite-morphlines; open kite-morphlines/target/site/index.html

whoschek commented 8 years ago

Also, for better perf consider using UTF16 instead of UTF8 for string conversion.

tmgstevens commented 8 years ago

Regarding using UTF16 rather than UTF8 - this would give different results as the input byte array would be different. I think in general people would expect UTF8 (for example http://www.miraclesalad.com/webtools/md5.php) however actually we can make that configurable fairly simply. I'll set the default to UTF-8 but add a parameter called charset as per readCSV command.

tmgstevens commented 8 years ago

Thanks for reviewing @whoschek - much appreciated.

tmgstevens commented 8 years ago

Updated the docs onto the same commit. Let me know if there's any further changes required.

whoschek commented 8 years ago

Hi Tristan. I merged this. Thanks for this great contrib!

tmgstevens commented 8 years ago

Thanks Wolfgang for your feedback and swift turnaround. Tristan

Tristan Stevens Senior Solutions Architect Cloudera, Inc. | www.cloudera.com

m +44(0)7808 986422 |

On 27 Apr 2016 8:18 p.m., "Wolfgang Hoschek" notifications@github.com wrote:

Hi Tristan. I merged this. Thanks for this great contrib!

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/kite-sdk/kite/pull/443#issuecomment-215198599