nicoDs96 / Document-Similarity-using-Python-and-PySpark

Document Similarity with Apache Spark using Locality Sensitive Hashing and Python
https://nicods96.github.io/hi//fast-document-similarity-in-python/

Cryptographic hash function algorithm #1

Closed MTanasan closed 2 years ago

MTanasan commented 2 years ago

Hi, your work has been wonderful. Could you send me the article on the cryptographic hash function algorithm that you used to implement the project?

nicoDs96 commented 2 years ago

Hi, thank you very much, I am really happy you appreciate the project. About the crypto hash: as you can see in the LSH folder, in the HashFamily section, it is plain SHA-1. The definition of SHA-1 is a NIST publication. However, I am not sure I understood your question: do you actually need the SHA-1 paper, or do you need something about the applicability of cryptographic hashes to similarity search? If it is the latter, there are no constraints on the hash functions you can use. Cryptographic hashes are just slower but less likely to produce collisions, although collisions have been found for SHA-1.
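To make the point about "any hash function works" concrete, here is a minimal sketch of a SHA-1 based hash family built on Python's standard `hashlib`. It is not the code from the repository's HashFamily section; the name `make_sha1_hash` and the seed-as-salt scheme are assumptions made for illustration only.

```python
import hashlib


def make_sha1_hash(seed: int):
    """Return a hash function string -> int, salted with `seed`.

    Hypothetical helper: each seed gives a different member of the family.
    """
    salt = str(seed).encode("utf-8")

    def h(token: str) -> int:
        digest = hashlib.sha1(salt + token.encode("utf-8")).digest()
        # SHA-1 digests are 20 bytes; keep the first 8 as an unsigned integer.
        return int.from_bytes(digest[:8], "big")

    return h


h0 = make_sha1_hash(0)
h1 = make_sha1_hash(1)

print(h0("spark"), h1("spark"))
```

Swapping SHA-1 for a faster non-cryptographic hash would only require changing the body of `h`; the rest of a minhash pipeline would not need to know.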

MTanasan commented 2 years ago

I want to find out whether SHA-1 hashing preserves similarity in the new space.

nicoDs96 commented 2 years ago

@MTanasan OK, now I get it, thank you for your patience. There is a theorem stating that the probability that a minhash function gives the same value for two sets equals the Jaccard similarity of those sets, under certain conditions (or something like that; it has been a while since I touched this stuff, so don't blame me if I am not precise). The minhash technique is independent of the family of hash functions used: it is a general technique, and all that matters is that a hash function is used. I suggest you start from Wikipedia and follow its references to the articles. There is also a nice blog post about the topic here. Hope this is what you were looking for!
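As a rough illustration of the minhash property, the sketch below compares the exact Jaccard similarity of two small word sets with the fraction of matching minhash values. It uses salted SHA-1 from `hashlib` as the hash family; the function names, the seed scheme, and the choice of 256 hash functions are all assumptions for this example, not the repository's implementation. Note that the similarity is preserved by taking the minimum over each set, not by SHA-1 itself, which maps similar inputs to unrelated values.

```python
import hashlib


def sha1_int(seed: int, token: str) -> int:
    """Hash `token` with SHA-1, salted by `seed` (hypothetical scheme)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """One minimum per hash function; `items` must be non-empty."""
    return [min(sha1_int(seed, x) for x in items) for seed in range(num_hashes)]


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox sleeps under the lazy cat".split())

exact = jaccard(doc_a, doc_b)  # 5 shared words out of 11 distinct words
approx = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))

print(f"exact={exact:.3f} estimate={approx:.3f}")
```

With more hash functions the estimate should get closer to the exact value, at the cost of longer signatures.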