Open cjauvin opened 10 years ago
Any questions other than the issues you've already brought up?
Actually I had some higher-level questions that I finally asked on a Q/A site:
http://stats.stackexchange.com/questions/81930/clarification-needed-about-min-sim-hashing-lsh
@escherba already provided some interesting answers and suggestions, but if you could join the discussion, that would be great.
Another thing, about the hash function generator that you're using: it makes sense and seems to work great, but I was wondering if it produces "random enough" permutations. I don't know about the statistical properties of Python's default hash
function, as well as its behavior with sequential (1 to n) salting, but I observed some strange things in my results, and it made me suspicious. It seems that some other schemes could be used, like this one for instance:
http://stackoverflow.com/a/2255636/787842
Or perhaps something based on stronger hash functions, like the ones in the hashlib
module? As always @escherba your input would be nice.
I can't comment on your Q&A post since it's farther than I went with this algorithm :)
As far as the hashing goes, you are absolutely right; AFAIK the hash() function doesn't have any statistical guarantees. I think I added the salts as a hack to get around this. We should use SHA or MD5 instead. Do you have an opinion which would be a better choice?
I think it would be best to set a reasonable default, and make the hashing function an optional parameter so the user can override.
Hi! Thank you for this code, I've been studying it thoroughly and it is a very useful and helpful companion to the theory and algorithm sketches found in the MMD book. I have a few questions about some aspects of simhashing theory that are not entirely clear to me, would it be all right if I asked them here?