go2starr / lshhdc

LSH based high dimensional clustering for sets and points
MIT License
79 stars 42 forks source link

Some questions #1

Open cjauvin opened 10 years ago

cjauvin commented 10 years ago

Hi! Thank you for this code, I've been studying it thoroughly and it is a very useful and helpful companion to the theory and algorithm sketches found in the MMD book. I have a few questions about some aspects of simhashing theory that are not entirely clear to me, would it be all right if I asked them here?

go2starr commented 10 years ago

Any questions other than the issues you've already brought up?

cjauvin commented 10 years ago

Actually I had some higher-level questions that I finally asked on a Q/A site:

http://stats.stackexchange.com/questions/81930/clarification-needed-about-min-sim-hashing-lsh

@escherba already provided some interesting answers and suggestions, but if you could join the discussion, that would be great.

cjauvin commented 10 years ago

Another thing, about the hash function generator that you're using: it makes sense and seems to work great, but I was wondering if it produces "random enough" permutations. I don't know about the statistical properties of Python's default hash function, as well as its behavior with sequential (1 to n) salting, but I observed some strange things in my results, and it made me suspicious. It seems that some other schemes could be used, like this one for instance:

http://stackoverflow.com/a/2255636/787842

Or perhaps something based on stronger hash functions, like the ones in the hashlib module? As always @escherba your input would be nice.

go2starr commented 10 years ago

I can't comment on your Q&A post since it's farther than I went with this algorithm :)

As far as the hashing goes, you are absolutely right; AFAIK the hash() function doesn't have any statistical guarantees. I think I added the salts as a hack to get around this. We should use SHA or MD5 instead. Do you have an opinion which would be a better choice?

I think it would be best to set a reasonable default, and make the hashing function an optional parameter so the user can override.