Open Gaglia88 opened 7 years ago
The process is right. I haven't read your code in detail, but on a quick scan it looks right. The similarity here is fairly low (I'm used to matching Jaccard similarities closer to the 0.8 or 0.9 range), but you have the threshold set quite low as well, so it should work. Have you:
Hi, thanks for your reply. I was not sure that the process I had implemented was right.
I tested the algorithm with two identical profiles, and they map to the same buckets! Thanks for the hint.
So I think I made an error when mapping (profile, [buckets]) to (bucket, [profile]). I will check it.
EDIT: The problem was the last flatMap; I don't know why, but it didn't work properly. I solved it this way:
val bucketsWithProfiles = profileWithBuckets.map { x => x._2.map((_, x._1)) }.flatten.groupBy(x => x._1).map(x => (x._1, x._2.map(_._2)))
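For anyone hitting the same problem, here is a minimal sketch of that inversion on plain Scala collections. The sample data is hypothetical and only stands in for the output of the bucketing step; the names `profileWithBuckets` and `bucketsWithProfiles` are taken from the line above, while `collisions` is added for illustration.

```scala
// Hypothetical sample data: (profileId, bucket ids) pairs standing in for
// the real output of the bucketing step.
val profileWithBuckets: Seq[(Int, List[Long])] = Seq(
  (1, List(10L, 20L)),
  (2, List(10L, 30L)),
  (3, List(40L))
)

// Emit one (bucket, profileId) pair per bucket, then group the pairs by bucket
// and keep only the profile ids in each group.
val bucketsWithProfiles: Map[Long, Seq[Int]] =
  profileWithBuckets
    .flatMap { case (profile, buckets) => buckets.map(bucket => (bucket, profile)) }
    .groupBy { case (bucket, _) => bucket }
    .map { case (bucket, pairs) => (bucket, pairs.map { case (_, profile) => profile }) }

// Only buckets that hold more than one profile produce candidate pairs.
val collisions: Map[Long, Seq[Int]] =
  bucketsWithProfiles.filter { case (_, profiles) => profiles.size > 1 }
```

With this sample data, profiles 1 and 2 share bucket 10, so that is the only colliding bucket. If the data lives in a Spark RDD rather than a local collection (an assumption, the thread does not say), `flatMap` followed by `groupByKey` would be the usual equivalent.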
Hi, I'm trying to use MinHasher to compute LSH over different profiles; each profile has a list of tokens (a bag of words). If I understand correctly, the process is:
Is this process correct?
I ask because I tried it, but it never produces colliding buckets: each bucket always contains a single profile.
This is my testing code
And this is the output