plasticityai / magnitude

A fast, efficient universal vector embedding utility package.

most_similar() anomalies #23

Closed: mjmartindale closed this issue 6 years ago

mjmartindale commented 6 years ago

Hi, I converted bilingual FastText embeddings into a medium magnitude model and I'm getting some questionable results:

```
>>> xlvecs=Magnitude("wiki.+de+en.tag.vec.magnitude")
>>> katze=xlvecs.most_similar("katze@de@", topn=5)
>>> print(katze)
[('rabbit,@en@', 0.3190704584121704), ('dogs,@en@', 0.31559139490127563), ('chickenhound@en@', 0.3059767484664917), ('rabbity@en@', 0.30381107330322266), ('#mouse@en@', 0.29921069741249084)]
>>> xlvecs.similarity("katze@de@", "cat@en@")
0.4569693
>>> xlvecs.similarity("katze@de@", "cats@en@")
0.38769498
>>> xlvecs.similarity("katze@de@", "dog@en@")
0.42773518
>>> xlvecs.similarity("katze@de@", "rabbit@en@")
0.40975133
```

"cat@en@", "cats@en@", "dog@en@" and even actual "rabbit@en@" (no spurious comma) are more similar to "katze@de@" but instead I'm getting "rabbits,@en@". Am I misunderstanding what most_similar is supposed to do?

I thought maybe I could try setting max_distance to just a hair above xlvecs.distance("katze@de@", "cat@en@") to see what would happen, but I got TypeError: most_similar() got an unexpected keyword argument 'max_distance'.

I'm on version 0.1.48.

AjayP13 commented 6 years ago

Hi,

Thanks for reporting this. You're right, the documentation was wrong: the argument is not max_distance, it's min_similarity (which takes values from -1.0 to 1.0), but it's a similar concept. I've updated the documentation.
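For example, to get the cutoff you were after, you could pass a threshold just below the katze/cat similarity you measured (a sketch; the 0.45 value is taken from your numbers above):

```python
# Keep only neighbors at least as similar as "cat@en@" (~0.457) was:
xlvecs.most_similar("katze@de@", topn=10, min_similarity=0.45)
```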

We do have tests for most_similar, but this does appear to be broken, and I'm not sure why off the top of my head. Would you mind sending me the .magnitude file you have somehow?

Seems like these services might help you transfer large files:

- https://wetransfer.com/ (2GB)
- https://transfer.pcloud.com/ (5GB)

If you need an e-mail address, you can use opensource@plasticity.ai.

mjmartindale commented 6 years ago

Thanks for updating the documentation! min_similarity makes much more sense :)

I'm attempting to send the gzipped file via pCloud right now (slow upload, but it should get there eventually).

AjayP13 commented 6 years ago

Thanks, received! I'll investigate what's going on with this file and report back here.

AjayP13 commented 6 years ago

There was a small bug in the heap data structure I was using for vocabularies larger than 3,000,000 words (the batch size in which most_similar is calculated), so it only manifested itself in a file like yours with more than 3,000,000 words. I've added test cases to cover this now.
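For context, the batched top-k pattern looks roughly like the sketch below (`query_sim` is a hypothetical callable returning the similarity of vocabulary entry i, not the actual internal API). The important invariant is that the single heap of current best results has to be carried correctly across batch boundaries, which is exactly the path that only a >3,000,000-word vocabulary exercises:

```python
import heapq

def top_k_batched(query_sim, vocab_size, k=10, batch_size=3_000_000):
    """Return the k entries with the highest similarity, scanning in batches."""
    heap = []  # min-heap holding the k best (similarity, index) seen so far
    for start in range(0, vocab_size, batch_size):
        end = min(start + batch_size, vocab_size)
        for i in range(start, end):
            sim = query_sim(i)
            if len(heap) < k:
                heapq.heappush(heap, (sim, i))
            elif sim > heap[0][0]:
                # Evict the current weakest result. This heap must persist
                # across batches rather than being reset per batch -- the
                # class of bug that only shows up past the first batch.
                heapq.heapreplace(heap, (sim, i))
    return sorted(heap, reverse=True)
```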

This is now fixed in 0.1.50 (you'll need to update your version of Magnitude). I've confirmed it with your file, and the results are much more correct now, with similarity values closer to 0.8 rather than 0.3:

```
[(u'katzen@de@', 0.7985203), (u'hauskatze@de@', 0.77978325), (u'hund@de@', 0.7567065), (u'miezekatze@de@', 0.73003185), (u'gl\xfcckskatze@de@', 0.72136724), (u'nachbarskatze@de@', 0.71924585), (u'grinsekatze@de@', 0.7185732), (u'pudelkatze@de@', 0.7160362), (u'katzenohren@de@', 0.70661163), (u'kaninchen@de@', 0.7043324)]
```

Thanks again for reporting it!