nzw0301 / pytorch_skipgram


Similarity results are strange: many words have high similarity #6

Closed nzw0301 closed 5 years ago

nzw0301 commented 6 years ago
python -m pytorch_skipgram.main --input=data/text8 --epoch=5 --out=text8.vec --min-count=10 --sample=1e-3 --batch=20 --negative=5 --gpu-id -1 --loss nce
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(fname="text8.vec")

for w, s in model.most_similar(positive=["king"], topn=10):
    print(w, s)

macedon 0.8969985246658325
philip 0.8820368051528931
epirus 0.8735938668251038
baldwin 0.8733534216880798
afonso 0.8701669573783875
iv 0.8694380521774292
reigning 0.8647789359092712
andronicus 0.8641448020935059
castile 0.8617050647735596
alfonso 0.8605258464813232

The results are similar to those above when the loss is neg or when the batch size is changed, e.g. 10 or 512.
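
For reference, the negative-sampling (neg) objective is usually computed per (center, context) pair as below. This is a minimal PyTorch sketch, not the repo's actual code; function and tensor names are mine:

import torch
import torch.nn.functional as F

def neg_loss(in_vec, pos_vec, neg_vecs):
    """Skip-gram negative-sampling objective for one (center, context) pair.

    in_vec:   (dim,)            center-word embedding
    pos_vec:  (dim,)            true context-word embedding
    neg_vecs: (n_negative, dim) embeddings of sampled noise words
    """
    pos_score = torch.dot(in_vec, pos_vec)   # score of the true pair
    neg_scores = neg_vecs @ in_vec           # scores of the noise pairs
    # maximize log sigma(pos) + sum log sigma(-neg), i.e. minimize the negative
    return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_scores).sum())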

nzw0301 commented 6 years ago

Results from the original word2vec code (and its training parameters).

./word2vec -train ../text8 -output vectors.txt -cbow 0 -size 128 -window 5 -negative 15 -hs 0 -sample 1e-4 -threads 20 -iter 5

Starting training using file ../text8

3M
Vocab size: 71291
Words in train file: 16718843
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(fname="vectors.txt")

# similarity
for w, s in model.most_similar(positive=["king"], topn=10):
    print(w, s)

kings 0.738884687423706
canute 0.7192916870117188
pretender 0.7100580930709839
lulach 0.7061632871627808
throne 0.7048152685165405
sweyn 0.7030582427978516
uzziah 0.6995832920074463
haakon 0.6954260468482971
jehoahaz 0.6949001550674438
rehoboam 0.6932476758956909

# analogy
for w, s in model.most_similar(positive=["king", "woman"], negative=["man"], topn=10):
    print(w, s)

queen 0.6600189208984375
isabella 0.585029125213623
pharaoh 0.5797321200370789
daughter 0.5795919299125671
consort 0.5724056959152222
philippa 0.5536977648735046
betrothed 0.5489968657493591
throne 0.5481375455856323
wife 0.5443253517150879
princess 0.5438390970230103
nzw0301 commented 6 years ago

With minibatch size = 1:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(fname="text8.vec")

for w, s in model.most_similar(positive=["king"], topn=10):
    print(w, s)

afonso 0.9248461723327637
philip 0.9101537466049194
portugal 0.9048298001289368
succeeded 0.9026679992675781
darius 0.9015048742294312
ruler 0.9007291197776794
castile 0.9001737833023071
macedon 0.9001021981239319
epirus 0.8928771615028381
shalmaneser 0.8904857635498047

Does the training part seem to be wrong?
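
One thing worth checking against the reference implementation is the frequent-word subsampling: word2vec.c keeps a word with probability (sqrt(f/t) + 1) * t/f, where f is the word's relative frequency and t is the --sample threshold. A quick sketch of that rule (function and variable names are mine):

import math
import random

def keep_word(count, total_words, sample=1e-4):
    """Frequent-word subsampling rule as in word2vec.c.

    count:       raw corpus frequency of the word
    total_words: total number of training tokens
    sample:      the --sample threshold
    """
    f = count / total_words
    keep_prob = (math.sqrt(f / sample) + 1) * (sample / f)
    return random.random() < keep_prob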

nzw0301 commented 5 years ago

Update: 964476121f65933d9df6fc1723c08b921d608d0c

The results have become better, but the similarity scores are still too high...

Training parameters:

python -m pytorch_skipgram.main --input=data/text8 --epoch=1 --out=text8.vec --min-count=5 --sample=1e-5 --batch=1 --negative=10 --gpu-id -1
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(fname="text8.vec")

for w, s in model.most_similar(positive=["king"], topn=10):
    print(w, s)

emperor 0.9891945123672485
pope 0.9840819239616394
constantine 0.973068356513977
queen 0.9724664688110352
augustus 0.9719542264938354
bishop 0.9710444211959839
prince 0.970699667930603
iv 0.9689308404922485
reign 0.9665072560310364
vi 0.9662578105926514
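
Scores around 0.97 across the board suggest the vectors may all lie in a narrow cone. One quick check is the mean pairwise cosine similarity over a random sample of vectors; a sketch assuming a gensim version that exposes the embedding matrix as model.vectors and a vocabulary of at least 1000 words:

import numpy as np

# model.vectors: (vocab_size, dim) matrix of the embeddings loaded above
idx = np.random.choice(len(model.vectors), 1000, replace=False)
vecs = model.vectors[idx]
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
sims = vecs @ vecs.T                                       # cosine similarities
mean_sim = (sims.sum() - len(vecs)) / (len(vecs) * (len(vecs) - 1))
print(mean_sim)  # values close to 1.0 indicate the embeddings have collapsed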
nzw0301 commented 5 years ago

The results seem fine.


python -m pytorch_skipgram.main --input=data/text8 --dim=128 --epoch=5 --out=text8.vec --min-count=5 --sample=1e-4 --batch=16 --negative=15 --gpu-id -1

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(fname="text8.vec")

# similarity
for w, s in model.most_similar(positive=["king"], topn=10):
    print(w, s)

canute 0.7516068816184998
sweyn 0.7161520719528198
haakon 0.715397298336029
plantagenet 0.7071711421012878
kings 0.7037447094917297
valdemar 0.703365683555603
omri 0.699432373046875
capet 0.6928986310958862
conqueror 0.6921138763427734
eochaid 0.690447986125946
# analogy
for w, s in model.most_similar(positive=["king", "woman"], negative=["man"], topn=10):
    print(w, s)

queen 0.649447500705719
daughter 0.6051150560379028
anjou 0.6023151874542236
consort 0.595568060874939
son 0.5846152305603027
marries 0.5731959342956543
aquitaine 0.5700898170471191
isabella 0.568467378616333
infanta 0.5641375780105591
princess 0.5628763437271118
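
Beyond spot-checking single queries, the standard questions-words.txt analogy set can be run through gensim to get an overall accuracy. A sketch assuming gensim >= 3.4, where evaluate_word_analogies is available; the path to the question file is an example:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(fname="text8.vec")
# questions-words.txt ships with the original word2vec distribution
score, sections = model.evaluate_word_analogies("questions-words.txt")
print(score)  # overall analogy accuracy in [0, 1]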