Closed ehsansherkat closed 8 years ago
Hi Ehsan,
It is somewhat expected.
I assume you are using gensim's accuracy function? If so, there are a few things to take into account:
Nonetheless, the family set got 0/272, but trying:
In [100]: model.most_similar( positive=["brother", "girl"], negative=["boy"], topn=100)
Out[100]:
[(u'sister', 0.6596153974533081),..... ]
gets sister as the most similar. Can you take a further look at other pairs?
I wonder if it is that u'sister' (unicode)
Thanks for your response. So do you have any test dataset which could be used for your trained vectors?
I think this might be a bug in gensim, due to comparing strings to unicode strings? It sounds like it, because trying different samples from the family dataset leads to the correct answer.
Can you code the evaluation yourself? It is as simple as splitting each line of questions.txt and doing:
split_line[1] + split_line[2] - split_line[0] = split_line[3]
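A minimal sketch of that hand-rolled evaluation, in case it helps. The helper names `parse_questions` and `evaluate` are made up for illustration, and it assumes the file uses the usual layout of ": section" header lines followed by four words per line. `most_similar` is any callable that accepts gensim's `positive`, `negative` and `topn` keywords, for example `model.most_similar`. No lowercasing is applied here, so adjust that to match how the vectors were trained.

```python
from collections import defaultdict


def parse_questions(lines):
    """Yield (section, a, b, c, expected) from questions-words.txt style lines."""
    section = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            # Section headers look like ": family"
            section = line[1:].strip()
            continue
        split_line = line.split()
        if len(split_line) != 4:
            continue  # skip malformed lines
        yield (section,) + tuple(split_line)


def evaluate(most_similar, lines):
    """Return {section: (correct, total)} using only the top-1 prediction.

    Implements split_line[1] + split_line[2] - split_line[0] = split_line[3].
    Questions containing an out-of-vocabulary word are skipped, not counted.
    """
    counts = defaultdict(lambda: [0, 0])
    for section, a, b, c, expected in parse_questions(lines):
        try:
            predicted = most_similar(positive=[b, c], negative=[a], topn=1)
        except KeyError:
            continue  # a word is missing from the vocabulary
        counts[section][1] += 1
        if predicted and predicted[0][0] == expected:
            counts[section][0] += 1
    return {name: tuple(pair) for name, pair in counts.items()}
```

Usage would be something like `evaluate(model.most_similar, open("questions-words.txt"))`, where the file path is whatever your local copy is called.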
As mentioned in Radim's word2vec tutorial, doing well or badly in this task means very little, so you should come up with an evaluation of the model within the domain of the problem you are trying to solve.
It seems to be restrict_vocab:
https://github.com/piskvorky/gensim/blob/master/gensim/models/word2vec.py#L1414
Try setting it to a higher value.
I set it to 30000 and got better results:
2016-05-24 09:48:11,584 : INFO : capital-common-countries: 50.0% (1/2)
2016-05-24 09:48:11,684 : INFO : capital-world: 100.0% (1/1)
2016-05-24 09:48:14,640 : INFO : currency: 27.5% (11/40)
2016-05-24 09:48:51,728 : INFO : family: 19.8% (100/506)
2016-05-24 09:50:04,460 : INFO : gram1-adjective-to-adverb: 7.2% (71/992)
2016-05-24 09:51:04,384 : INFO : gram2-opposite: 13.5% (110/812)
2016-05-24 09:52:41,778 : INFO : gram3-comparative: 29.1% (387/1332)
2016-05-24 09:53:55,949 : INFO : gram4-superlative: 12.9% (128/992)
2016-05-24 09:55:13,975 : INFO : gram5-present-participle: 23.8% (251/1056)
2016-05-24 09:55:17,852 : INFO : gram6-nationality-adjective: 11.3% (6/53)
2016-05-24 09:57:13,023 : INFO : gram7-past-tense: 18.3% (286/1560)
2016-05-24 09:58:51,984 : INFO : gram8-plural: 13.7% (182/1332)
2016-05-24 09:59:56,618 : INFO : gram9-plural-verbs: 21.3% (185/870)
2016-05-24 09:59:56,619 : INFO : total: 18.0% (1719/9548)
For some reason that accuracy method is extremely slow. If you still want real numbers for this, you can write a for-loop wrapping most_similar over the pairs of questions.txt and then look in the topn results. I'm pretty sure those numbers will be pushed even higher.
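A rough sketch of such a loop, counting a question as a hit when the expected word appears anywhere in the topn results. The function name `topn_hit_rate` and the default `topn=10` are assumptions made for illustration. `questions` is any iterable of already-split 4-word lines, and `most_similar` is any callable with gensim's `positive`, `negative` and `topn` keywords, such as `model.most_similar`.

```python
def topn_hit_rate(most_similar, questions, topn=10):
    """Return (hits, total) where a hit means `expected` is in the top-n results.

    Questions containing an out-of-vocabulary word are skipped, not counted.
    """
    hits = 0
    total = 0
    for a, b, c, expected in questions:
        try:
            candidates = most_similar(positive=[b, c], negative=[a], topn=topn)
        except KeyError:
            continue  # a word is missing from the vocabulary
        total += 1
        # candidates is a list of (word, similarity) pairs
        if any(word == expected for word, _score in candidates):
            hits += 1
    return hits, total
```

With `topn=1` this reduces to the strict top-1 check, so comparing a few values of `topn` shows how much of the gap is down to near misses.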
Hi David,
I have tested the accuracy of your pre-trained model with the "questions-words.txt" dataset from Google. The results are:
2016-05-23 12:20:41,450 : INFO : family: 0.0% (0/272)
2016-05-23 12:20:45,831 : INFO : gram1-adjective-to-adverb: 3.8% (23/600)
2016-05-23 12:20:47,155 : INFO : gram2-opposite: 14.3% (26/182)
2016-05-23 12:20:53,058 : INFO : gram3-comparative: 0.0% (0/812)
2016-05-23 12:20:55,041 : INFO : gram4-superlative: 5.5% (15/272)
2016-05-23 12:21:00,144 : INFO : gram5-present-participle: 0.0% (0/702)
2016-05-23 12:21:08,831 : INFO : gram7-past-tense: 4.1% (49/1190)
2016-05-23 12:21:14,729 : INFO : gram8-plural: 0.0% (0/812)
2016-05-23 12:21:18,418 : INFO : gram9-plural-verbs: 3.6% (18/507)
2016-05-23 12:21:18,419 : INFO : total: 2.4% (131/5349)
The accuracy is quite low (2.4%). Is that normal? Thanks