idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Pre-trained model accuracy #17

Closed ehsansherkat closed 8 years ago

ehsansherkat commented 8 years ago

Hi David,

I have tested the accuracy of your pre-trained model with "questions-words.txt" dataset from Google. The results are:

2016-05-23 12:20:41,450 : INFO : family: 0.0% (0/272)
2016-05-23 12:20:45,831 : INFO : gram1-adjective-to-adverb: 3.8% (23/600)
2016-05-23 12:20:47,155 : INFO : gram2-opposite: 14.3% (26/182)
2016-05-23 12:20:53,058 : INFO : gram3-comparative: 0.0% (0/812)
2016-05-23 12:20:55,041 : INFO : gram4-superlative: 5.5% (15/272)
2016-05-23 12:21:00,144 : INFO : gram5-present-participle: 0.0% (0/702)
2016-05-23 12:21:08,831 : INFO : gram7-past-tense: 4.1% (49/1190)
2016-05-23 12:21:14,729 : INFO : gram8-plural: 0.0% (0/812)
2016-05-23 12:21:18,418 : INFO : gram9-plural-verbs: 3.6% (18/507)
2016-05-23 12:21:18,419 : INFO : total: 2.4% (131/5349)

The accuracy is quite low (2.4%). Is that normal? Thanks

dav009 commented 8 years ago

Hi Ehsan,

It is somewhat expected. I assume you are using gensim's accuracy function? If so, there are a few things to take into account:

Nonetheless, the family set got 0/272, but trying:

In [100]: model.most_similar( positive=["brother", "girl"], negative=["boy"], topn=100)
Out[100]:
[(u'sister', 0.6596153974533081),..... ]

gets "sister" as the most similar word. Can you take a further look at other pairs?

dav009 commented 8 years ago

I wonder if the issue is that u'sister' is a unicode string.

ehsansherkat commented 8 years ago

Thanks for your response. Do you have any test dataset that could be used with your trained vectors?

dav009 commented 8 years ago

I think this might be a bug in gensim, perhaps from comparing plain strings to unicode strings. It sounds like it, because trying different samples from the family dataset leads to the correct answer.

Can you code the evaluation yourself? It is simple enough: split each line of questions-words.txt and check that split_line[1] + split_line[2] - split_line[0] = split_line[3].
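
For illustration, here is a minimal sketch of that hand-rolled check. The model path, the lowercasing, and the skipping of out-of-vocabulary questions are assumptions; it reuses the same model.most_similar call shown above (the gensim API of that time).

from gensim.models import Word2Vec

model = Word2Vec.load("en.model")  # placeholder path to the pre-trained model

correct = total = 0
with open("questions-words.txt") as f:
    for line in f:
        if line.startswith(":"):  # section headers such as ": family"
            continue
        words = line.lower().split()
        if len(words) != 4 or any(w not in model for w in words):
            continue  # skip malformed or out-of-vocabulary questions
        a, b, c, expected = words
        # a is to b as c is to ? -> predicted vector is b + c - a
        predicted = model.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        total += 1
        if predicted == expected:
            correct += 1

print("accuracy: %.1f%% (%d/%d)" % (100.0 * correct / total, correct, total))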

As mentioned in Radim's word2vec tutorial, doing well or badly on this task means very little, so you should come up with an evaluation of the model within the domain of the problem you are trying to solve.

dav009 commented 8 years ago

It seems to be restrict_vocab: https://github.com/piskvorky/gensim/blob/master/gensim/models/word2vec.py#L1414

Try setting it to a higher value.
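
For reference, a sketch of what that call could look like; the model path and the 300000 value are only placeholders, and newer gensim versions replace accuracy with evaluate_word_analogies on the KeyedVectors object.

from gensim.models import Word2Vec

model = Word2Vec.load("en.model")  # placeholder path to the pre-trained model

# restrict_vocab limits both the questions and the candidate answers to the
# N most frequent words, so a small value discards most of the wiki vocabulary.
sections = model.accuracy("questions-words.txt", restrict_vocab=300000)
for section in sections:
    right, wrong = len(section["correct"]), len(section["incorrect"])
    if right + wrong:
        print("%s: %.1f%% (%d/%d)" % (section["section"],
                                      100.0 * right / (right + wrong),
                                      right, right + wrong))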

ehsansherkat commented 8 years ago

I set it to 30000 and got better results:

2016-05-24 09:48:11,584 : INFO : capital-common-countries: 50.0% (1/2)
2016-05-24 09:48:11,684 : INFO : capital-world: 100.0% (1/1)
2016-05-24 09:48:14,640 : INFO : currency: 27.5% (11/40)
2016-05-24 09:48:51,728 : INFO : family: 19.8% (100/506)
2016-05-24 09:50:04,460 : INFO : gram1-adjective-to-adverb: 7.2% (71/992)
2016-05-24 09:51:04,384 : INFO : gram2-opposite: 13.5% (110/812)
2016-05-24 09:52:41,778 : INFO : gram3-comparative: 29.1% (387/1332)
2016-05-24 09:53:55,949 : INFO : gram4-superlative: 12.9% (128/992)
2016-05-24 09:55:13,975 : INFO : gram5-present-participle: 23.8% (251/1056)
2016-05-24 09:55:17,852 : INFO : gram6-nationality-adjective: 11.3% (6/53)
2016-05-24 09:57:13,023 : INFO : gram7-past-tense: 18.3% (286/1560)
2016-05-24 09:58:51,984 : INFO : gram8-plural: 13.7% (182/1332)
2016-05-24 09:59:56,618 : INFO : gram9-plural-verbs: 21.3% (185/870)
2016-05-24 09:59:56,619 : INFO : total: 18.0% (1719/9548)

dav009 commented 8 years ago

For some reason that accuracy method is extremely slow. If you still want to get real numbers for this, you can just write a for-loop wrapping most_similar over the pairs in questions-words.txt and then look in the topn results. I'm pretty sure those numbers will be pushed even higher.
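
Something along these lines; a sketch only, with the model path and the top-n cut-off as placeholders:

from gensim.models import Word2Vec

model = Word2Vec.load("en.model")  # placeholder path to the pre-trained model
TOPN = 10                          # arbitrary cut-off for "look in the topn results"

hits = total = 0
with open("questions-words.txt") as f:
    for line in f:
        if line.startswith(":"):  # skip section headers
            continue
        words = line.lower().split()
        if len(words) != 4 or any(w not in model for w in words):
            continue  # skip malformed or out-of-vocabulary questions
        a, b, c, expected = words
        neighbours = model.most_similar(positive=[b, c], negative=[a], topn=TOPN)
        total += 1
        # count a hit if the expected word appears anywhere in the top-n neighbours
        if expected in [w for w, _ in neighbours]:
            hits += 1

print("top-%d hit rate: %.1f%% (%d/%d)" % (TOPN, 100.0 * hits / total, hits, total))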