idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Wikipedia articles coverage #4

Open nickvosk opened 9 years ago

nickvosk commented 9 years ago

Hi @dav009, very promising work here!

I wrote a simple script to test the coverage of the prebuilt model for English Wikipedia articles. I used the Wikipedia article titles from a preprocessed Wikipedia Miner March 2014 dump.

Out of 4342357 articles, only 226319 had a matching vector (~5%). I have noticed that the model usually covers popular entities but does not cover tail entities. I guess this might be because words below a certain count were ignored and because of errors in preprocessing.
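For reference, a coverage check of this kind can be sketched roughly as below. This is not the actual script: the `DBPEDIA_ID/` token prefix, the underscore normalization, and the helper names `entity_token` and `coverage` are assumptions made for illustration, and the vocabulary is a toy set standing in for the model's real vocabulary.

```python
def entity_token(title, prefix="DBPEDIA_ID/"):
    """Map an article title to the token assumed to represent it in the model.

    Assumption: entities are stored as prefix + title with spaces as underscores.
    """
    return prefix + title.strip().replace(" ", "_")


def coverage(titles, vocabulary):
    """Return (matched, total, fraction) for titles whose token is in vocabulary."""
    total = 0
    matched = 0
    for title in titles:
        total += 1
        if entity_token(title) in vocabulary:
            matched += 1
    fraction = matched / total if total else 0.0
    return matched, total, fraction


if __name__ == "__main__":
    # Toy data standing in for the dump titles and the model vocabulary.
    titles = ["Barack Obama", "Some Obscure Village", "Python (programming language)"]
    vocabulary = {"DBPEDIA_ID/Barack_Obama", "the", "of"}
    matched, total, fraction = coverage(titles, vocabulary)
    print(f"{matched}/{total} titles covered ({fraction:.1%})")
```

With the real data, `vocabulary` would be the set of keys in the loaded model and `titles` would be read from the dump's title list.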

Any ideas on this? I have noticed that your TODOs include resolving redirects and also co-reference resolution inside the articles, but I guess we would expect better coverage even without these.

Thanks.

dav009 commented 9 years ago

Thanks, this is definitely an important issue to address.

nickvosk commented 9 years ago

Yes, the lack of explicit links is definitely one of the problems. I think that doing entity linking inside each article might lead to better coverage (by restricting the candidate entities to the ones that are already linked inside the article). This might lead to some false positives in some cases though.

Also, the first phrase in a wiki article does not have an explicit link, but it could be linked to the id of the article without much risk. :)

dav009 commented 9 years ago

I was thinking of something like this:

nickvosk commented 9 years ago

I think that indeed the first line needs some preprocessing to solve these issues, but I don't think that the vectors are gonna be polluted by adding the first line, as it usually contains quite useful context :)

Yes, we are describing almost the same thing for intra-article entity linking :) I am proposing that you can even expand that logic to every link in the article (by considering their corresponding mentions). It would be interesting to create a small collection to evaluate this.
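A minimal sketch of the intra-article idea, assuming plain wikitext input with `[[Target]]` and `[[Target|surface]]` links: collect the surface forms that are already linked in the article, then treat later unlinked occurrences of those surface forms as mentions of the same entity. The `DBPEDIA_ID/` prefix and the function names are hypothetical, and exact string matching like this is precisely where the false positives mentioned earlier can come from.

```python
import re

# Matches [[Target]] and [[Target|surface form]] (no nested links).
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def token_for(target, prefix="DBPEDIA_ID/"):
    """Assumed entity token format: prefix + title with underscores."""
    return prefix + target.strip().replace(" ", "_")


def link_repeated_mentions(wikitext):
    """Replace links and unlinked repeats of their surface forms with entity tokens."""
    # Pass 1: surface form -> target, restricted to links present in this article.
    surface_to_target = {}
    for m in LINK_RE.finditer(wikitext):
        target = m.group(1)
        surface = (m.group(2) or target).strip()
        surface_to_target.setdefault(surface, target)  # first link wins
    if not surface_to_target:
        return wikitext

    # Longest surface forms first, so "New York City" wins over "New York".
    surfaces = sorted(surface_to_target, key=len, reverse=True)
    mention_re = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(s) for s in surfaces) + r")(?!\w)"
    )

    def replace_mentions(segment):
        return mention_re.sub(
            lambda m: token_for(surface_to_target[m.group(1)]), segment
        )

    # Pass 2: rewrite each link as a token, and scan only the text between links.
    out = []
    pos = 0
    for m in LINK_RE.finditer(wikitext):
        out.append(replace_mentions(wikitext[pos:m.start()]))
        out.append(token_for(m.group(1)))
        pos = m.end()
    out.append(replace_mentions(wikitext[pos:]))
    return "".join(out)


if __name__ == "__main__":
    text = "[[Paris]] is the capital of [[France]]. Paris hosts the Louvre."
    print(link_repeated_mentions(text))
```

A small hand-labelled collection would be needed to measure how often the exact-match step links the wrong entity.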

dav009 commented 9 years ago

Got it. Well, yeah, considering how little coverage of ids this currently has, it is worth going for it.

dav009 commented 9 years ago

@nickvosk A good pointer that we could use here might be [1]. The surface forms referring to the article's own entity are usually in bold, as the PR there suggests, so linking those seems to be a more informed assumption.

[1] https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/356

Edit: Updated wrong link
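A rough sketch of the bold-surface-form idea, assuming raw wikitext where bold is written as `'''text'''`: bold spans in the lead paragraph are treated as mentions of the article's own entity and replaced by its token. The `DBPEDIA_ID/` prefix and the function name are assumptions for illustration, and the pattern deliberately skips bold-italic (`'''''text'''''`) and surface forms containing apostrophes.

```python
import re

# Exactly three quotes on each side; the content may not contain a quote or newline.
BOLD_RE = re.compile(r"(?<!')'''([^'\n]+)'''(?!')")


def link_bold_self_mentions(first_paragraph, article_title, prefix="DBPEDIA_ID/"):
    """Replace bold spans in the lead paragraph with the article's own entity token.

    Assumption: the token format is prefix + title with spaces as underscores.
    """
    token = prefix + article_title.strip().replace(" ", "_")
    # A function replacement avoids backslash/group interpretation in the token.
    return BOLD_RE.sub(lambda m: token, first_paragraph)


if __name__ == "__main__":
    lead = "'''Barack Hussein Obama II''' (born 1961) is an American politician."
    print(link_bold_self_mentions(lead, "Barack Obama"))
```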

nickvosk commented 9 years ago

Can you elaborate on how this would fix the coverage problem, @dav009?

Also, this paper looks relevant: Noraset, Thanapon, Chandra Bhagavatula, and Doug Downey. Adding High-Precision Links to Wikipedia.

dav009 commented 9 years ago

@nickvosk :) Good reference, I think I saw it before at ACL. Well, it would help to find the right anchors to create "fake links" to a topic within its own article.

As the paper suggests, we could also run some NEL with very high confidence thresholds to add some extra links, and probably get more entities above the minimum count threshold imposed by gensim's word2vec implementation.
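To illustrate why extra links matter here: gensim's `Word2Vec` ignores tokens whose total frequency is lower than `min_count` (5 by default), so an entity linked only a few times never gets a vector. The sketch below mimics only that frequency filter with the standard library; the `DBPEDIA_ID/` prefix, the function name, and the toy corpus are assumptions, and the settings used for the prebuilt model are not stated in this thread.

```python
from collections import Counter


def surviving_entities(sentences, min_count=5, prefix="DBPEDIA_ID/"):
    """Return entity tokens that occur at least min_count times in the corpus.

    Mirrors only the frequency cut-off; no vectors are trained here.
    """
    counts = Counter(
        tok for sent in sentences for tok in sent if tok.startswith(prefix)
    )
    return {tok for tok, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    entity = "DBPEDIA_ID/Some_Obscure_Village"  # hypothetical tail entity
    # Three explicit links: below the threshold, so the entity is dropped.
    corpus = [["near", entity, "lies", "a", "river"]] * 3
    print(entity in surviving_entities(corpus))

    # Two extra high-confidence links lift it to the threshold.
    corpus_with_extra = corpus + [[entity, "is", "a", "village"]] * 2
    print(entity in surviving_entities(corpus_with_extra))
```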

nickvosk commented 9 years ago

@dav009 exactly :)

dav009 commented 9 years ago

Looking at some old raw counts from the DBpedia Spotlight project, it seems that out of 6M topics in those counts, 4M have fewer than 5 links.

Surprisingly, filtering for topics with more than 50 links gives us 268836, which is similar to our current coverage of 226319.
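The kind of filtering described above can be sketched as follows, assuming the raw counts are available as a mapping from topic to number of incoming links. The mapping, the function names, and the toy values are hypothetical; the real counts are not reproduced here.

```python
def topics_with_more_than(link_counts, threshold):
    """Count topics whose link count is strictly greater than threshold."""
    return sum(1 for n in link_counts.values() if n > threshold)


def topics_with_fewer_than(link_counts, threshold):
    """Count topics whose link count is strictly less than threshold."""
    return sum(1 for n in link_counts.values() if n < threshold)


if __name__ == "__main__":
    # Toy counts standing in for the raw topic counts.
    link_counts = {"A": 120, "B": 51, "C": 50, "D": 4, "E": 0}
    print("more than 50 links:", topics_with_more_than(link_counts, 50))
    print("fewer than 5 links:", topics_with_fewer_than(link_counts, 5))
```

Sweeping the threshold over the real counts would show which cut-off best matches the observed coverage.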