Wikipedia articles coverage

nickvosk commented 9 years ago

Hi @dav009, very promising work here!

I wrote a simple script to test the coverage of the prebuilt model for English Wikipedia articles. I used the Wikipedia article titles from a preprocessed Wikipedia Miner March 2014 dump.

Out of 4342357 articles, only 226319 had a matching vector (~5%). I have noticed that the model usually covers popular entities but does not cover tail entities. I guess this might be because words below a certain count were ignored and because of errors in preprocessing.
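
For reference, the check described above can be sketched roughly as follows. This is a minimal sketch, not the actual script: the `coverage` function, its `prefix` parameter, and the `"ID/"` prefix in the toy data are hypothetical, and the model's vocabulary is assumed to be available as a plain collection of strings.

```python
def coverage(titles, vocabulary, prefix=""):
    """Count how many article titles have a matching key in the vocabulary.

    `prefix` is a hypothetical parameter: whatever string the model puts in
    front of an article identifier (an assumption, adjust to the real model).
    """
    vocab = set(vocabulary)
    total = 0
    matched = 0
    for title in titles:
        total += 1
        # Wikipedia identifiers use underscores where titles use spaces.
        key = prefix + title.replace(" ", "_")
        if key in vocab:
            matched += 1
    fraction = matched / total if total else 0.0
    return matched, total, fraction


# Toy data only; the real run used 4342357 titles.
titles = ["Barack Obama", "Athens", "Some Obscure Village"]
vocabulary = {"ID/Barack_Obama", "ID/Athens", "the", "of"}
matched, total, fraction = coverage(titles, vocabulary, prefix="ID/")
print(f"{matched}/{total} titles covered ({fraction:.1%})")
```

With the numbers reported above, 226319 / 4342357 comes out at roughly 5.2%.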

Any ideas on this? I have noticed that your TODOs include resolving redirects and also co-reference resolution inside the articles, but I guess we would expect better coverage even without these.

Thanks.

dav009 commented 9 years ago

Yes, I think one reason for losing some Wikipedia identifiers is definitely the min threshold that had to be used. It seems that deeplearning4j has fixed many of the problems it initially had, so I have to try running it with deeplearning4j and a lower min threshold. Another way could be "manipulating" the corpus to ensure Wikipedia identifiers are above that threshold.
There is another reason: be aware that, as it currently stands, it only adds Wikipedia identifiers which have explicit links within the Wikipedia corpus. E.g., suppose you have an article with the id wikipedia_id1 but it is never linked anywhere; then it won't appear in the model that we generate. My rough guess is that many of the identifiers missing from the model are due to this. I have a ToDo to address this: I want to substitute one of the mentions within wikipedia_id1's article, sort of to create a fake link. (Again, that substitution has to get the identifier above the threshold mentioned previously.)
Then comes the second part: resolving redirects.
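
To make the two causes concrete, here is a rough sketch that counts explicit `[[...]]` link targets in a toy corpus and separates articles that are never linked from articles linked fewer than `min_count` times. The function names, the regex, and the toy articles are all assumptions for illustration; real wikitext needs more normalization (redirects, first-letter case, templates).

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|anchor]] and [[Target#Section|anchor]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def count_link_targets(articles):
    """Count how often each identifier appears as an explicit link target."""
    counts = Counter()
    for text in articles.values():
        for match in LINK_RE.finditer(text):
            target = match.group(1).strip().replace(" ", "_")
            if target:
                counts[target] += 1
    return counts


def missing_ids(articles, min_count):
    """Split the ids that would get no vector into the two causes."""
    counts = count_link_targets(articles)
    never_linked = {t for t in articles if counts[t] == 0}
    below_threshold = {t for t in articles if 0 < counts[t] < min_count}
    return never_linked, below_threshold


# Toy corpus (hypothetical): id -> wikitext.
articles = {
    "Barack_Obama": "Obama was born in [[Honolulu]], [[Hawaii]].",
    "Honolulu": "Capital of [[Hawaii]]. Birthplace of [[Barack Obama|Obama]].",
    "Hawaii": "A U.S. state whose capital is [[Honolulu]].",
    "Orphan_Article": "Nothing links here, although it mentions [[Hawaii]].",
}
never_linked, below_threshold = missing_ids(articles, min_count=2)
print("never linked:", sorted(never_linked))
print("below threshold:", sorted(below_threshold))
```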

Thanks, this is definitely an important issue to address

nickvosk commented 9 years ago

Yes, the lack of explicit links is definitely one of the problems. I think that doing entity linking inside each article might lead to better coverage (by restricting the candidate entities to the ones that are already linked inside the article). This might lead to some false positives in some cases though.

Also, the first phrase in a wiki article does not have an explicit link, but it could be linked to the id of the article without much risk. :)

dav009 commented 9 years ago

Creating a fake link at the beginning of the article could be an alternative, but I kinda don't like it because then the generated vector will be polluted with the dates, locations, etc. that are usually aggregated in the first line of a Wikipedia article (e.g. the pronunciation).

I was thinking of something like this:

Suppose you have the article Barack_Obama; then later on, in some of the paragraphs, either obama or barack will be mentioned. Just assume that mention is for Barack_Obama and create a few fake links. Whether correct or incorrect, the context will probably still make sense as it is within the article of the entity, and it is probably a much better context than one created at the beginning.
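
A rough sketch of that idea, assuming articles are available as wikitext with `[[Target|anchor]]` links. The `add_self_links` function and its `max_links` parameter are hypothetical, and the matching is deliberately naive: whole-word, case-insensitive matches on the title or any of its words, skipping text that is already inside a link.

```python
import re

# Capturing group so re.split keeps the existing links in the result.
EXISTING_LINK = re.compile(r"(\[\[.*?\]\])")


def add_self_links(title, text, max_links=3):
    """Turn up to `max_links` mentions of the title's words into fake links."""
    words = [w for w in title.split("_") if w]
    if not words:
        return text
    # Try the full title first, then the individual words.
    surface_forms = sorted({" ".join(words), *words}, key=len, reverse=True)
    mention = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in surface_forms) + r")\b",
        re.IGNORECASE,
    )
    remaining = max_links
    parts = EXISTING_LINK.split(text)
    # Even indices hold plain text, odd indices hold existing links.
    for i in range(0, len(parts), 2):
        if remaining <= 0:
            break
        parts[i], made = mention.subn(
            lambda m: f"[[{title}|{m.group(0)}]]", parts[i], count=remaining
        )
        remaining -= made
    return "".join(parts)


text = "Obama was born in [[Honolulu]]. Later, Barack Obama met [[Michelle Obama]]."
print(add_self_links("Barack_Obama", text, max_links=2))
```

Titles with disambiguation suffixes such as `(planet)` would need extra handling; this sketch does not attempt it.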

nickvosk commented 9 years ago

I think that indeed the first line needs some preprocessing to solve these issues, but I don't think that the vectors are gonna be polluted by adding the first line, as it usually contains quite useful context :)

Yes, we are describing almost the same thing for intra-article entity linking :) I am proposing that you can even expand that logic to every link in the article (by considering their corresponding mentions). It would be interesting to create a small collection to evaluate this.

dav009 commented 9 years ago

got it. well, yeah, considering how little coverage of ids this currently has, it is worth going for it.

dav009 commented 9 years ago

@nickvosk A good pointer that we could use here might be [1]. The surface forms referring to the article's entity are usually in bold, as the PR there suggests, so it seems to be a more informed assumption.

[1] https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/356

Edit: Updated wrong link
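
As an illustration of the bold assumption only (this is not the code from the PR), bold surface forms could be pulled from the lead paragraph of the raw wikitext like this. The `bold_surface_forms` function is hypothetical, and the lead is assumed to be everything before the first blank line.

```python
import re

# Wikitext marks bold with three apostrophes on each side.
BOLD_RE = re.compile(r"'''(.+?)'''")


def bold_surface_forms(wikitext, lead_only=True):
    """Return the distinct bold phrases, in order of first appearance."""
    lead = wikitext.split("\n\n", 1)[0] if lead_only else wikitext
    forms = []
    for match in BOLD_RE.finditer(lead):
        # Strip leftover apostrophes from bold-italic markup (five quotes).
        form = match.group(1).strip("'").strip()
        if form and form not in forms:
            forms.append(form)
    return forms


wikitext = (
    "'''Barack Hussein Obama II''' (born August 4, 1961), also known as "
    "'''Barack Obama''', is an American politician.\n\n"
    "He was born in [[Honolulu]]."
)
print(bold_surface_forms(wikitext))
```

The returned phrases could then serve as the anchors to look for in the rest of the article when creating fake links.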

nickvosk commented 9 years ago

Can you elaborate on how this would fix the coverage problem, @dav009?

Also, this paper looks relevant: Noraset, Thanapon, Chandra Bhagavatula, and Doug Downey. Adding High-Precision Links to Wikipedia.

dav009 commented 9 years ago

@nickvosk :) Good reference, I think I saw it before at ACL. Well, it would help to find the right anchors to create "fake links" of a topic within its own article.

As the paper suggests, we could also run some NEL with very high confidence values to add some extra links and probably get above the min threshold imposed by the implementation of gensim's word2vec.

nickvosk commented 9 years ago

@dav009 exactly :)

dav009 commented 9 years ago

Looking at some old raw counts from the DBpedia Spotlight project, it seems that out of 6M topics in those counts, 4M have fewer than 5 links.

Surprisingly, filtering for topics with more than 50 links gives us 268836, which is similar to our current coverage: 226319.
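
The filtering above amounts to a threshold sweep over per-topic link counts. A minimal sketch, assuming the raw counts are loaded into a dict of topic to link count (the function names and the toy numbers are made up for illustration):

```python
def topics_with_more_than(link_counts, threshold):
    """Number of topics whose link count is strictly above `threshold`."""
    return sum(1 for n in link_counts.values() if n > threshold)


def threshold_sweep(link_counts, thresholds):
    """Map each threshold to the number of topics that survive it."""
    return {t: topics_with_more_than(link_counts, t) for t in thresholds}


# Toy counts only; the real counts cover about 6M topics.
link_counts = {"A": 120, "B": 51, "C": 50, "D": 4, "E": 0}
for threshold, survivors in threshold_sweep(link_counts, [0, 4, 50]).items():
    print(f"more than {threshold} links: {survivors} topics")
```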

idio / wiki2vec

Wikipedia articles coverage #4