Open nickvosk opened 9 years ago
If an article for wikipedia_id1 exists but it is never linked anywhere, then it won't appear in the model that we generate. My rough guess is that many of the identifiers that are not in the model are missing because of this. I have a ToDo to address it: I want to substitute one of the mentions within wikipedia_id1's article, sort of creating a fake link. (Again, that substitution has to get the entity above the previously discussed threshold.)

Thanks, this is definitely an important issue to address.
Yes, the lack of explicit links is definitely one of the problems. I think that doing entity linking inside each article might lead to better coverage (by restricting the candidate entities to the ones that are already linked inside the article). This might lead to some false positives in some cases though.
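A minimal sketch of that intra-article restriction, assuming we already have the surface forms linked elsewhere in the article as a mention-to-entity map (the `known_links` input and the wikitext-style `[[entity|mention]]` output are illustrative choices, not the project's actual format):

```python
import re

def link_unlinked_mentions(text, known_links):
    """Link bare occurrences of surface forms that were already linked
    elsewhere in the same article, restricting candidate entities to
    the ones the article itself mentions.

    `known_links` maps a surface form to its entity id (hypothetical
    input format); `text` is assumed to be plain text with the original
    links already stripped out and recorded.
    """
    # One alternation, longest surface form first, so "Barack Obama"
    # wins over "Obama"; a single pass avoids re-linking inserted text.
    surfaces = sorted(known_links, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(s) for s in surfaces) + r")\b"
    return re.sub(pattern,
                  lambda m: f"[[{known_links[m.group(0)]}|{m.group(0)}]]",
                  text)
```

Restricting candidates to `known_links` is exactly what keeps the false-positive risk low: an ambiguous mention can only resolve to an entity the article already talks about.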
Also, the first phrase in a wiki article does not have an explicit link, but it could be linked to the id of the article without much risk. :)
I was thinking of something like this: in the article for Barack_Obama, later on in some of the paragraphs either obama or barack will be mentioned. Just assume that mention refers to Barack_Obama and create a few fake links. Whether correct or incorrect, the context will probably still make sense since it is within the entity's own article, and it is probably much better context than what you would get by only creating a link at the beginning.

I think that the first line does indeed need some preprocessing to solve these issues, but I don't think the vectors are going to be polluted by adding the first line, as it usually contains quite useful context :)
Yes, we are describing almost the same thing for intra-article entity linking :) I am proposing that you can even expand that logic to every link in the article (by considering their corresponding mentions). It would be interesting to create a small collection to evaluate this.
Got it. Well, yeah, considering how little coverage of ids this currently has, it is worth going for it.
@nickvosk A good reference we could use here is [1]. Since the surface forms referring to the article's entity are usually in bold, just as the PR there suggests, it seems to be a more informed assumption.
[1] https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/356
Edit: Updated wrong link
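For illustration, extracting the bold surface forms from raw wikitext is a short regex; this is only a sketch and ignores bold-italic and nested markup:

```python
import re

# Wikitext marks the article subject's surface forms in bold ('''...''')
# in the opening sentence, which is the convention the linked PR relies on.
BOLD = re.compile(r"'''(.+?)'''")

def bold_surface_forms(wikitext):
    """Extract bold spans, which typically name the article's entity."""
    return BOLD.findall(wikitext)
```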
Can you elaborate on how this would fix the coverage problem @dav009 ?
Also, this paper looks relevant: Noraset, Thanapon, Chandra Bhagavatula, and Doug Downey. "Adding High-Precision Links to Wikipedia."
@nickvosk :) Good reference, I think I saw it before at ACL. Well, it would help to find the right anchors to create 'fake links' for a topic within its own article.
As the paper suggests, we could also run some NEL with very high confidence values to add some extra links and probably get above the min-count threshold imposed by gensim's word2vec implementation.
@dav009 exactly :)
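To illustrate the threshold problem mentioned above, here is a stdlib sketch of word2vec-style min-count pruning (mimicking the behaviour, not gensim's actual implementation): an entity id that appears in fewer than `min_count` link contexts simply gets no vector, which is why adding high-confidence links can rescue tail entities.

```python
from collections import Counter

def vocab_above_min_count(tokens, min_count=5):
    """Mimic word2vec's min_count pruning: any token (including entity
    ids) seen fewer than min_count times gets no vector at all."""
    counts = Counter(tokens)
    return {tok for tok, c in counts.items() if c >= min_count}
```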
Looking at some old raw counts from the DBpedia Spotlight project, it seems that out of 6M topics in those counts, 4M have fewer than 5 links.
Surprisingly, filtering for topics with more than 50 links gives us 268836, which is close to our current coverage: 226319.
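The filtering described amounts to a simple count over the raw link counts; the `{topic: link count}` input is a hypothetical rendering of those DBpedia Spotlight counts:

```python
def topics_above_threshold(link_counts, threshold):
    """Count topics with more than `threshold` incoming links.

    `link_counts` maps each topic to its raw incoming-link count
    (hypothetical format for the Spotlight counts mentioned above).
    """
    return sum(1 for c in link_counts.values() if c > threshold)
```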
Hi @dav009, very promising work here!
I wrote a simple script to test the coverage of the prebuilt model for English Wikipedia articles. I used the Wikipedia article titles from a preprocessed Wikipedia Miner March 2014 dump.
Out of 4342357 articles, only 226319 had a matching vector (~5%). I have noticed that the model usually covers popular entities but does not cover tail entities. I guess this might be because words below a certain count were ignored, and because of errors in preprocessing.

Any ideas on this? I have noticed that your TODOs include resolving redirects and also co-reference resolution inside the articles, but I guess we would expect better coverage even without these.

Thanks.
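A coverage check like the one described could look like this; the `DBPEDIA_ID/` prefix is an assumption about how entity tokens are named in the corpus, and `vocab` stands in for the model's actual vocabulary (e.g. the key set of a gensim `KeyedVectors`):

```python
def model_coverage(titles, vocab, prefix="DBPEDIA_ID/"):
    """Return (hits, fraction) of article titles that have a vector.

    `titles` are Wikipedia article titles; `vocab` is any container
    supporting membership tests over the model's token names.
    """
    hits = sum(1 for title in titles if prefix + title in vocab)
    return hits, hits / len(titles)
```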