publicscience / argos

Know more with less

use dbpedia wikipedia pagelinks to assess the "commonness" of a concept #129

Closed frnsys closed 10 years ago

frnsys commented 10 years ago

http://wiki.dbpedia.org/Downloads39#wikipedia-pagelinks (NB: once extracted, the TTL dump is over 26GB, so we may need to further bump up the knowledge instance's disk space)

this may be a pretty rough/inaccurate heuristic, but intuitively it makes sense: the more pagelinks pointing to a concept, the more common we assume it to be, the more people we assume already know it, and thus the more its ranking can be weighted downwards. we want to surface concepts that fewer people know about.
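one way to get at this, sketched below: stream the extracted dump line by line and count inbound links per concept (i.e. count each triple's object). this assumes the dump is serialized as simple one-triple-per-line N-Triples with a `wikiPageWikiLink`-style predicate, which may need adjusting for the actual 3.9 dump format:

```python
from collections import Counter

def count_inbound_pagelinks(lines):
    """Count inbound pagelinks per concept from N-Triples-style lines.

    Each triple looks roughly like:
      <subject_uri> <predicate_uri> <object_uri> .
    The object of a pagelink triple is the concept being linked to,
    so the object counts are the "commonness" signal we want.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        parts = line.split(' ')
        if len(parts) < 4:
            continue
        pred, obj = parts[1], parts[2]
        # predicate name is an assumption based on the DBpedia pagelinks dataset
        if 'wikiPageWikiLink' in pred and obj.startswith('<'):
            counts[obj.strip('<>')] += 1
    return counts
```

streaming keeps memory bounded by the number of distinct concepts rather than the 26GB of triples, so it should fit on the knowledge instance without loading the dump wholesale.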

a bit of experimentation may be required to see if this assumption holds up. for example, how many pagelinks are there to "Paris"? I expect it's significantly higher than for "François Hollande", but it may not be.

i don't expect this value to change significantly very quickly, so we could "cache" the commonness value on the concept itself (i.e. as a commonness property) so the knowledge DB doesn't need to be hit for it during every ranking.
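the caching step could be a one-off batch job like the sketch below. the `uri`/`commonness` attributes are hypothetical model fields, and the log-scaling is just one assumed way to keep a few hugely-linked concepts from dominating:

```python
import math

def cache_commonness(concepts, pagelink_counts):
    """Batch job: write a commonness score onto each concept so ranking
    never has to hit the knowledge DB.

    `concepts` is any iterable of objects with a `uri` attribute and a
    writable `commonness` attribute (hypothetical model fields).
    `pagelink_counts` maps concept URI -> inbound pagelink count.
    log1p keeps the scale manageable and maps zero links to zero.
    """
    for concept in concepts:
        raw = pagelink_counts.get(concept.uri, 0)
        concept.commonness = math.log1p(raw)
```

this would run once after ingesting a new pagelinks dump, rather than per-ranking.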

in fact, depending on the ranking algorithm we end up using, we could just have a base_ranking or base_score that incorporates this commonness metric and serves as the starting point for ranking concepts against the events they are mentioned in.
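a minimal sketch of what that base score might look like, assuming we normalize the cached commonness against the maximum seen and invert it (so obscure concepts start high and common ones start low):

```python
def base_score(commonness, max_commonness):
    """Hypothetical starting score for ranking a concept.

    Normalizes the cached commonness into [0, 1] and inverts it:
    a concept nobody links to (commonness 0) starts at 1.0, the most
    common concept starts at 0. Per-event signals would then be added
    on top of this base.
    """
    if max_commonness <= 0:
        return 1.0
    return 1.0 - min(commonness / max_commonness, 1.0)
```

whether a linear inversion like this is the right shape is exactly the kind of thing the "Paris" vs "François Hollande" experiment above should inform.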