kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Find a way to reduce the size of the wikidata knowledge #162

Open oterrier opened 1 month ago

oterrier commented 1 month ago

The Wikidata knowledge base keeps growing, and it increasingly contains concepts that are of no interest for some use cases (press articles, for example). The https://www.wikidata.org/wiki/Wikidata:Statistics page shows that some classes are over-represented: scholarly article: 22,574,314 (31.5%); astronomical object: 4,601,733 (6.4%); taxon: 2,726,046 (3.8%); etc. Could we find a mechanism to filter out some well-defined P31 or P279 concepts to reduce the size of the LMDB database, either at creation or at loading?

This would be a great help for some of our customers.

Thx

Olivier

oterrier commented 1 month ago

See also the comment here about a possible implementation: https://github.com/kermitt2/entity-fishing/issues/105#issuecomment-713344980
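
To make the filtering idea in this issue concrete, here is a minimal sketch in Python of dropping entities by class while streaming a Wikidata JSON dump, before anything is written to LMDB. It is not entity-fishing code and not the implementation discussed in the linked comment. The function names (`direct_classes`, `filter_dump_lines`) and the `EXCLUDED_CLASSES` set are hypothetical, and the three QIDs are assumed to correspond to the classes named above. It only relies on the standard Wikidata dump layout: one entity per line, with statements under `claims`.

```python
import json
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical exclusion list; the QIDs are assumed to denote
# scholarly article, astronomical object and taxon respectively.
EXCLUDED_CLASSES: Set[str] = {"Q13442814", "Q6999", "Q16521"}


def direct_classes(entity: Dict, properties: Tuple[str, ...] = ("P31", "P279")) -> Set[str]:
    """Collect the QIDs an entity points to directly via the given properties."""
    ids: Set[str] = set()
    claims = entity.get("claims", {})
    for prop in properties:
        for statement in claims.get(prop, []):
            snak = statement.get("mainsnak", {})
            # "novalue" / "somevalue" snaks carry no datavalue; skip them.
            if snak.get("snaktype") != "value":
                continue
            value = snak.get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                ids.add(value["id"])
    return ids


def filter_dump_lines(lines: Iterable[str],
                      excluded: Set[str] = EXCLUDED_CLASSES) -> Iterator[Dict]:
    """Yield only the entities whose direct P31/P279 values avoid `excluded`."""
    for line in lines:
        # The dump is one big JSON array: "[", then one entity per line
        # with a trailing comma, then "]".
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        if direct_classes(entity) & excluded:
            continue  # filtered out: never reaches the database
        yield entity
```

Note that this only checks direct P31/P279 values. An entity that is an instance of a subclass of an excluded class would still pass, so a real filter would probably need to expand the exclusion list along the P279 hierarchy first.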