semanticize / semanticizer

Entity Linking for the masses
http://semanticize.uva.nl/
GNU General Public License v3.0

Get rid of WPM's BerkeleyDB dependency #23

Open graus opened 10 years ago

graus commented 10 years ago

Currently we get from WPM's BerkeleyDB:

We want to replace it with full articles (from the Wikipedia XML dump?) and our own implementation of the relatedness calculation, which is not very complex.
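
To give an idea of how small the relatedness part could be, here is a rough sketch. It assumes the calculation meant is the link-based measure by Milne and Witten that Wikipedia Miner is built around, computed from the sets of articles linking to each page. The function name, parameter names, and the toy numbers are made up for illustration; this is not WPM's actual code.

```python
import math


def relatedness(inlinks_a, inlinks_b, total_articles):
    """Link-based relatedness in [0, 1] (sketch of the Milne-Witten measure).

    inlinks_a, inlinks_b: iterables of ids of articles linking to a and b.
    total_articles: number of articles in the Wikipedia dump (assumed known).
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        # No shared in-links: treat the two articles as unrelated.
        return 0.0

    larger = max(len(a), len(b))
    smaller = min(len(a), len(b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator <= 0:
        # Every article links to both pages, so the distance is undefined;
        # treat them as fully related.
        return 1.0

    distance = (math.log(larger) - math.log(len(common))) / denominator
    # The distance can exceed 1 for rarely co-linked pages, so clamp at 0.
    return max(0.0, 1.0 - distance)


# Toy example: two articles sharing two of their four in-links.
score = relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_articles=100)
```

With in-link sets extracted from the XML dump, this would need nothing from BerkeleyDB beyond the link data itself.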

larsmans commented 10 years ago

Is BerkeleyDB the problem or the current format? BDB is quite fast.

dodijk commented 10 years ago

The problem is the BDB version that WPM uses (Java Edition): see https://github.com/semanticize/semanticizer/issues/14#issuecomment-20591381 and the Oracle BerkeleyDB FAQ.

dodijk commented 10 years ago

Got a bit further with this. I'm able to convert the DBs from the Java Edition format to the regular one. The next hurdle is that the values seem to use more complicated formats: strings are in UTF-8, but the rest? If anyone wants to have a go: /scratch/dodijk/BerkeleyDB on zookst13.

In [1]: import bsddb, codecs
In [2]: db = bsddb.btopen("nlwiki-20111104-label.db")
In [3]: db.get(codecs.encode(u'Gro\xdf Vahlberg\x00', 'utf-8'))
Out[3]: '\x01\x01\x02\x02\x01\x8d\n\xe9\xa4\x01\x01\x00\x00'
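
For anyone picking this up, it may help to look at the raw bytes side by side across several keys. A minimal sketch (Python 3, no bsddb needed; the helper name is hypothetical) that formats the value above as hex. It makes no assumption about what the individual fields mean.

```python
def hex_bytes(value: bytes) -> str:
    """Return the bytes of a raw DB value as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in value)


# The value returned for u'Gro\xdf Vahlberg\x00' in the session above.
value = b'\x01\x01\x02\x02\x01\x8d\n\xe9\xa4\x01\x01\x00\x00'
dump = hex_bytes(value)
print(len(value), dump)
```

Dumping a handful of labels this way and lining up the output should show which positions stay constant and which vary.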

dodijk commented 10 years ago

How is this task different from #14, @graus?

graus commented 10 years ago

The idea was to split the work into removing the BDB dependency (definitions, relatedness) versus WPM's CSV dependency (everything else), but I think I failed to state it as such ;-).