graus opened 10 years ago
Is BerkeleyDB the problem or the current format? BDB is quite fast.
It's the BDB version that WPM uses — see https://github.com/semanticize/semanticizer/issues/14#issuecomment-20591381 and the Oracle BerkeleyDB FAQ.
I've gotten a bit further with this. I'm able to convert the DBs from the Java Edition to the regular one. The next hurdle seems to be the more complicated formats used for the values: strings are in UTF-8, but what about the rest? If anyone wants to have a go: /scratch/dodijk/BerkeleyDB on zookst13.
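Once the JE records are readable as (key, value) byte pairs, copying them into a non-JE store is a plain iteration. A minimal sketch of that copy step — `dbm.dumb` stands in for `bsddb.btopen` here (bsddb is Python-2-only), and `copy_records` is a name I made up:

```python
import dbm.dumb

def copy_records(records, path):
    """Write an iterable of (key, value) byte pairs into a plain
    dbm file at `path`. dbm.dumb is a stdlib stand-in for the
    bsddb btree store used elsewhere in this thread."""
    with dbm.dumb.open(path, 'n') as out:
        for key, value in records:
            out[key] = value

# Example: keys are UTF-8 encoded labels, values are opaque bytes.
copy_records(
    [(u'Gro\xdf Vahlberg\x00'.encode('utf-8'), b'\x01\x02'),
     (b'Amsterdam\x00', b'\x03\x04')],
    'labels-converted')
```

The same loop works against any dict-like target, so swapping `dbm.dumb.open` for a real BDB handle should be mechanical.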
```python
In [1]: import bsddb, codecs

In [2]: db = bsddb.btopen("nlwiki-20111104-label.db")

In [3]: db.get(codecs.encode(u'Gro\xdf Vahlberg\x00', 'utf-8'))
Out[3]: '\x01\x01\x02\x02\x01\x8d\n\xe9\xa4\x01\x01\x00\x00'
```
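The value format is still unknown, so the best we can do is inspect the raw bytes. A small helper (my own, not part of WPM) that dumps a value as spaced hex, which makes candidate encodings — packed ints, varints, flag bytes — easier to eyeball:

```python
def hexdump(raw):
    """Render a byte string as space-separated hex octets for inspection."""
    return ' '.join('%02x' % b for b in bytearray(raw))

# The opaque value returned for u'Gro\xdf Vahlberg' above:
value = b'\x01\x01\x02\x02\x01\x8d\n\xe9\xa4\x01\x01\x00\x00'
print(hexdump(value))  # 01 01 02 02 01 8d 0a e9 a4 01 01 00 00
```

Note the trailing `00 00` and the repeated leading `01` bytes — possibly counts or field tags, but that's speculation until someone checks WPM's Java serialization code.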
How's this task different from #14, @graus?
The thought was to split it into removing the BDB dependency (as in: definitions, relatedness) vs. WPM's CSV dependency (all the other stuff), but I think I failed to state it as such ;-).
Currently we get from WPM's BerkeleyDB:
We want to replace it with full articles (from the Wikipedia XML dumps?) and our own implementation of the relatedness calculation (which is not very complex).
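For reference, the relatedness measure WPM uses is Milne & Witten's link-based measure, which really is simple to reimplement. A minimal sketch — the function name and example link sets are mine, and this assumes relatedness is computed over sets of in-link article ids:

```python
import math

def relatedness(inlinks_a, inlinks_b, total_articles):
    """Milne & Witten link-based relatedness in [0, 1].

    inlinks_a / inlinks_b: sets of article ids linking to each article.
    total_articles: total number of articles in the Wikipedia (|W|).
    Returns 1.0 for identical link sets, 0.0 for disjoint ones.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not a or not b or not common:
        return 0.0
    # Normalized Google-distance-style formula over in-link sets.
    distance = ((math.log(max(len(a), len(b))) - math.log(len(common)))
                / (math.log(total_articles) - math.log(min(len(a), len(b)))))
    return max(0.0, 1.0 - distance)
```

So given the in-link sets (which we'd extract from the XML dump ourselves), this drops the BDB dependency for relatedness entirely.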