kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Test and possibly migrate to lmdbjava #75

Open kermitt2 opened 6 years ago

kermitt2 commented 6 years ago

lmdbjava is apparently better maintained (more features and builds for more OSes) and faster. Also, I never got the zero-copy mode working reliably with lmdbjni, so it is worth trying lmdbjava for this too.

lfoppiano commented 6 years ago

Everything seems to work properly except one thing: the number of readers is limited, and asynchronous calls from the JavaScript frontend result in exceptions. https://github.com/lmdbjava/lmdbjava/issues/65#issuecomment-386505879

lfoppiano commented 6 years ago

The commit 7372366 should have solved the last issue with the number of readers.
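
For context, LMDB keeps a fixed-size table of reader slots and fails with MDB_READERS_FULL once every slot is taken; lmdbjava exposes the table size through a max-readers setting on its environment builder. Independently of that setting, one way to keep bursts of asynchronous requests from exhausting the slots is to cap the number of concurrent reads on the application side. The class below is a minimal sketch of that idea using only the JDK. `BoundedReaders` and its methods are hypothetical names, and this is not necessarily what commit 7372366 does.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: caps the number of read operations running at once,
// so callers wait for a free slot instead of failing when the table is full.
final class BoundedReaders {
    private final Semaphore slots;

    // maxReaders should not exceed the reader limit configured for the LMDB environment.
    BoundedReaders(int maxReaders) {
        this.slots = new Semaphore(maxReaders);
    }

    // Runs readOperation while holding one slot; the slot is released even if it throws.
    <T> T read(Supplier<T> readOperation) {
        slots.acquireUninterruptibly();
        try {
            return readOperation.get();
        } finally {
            slots.release();
        }
    }

    // Number of slots currently free.
    int availableSlots() {
        return slots.availablePermits();
    }
}
```

In a real service, the supplier passed to `read` would open a read transaction, do the lookup, and close the transaction before returning, so that the LMDB slot is freed at the same time as the semaphore permit.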

lfoppiano commented 6 years ago

KB: 37413613 concepts.
EN: 14899737 pages.
DE: 3579552 pages.
FR: 3681264 pages.
ES: 3322291 pages.
IT: 2291751 pages.

lfoppiano commented 6 years ago

For languages other than English, the domains are not resolved. Not a clue why.

kermitt2 commented 6 years ago

It's because the domains are derived from the English categories only. The other languages do not have the same category hierarchy (so we would need a mapping per language), and they have a much smaller set of categories.

lfoppiano commented 6 years ago

I wasn't clear. I meant that with this branch there are no domains at all, while on the master version the domains are in the output JSON.

Might be solved by rebuilding all the databases?

kermitt2 commented 6 years ago

Mmm, I don't understand. Are the domains not produced for English, or are the domains not in the disambiguation result JSON?

The domains are built once by the Upper KB, just after the Lower KB for English is built. As with any db, if you want to force a rebuild, just delete the lmdb files and relaunch.
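
The "delete the lmdb files" step can be sketched as below, assuming each database lives in its own directory and uses LMDB's default file names, `data.mdb` and `lock.mdb`. The class name `LmdbReset` and the directory passed in are hypothetical; the actual location of the entity-fishing databases depends on the local configuration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: removes the LMDB files of one database directory
// so that the database is rebuilt on the next launch.
final class LmdbReset {
    // Default file names LMDB uses inside an environment directory.
    private static final String[] LMDB_FILES = {"data.mdb", "lock.mdb"};

    // Deletes the LMDB files found in envDir and returns the paths actually removed.
    // Other files in the directory are left untouched.
    static List<Path> deleteLmdbFiles(Path envDir) throws IOException {
        List<Path> removed = new ArrayList<>();
        for (String name : LMDB_FILES) {
            Path file = envDir.resolve(name);
            if (Files.deleteIfExists(file)) {
                removed.add(file);
            }
        }
        return removed;
    }
}
```

This should only be run while the service is stopped, since deleting files under an open LMDB environment is not safe.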

lfoppiano commented 6 years ago

They are not in the output JSON. But it was just a note on the task.

lfoppiano commented 6 years ago

before lmdbjava:

[screenshot, 2018-06-18 at 15:52:25]

after lmdbjava:

[screenshot, 2018-06-18 at 15:52:32]

lfoppiano commented 6 years ago

Interestingly, the total numbers of concepts and pages match (see https://github.com/kermitt2/entity-fishing/issues/50)

kermitt2 commented 6 years ago

It might be that the interlingual files are missing from your resource files, so the KB cannot relate the English domain to an Italian entity.