IMPORTANT: No longer in active development.
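Idio's Spotlight Model Editor allows you to manually tweak DBpedia Spotlight's models: explore surface forms, check and edit the context of topics, make surface forms spottable or unspottable, copy candidates between surface forms, remove associations, and apply bulk updates from a file.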
The default branches (Development/Master) work on Spotlight 0.6 models. If you downloaded your model from http://spotlight.sztaki.hu/downloads/version-0.1/ then it is a 0.6 model.
The branch feature/code-clean-up-0-7 works on Spotlight 0.7 models. If you downloaded your model from http://spotlight.sztaki.hu/downloads/ then it is a 0.7 model.
In order to use the Model Editor, you will need Java and Maven (mvn). We assume that you have the correct versions of both on your system.
We also recommend using IntelliJ for editing the code. See below for instructions on how to set up a project.
The language models consume a lot of computational resources, so in these instructions we use the model for Turkish (located in the tr folder). Feel free to play with other languages if you have a big machine.
mvn package appassembler:assemble
sh target/bin/model-editor explore path-to-model/en/model/ 20
It should print the stats for 20 surface forms.
The mvn package step generates a jar with all the dependencies in the target folder, along with a script that calls the jar with default values, including a 15g heap. If you want to override the heap size you can either (i) change the appassembler-maven-plugin settings in the pom, or (ii) call the jar directly with java -Xmx.. -jar ... followed by the commands shown in this readme.
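Option (ii) can be sketched as follows. The 8g heap is only an illustrative value (whether it is enough depends on the model), and the jar name is the one used in the jar invocation shown later in this readme. The command is printed rather than executed so that it can be checked first.

```shell
# Illustrative heap size (an assumption); pick one that fits your machine and model.
heap=8g
jar=target/idio-spotlight-model-0.1.0-jar-with-dependencies.jar

# Build the direct jar call with the smaller heap and show it.
cmd="java -Xmx$heap -Xms$heap -jar $jar explore path-to-model/en/model/ 20"
echo "$cmd"
```

Once the printed command looks right, paste it into the terminal to run it.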
File -> Import Project -> select the POM project.
Preferences -> Compiler: add '-Xmx5G' to 'Additional VM options'.
Go to the SpotlightModelReader class, right click Main and select 'run scala console'. Enjoy.
Start by freeing as much RAM as possible.
Each of the tools below is invoked through a command, which refers to calling the jar/script as follows:
using the generated script:
sh target/bin/model-editor <command> <subcommand> arg1 arg2
using the generated jar:
java -Xmx15g -Xms15g -jar target/idio-spotlight-model-0.1.0-jar-with-dependencies.jar <command> <subcommand> arg1 arg2
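command: explore
arg1: path to DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
arg2: number of surface forms to show
result: prints the stats for that many surface forms
example: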
sh target/bin/model-editor explore path-to-turkish/tr/model/ 40
All topic related actions are carried out using the topic command followed by one of the following subcommands:
search: checks whether a topic is in the stores
check-context: prints the context of a topic
clean-set-context: cleans and sets the context of a topic
command: topic
subcommand: search
arg1: path to DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
arg2: DbpediaId of the topic
result: looks for the DbpediaId in the model and returns whether that topic exists in the model
example:
sh target/bin/model-editor topic search path/to/model Michael_Schumacher
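command: topic
subcommand: check-context
arg1: path to DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
arg2: topics whose context should be printed, separated by |
example: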
sh target/bin/model-editor topic check-context /mnt/share/spotlight/en/model Barack_Obama\|United_States
command: topic
subcommand: clean-set-context
arg1: path to DBpedia Spotlight model, e.g. pathToSpotlightModel/model
arg2: path to the input file. Each line of the input file should be like:
dbpediaUri <tab> contextWordsSeparatedByPipe <tab> countsSeparatedByPipe
contextWordsSeparatedByPipe and countsSeparatedByPipe must contain the same number of entries.
example:
sh target/bin/model-editor topic clean-set-context /mnt/share/spotlight/en/model folder/fileWithContextChanges
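As a minimal sketch of the input file format described above, the snippet below writes a two-line file and checks that every line has three tab-separated fields with as many counts as context words. The topics, words, counts and the temporary file are made up for illustration, and the awk check is not part of the Model Editor.

```shell
# Hypothetical context changes: topic <tab> words <tab> counts.
changes=$(mktemp)
printf 'Barack_Obama\tpresident|senate|election\t12|7|3\n' >  "$changes"
printf 'United_States\tcountry|federal\t20|4\n'            >> "$changes"

# Collect the numbers of the lines that do not have exactly three fields,
# or whose word list and count list differ in size.
bad=$(awk -F'\t' '
  { nw = split($2, w, "|"); nc = split($3, c, "|") }
  NF != 3 || nw != nc { print NR }
' "$changes")

if [ -z "$bad" ]; then
  echo "ok: $changes"
else
  echo "malformed lines: $bad"
fi
```

The resulting file can then be passed as the last argument of topic clean-set-context.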
All surface form related actions are carried out using the surfaceform command followed by one of the following subcommands:
stats: prints the stats of a surface form
candidates: prints the list of candidates of a surface form
make-spottable: makes surface forms spottable
make-unspottable: makes surface forms unspottable
copy-candidates: adds all candidates of a surfaceFormB to a surfaceformA
command: surfaceform
subcommand: stats
arg1: path to DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
arg2: surface form
example:
sh target/bin/model-editor surfaceform stats ~/Downloads/tr/model/ evrimleri
This outputs statistics for the surface form evrimleri.
command: surfaceform
subcommand: candidates
arg1: path to DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
arg2: surface form
example:
sh target/bin/model-editor surfaceform candidates ~/Downloads/tr/model/ evrimleri
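This checks the candidate topics for the surface form evrimleri.
command: surfaceform
subcommand: make-unspottable
arg1: path to DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
arg2: list of surface forms separated by |, e.g. how\|How\|Hello\ World, or the path to a file with the surface forms (if -f is passed)
result: the given SF won't be spottable anymore
example:
sh target/bin/model-editor surfaceform make-unspottable path/to/model surfaceForm1\|surfaceForm2\|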
sh target/bin/model-editor surfaceform make-unspottable path/to/model pathTo/File/withSF -f
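The backslashes in the surface form list only protect | and spaces from the shell; the program receives the list without them. The sketch below illustrates this with plain POSIX shell behaviour, nothing specific to the Model Editor. show_args is a helper defined here for illustration only.

```shell
# Illustration-only helper: prints the number of arguments and the first one.
show_args() { printf '%s arg(s), first: %s\n' "$#" "$1"; }

show_args how\|How\|Hello\ World     # 1 arg(s), first: how|How|Hello World
show_args 'how|How|Hello World'      # 1 arg(s), first: how|How|Hello World
show_args how\|How\|Hello World      # 2 arg(s), first: how|How|Hello
```

Single quotes are therefore an alternative to escaping each character, while an unescaped space splits the list into two arguments.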
command: surfaceform
subcommand: copy-candidates
arg1: path to DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
arg2: path to a file containing pairs of surface forms. Each line should be:
<originSurfaceForm> <tab> <destinySurfaceForm>
result: copies the candidate topics of each originSurfaceForm as candidate topics of destinySurfaceForm
example:
sh target/bin/model-editor surfaceform copy-candidates path/to/model pathToFile
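A minimal sketch of such a pairs file, assuming two made-up pairs of surface forms. Since surface forms can contain spaces, the tab is written explicitly with printf instead of being typed by hand.

```shell
# Hypothetical pairs: origin surface form <tab> destination surface form.
pairs=$(mktemp)
printf '%s\t%s\n' 'Barack Obama'  'Obama' >  "$pairs"
printf '%s\t%s\n' 'United States' 'USA'   >> "$pairs"

# Every line must contain exactly two tab-separated fields.
awk -F'\t' '
  NF != 2 { print "line " NR " is malformed"; bad = 1 }
  END { exit bad + 0 }
' "$pairs" && echo "pairs file looks well formed: $pairs"
```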
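command: surfaceform
subcommand: make-spottable
arg1: path to DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
arg2: list of surface forms separated by |, e.g. how\|How\|Hello\ World, or the path to a file with the surface forms (if -f is passed)
result: the given SF will be spottable
example: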
sh target/bin/model-editor surfaceform make-spottable path/to/model surfaceForm1\|surfaceForm2\|
sh target/bin/model-editor surfaceform make-spottable path/to/model pathTo/File/withSF -f
All association related actions are carried out using the association command followed by one of the following subcommands:
remove: removes associations between topics and surface forms
command: association
subcommand: remove
arg1: path to DBpedia Spotlight model, e.g. pathToSpotlightModel/model
arg2: path to the input file. Every line in the input file describes an association which will be deleted; each line should follow the format:
dbpediaURI <tab> Surface Form
example:
sh target/bin/model-editor association remove /mnt/share/spotlight/en/model /path/to/file/file_with_associations
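As an illustration of the association file, the loop below writes one line per surface form to detach from a single topic. The topic and surface forms are invented for the example.

```shell
# Hypothetical input: remove two surface forms from one topic.
topic='Michael_Schumacher'
assoc=$(mktemp)
for sf in 'Schumacher' 'Michael S.'; do
  printf '%s\t%s\n' "$topic" "$sf"   # dbpediaURI <tab> Surface Form
done > "$assoc"

cat "$assoc"
```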
When updating the model with lots of SF, Topics and Context Words, it is best to do it from a file.
Each line of the file should follow the format:
dbpedia_id <tab> surfaceForm1|surfaceForm2... <tab> contextW1|contextW2... <tab> contextW1Counts|ContextW2Counts
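A small sketch of one line in this format, followed by a quick summary of what the file would change. The topic, surface forms, context words and counts are invented, and the awk summary is only a local sanity check, not a feature of the Model Editor.

```shell
# Hypothetical entry: one topic, two surface forms, three context words with counts.
update=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  'Barack_Obama' \
  'Obama|Barack Obama' \
  'president|senate|election' \
  '12|7|3' > "$update"

# Summarise each well-formed line (four tab-separated fields).
summary=$(awk -F'\t' 'NF == 4 {
  printf "%s: %d surface forms, %d context words\n", $1, split($2, s, "|"), split($3, w, "|")
}' "$update")
echo "$summary"
```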
Before making actual changes to the model, it might be useful to see how many SF, dbpedia topics and links between those two are missing:
sh target/bin/model-editor file-update check path/to/en/model path_to_file/with/model/changes
Make sure you have enough RAM to hold all the models, which should be around 15g, then run:
sh target/bin/model-editor file-update all path/to/en/model path_to_file/with/model/changes
If you don't have enough RAM, you can update the SF and DbpediaTopics in one step and the Context Words in another; this requires less memory.
Updating the surfaceform, resource and candidate stores (steps 1-4):
rename context.mem to context2.mem; this prevents the jar from loading the context store
update the surfaceform store, resource store and candidate store:
sh target/bin/model-editor file-update all path/to/en/model path_to_file/with/model/changes
path_to_file/with/model/changes_just_context will be generated after running the previous command. This file maps dbpediaIds (the model's internal indexes) to contextWords, and it is processed in the steps below.
Updating the context store (steps 5-6):
rename context2.mem back to context.mem, and rename every other file in the model folder to something else (if this is not done, the stores will be loaded and they will consume all your RAM)
update the context store:
sh target/bin/model-editor file-update context-only path/to/en/model path_to_file/with/model/changes_just_context
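The renaming procedure above can be sketched as a shell function. This is only an outline under stated assumptions: the .bak suffix is an arbitrary choice for "something else", the editor wrapper and its RUN switch are helpers invented here so that the commands are printed by default, and the sketch does not rename the hidden files back afterwards.

```shell
# Assumed helper: prints the model-editor call unless RUN=1 is set.
editor() {
  if [ "${RUN:-0}" = 1 ]; then
    sh target/bin/model-editor "$@"
  else
    echo "would run: model-editor $*"
  fi
}

two_step_update() {
  model=$1     # model folder, e.g. path/to/en/model
  changes=$2   # file with model changes

  # Hide the context store, then update the other stores.
  mv "$model/context.mem" "$model/context2.mem" || return 1
  editor file-update all "$model" "$changes" || return 1

  # Restore the context store and hide every other file.
  mv "$model/context2.mem" "$model/context.mem" || return 1
  for f in "$model"/*; do
    [ -e "$f" ] || continue
    case $f in
      */context.mem|*.bak) ;;                # leave the context store visible
      *) mv "$f" "$f.bak" || return 1 ;;     # ".bak" is an arbitrary suffix
    esac
  done
  editor file-update context-only "$model" "${changes}_just_context"
}

# Usage: two_step_update path/to/en/model path_to_file/with/model/changes
```

Running it without RUN=1 first shows which commands would be issued, although the renames are performed either way, so it is safer to try it on a copy of the model folder.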
Steps 1-4 can be applied on their own, ignoring 5 and 6, when the changes only involve:
SFs
SFs with an already existing Dbpedia Topic
Steps 5-6 can be applied on their own, ignoring 1-4, when the changes only involve the context words of a Dbpedia Topic.
Important:
steps 1-4 will only add SF and Dbpedia Topics if they don't exist.
steps 1-4 will make all specified SF spottable.
steps 5-6 only ADD context words to the context of a Dbpedia Topic.
The best way to play with the models and modify them is to use the scala console.
JAVA_OPTS="-Xmx15000M -Xms15000M" scala
Once you start a scala console you can use it like ipython to create instances of the scala classes we have, load the models, check whether dbpedia ids exist, add new dbpedia ids, add new surface forms, etc.
Run: :cp pathTo/ModelEditor.jar
This will load the classes inside the model editor. After that you should be able to play with the classes inside the jar.
Example:
var spotlightModel = org.idio.dbpedia.spotlight.Main.getSpotlightModel( "/Users/dav009/Downloads/tr/model/")
spotlightModel.showSomeSurfaceForms()
spotlightModel.getStatsForSurfaceForm("evrimleri")
spotlightModel.searchForDBpediaResource("ikimono_gakari_dbpedia_uri")
spotlightModel.addNew("ikimono_gakari_sf","ikimono_gakari_dbpedia_uri",1,Array())
spotlightModel.exportModels("/new/path/of/folder/model/")
tools/explore.scala contains a script which can be preloaded into the scala terminal. It imports the classes and stores needed to play with the model at a low level.
In order to use it:
Run: JAVA_OPTS="-Xmx9000M -Xms9000M" scala
Note: adjust the Java heap options to your needs. If you are using all the stores, use around 15g.
Once you are in the scala console, run: :load tools/explore.scala
This will preload the following objects:
resStore: resource store
sfStore: surface form store
candidateMap: candidate store
tokenStore: token type store
contextStore: context token store
If you are interested in Knowledge Mining, NLP or Software Engineering you should take a look at our jobs page. We're always on the lookout for awesome people to join our team.
Copyright 2014 Idio
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0