idio / spotlight-model-editor

Tool for tweaking dbpedia spotlight's models
Apache License 2.0

IMPORTANT: No longer in active development.


Spotlight Model's Editor

Idio's Spotlight Model Editor lets you manually tweak DBpedia Spotlight's models. It allows you to manually:

In order to use the Model Editor, you will need:

We also recommend using IntelliJ for editing the code. See below for instructions on how to set up a project.

Compiling

We assume that you have the correct versions of Java and Maven (mvn) installed on your system.

The language models consume a lot of memory, so in these instructions we use the model for Turkish (located in the tr folder). Feel free to play with other languages if you have a big machine.

Compiling Idio's Dbpedia Model Editor

  1. Clone this repo
  2. Go to the repo's folder
  3. Run mvn package appassembler:assemble
  4. Call
sh target/bin/model-editor explore path-to-model/en/model/ 20

It should print the stats for 20 surface forms.

Step 3 builds a jar with all the dependencies in the target folder, then generates a script for calling that jar. The script sets the heap to a default of 15g. To override this value, either (i) change the appassembler-maven-plugin settings in the pom, or (ii) call the jar directly with java -Xmx.. -jar ... followed by the commands shown in this readme.
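As a minimal sketch of option (ii), the snippet below wraps the direct jar call in a small shell function with a configurable heap size. The variable HEAP_SIZE and the function model_editor_cmd are hypothetical names introduced only for this illustration; the jar name is the one used in the jar example in this readme. The function only prints the command so that it can be inspected before running it.

```shell
# Hypothetical helper, not part of the model editor.
HEAP_SIZE="8g"   # heap passed to -Xmx; pick a value that fits your machine
JAR="target/idio-spotlight-model-0.1.0-jar-with-dependencies.jar"

model_editor_cmd() {
  # Print the java command line for the given model-editor arguments.
  echo java "-Xmx${HEAP_SIZE}" -jar "$JAR" "$@"
}

model_editor_cmd explore path-to-model/en/model/ 20
```

Once the printed command looks right, it can be run as is, or the echo can be dropped from the function.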

Importing Project

  1. Get IntelliJ
  2. Go to File -> Import Project -> Select POM Project
  3. Give the project enough RAM: go to Preferences -> Compiler and add '-Xmx5G' to 'Additional VM options'
  4. Navigate to the SpotlightModelReader class, right-click Main, select run scala console, and enjoy

Editing a model

Start by freeing as much RAM as possible.

Each command described below is run by calling the script or the jar as follows.

Using the generated script:

sh target/bin/model-editor <command> <subcommand> arg1 arg2

Using the generated jar:

java -Xmx15g -Xms15g -jar target/idio-spotlight-model-0.1.0-jar-with-dependencies.jar <command> <subcommand> arg1 arg2

Exploring a Model

example:

sh target/bin/model-editor explore path-to-turkish/tr/model/ 40

Topics

All topic related actions are carried out using the topic command followed by one of the following subcommands:

Searching a Topic

example:

sh target/bin/model-editor topic search path/to/model Michael_Schumacher

Check the Context words and counts of a topic

example:

sh target/bin/model-editor topic check-context /mnt/share/spotlight/en/model Barack_Obama\|United_States

Set the Context Words of a Topic

Each line of the input file should have the format:

dbpediaUri <tab> contextWordsSeparatedByPipe <tab> countsSeparatedByPipe

contextWordsSeparatedByPipe and countsSeparatedByPipe must contain the same number of entries.

example:

sh target/bin/model-editor topic clean-set-context /mnt/share/spotlight/en/model folder/fileWithContextChanges 
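As a rough sketch of the input format described above, the following shell snippet writes a small context file and checks that every line has three tab-separated fields with the same number of context words and counts. The file name, the topics, the words and the counts are made up for illustration, and the check is not part of the model editor.

```shell
# Hypothetical file name and contents, for illustration only.
context_file="${TMPDIR:-/tmp}/context_changes.tsv"

# Each line: dbpediaUri <tab> contextWordsSeparatedByPipe <tab> countsSeparatedByPipe
printf '%s\t%s\t%s\n' \
  "Barack_Obama"  "president|senator|election" "12|7|5" \
  "United_States" "country|federal"            "20|9" > "$context_file"

# Collect every line that does not have exactly 3 fields, or whose number
# of context words differs from its number of counts.
bad_lines=$(awk -F '\t' '
  { nw = split($2, w, "|"); nc = split($3, c, "|") }
  NF != 3 || nw != nc { print NR ": " $0 }
' "$context_file")

if [ -z "$bad_lines" ]; then
  echo "all lines look well formed"
else
  echo "malformed lines:"
  echo "$bad_lines"
fi
```

A file that passes this check can then be given to topic clean-set-context in place of folder/fileWithContextChanges.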

Surface Forms

All surface form related actions are carried out using the surfaceform command followed by one of the following subcommands:

Stats of a Surface Form

example:

sh target/bin/model-editor surfaceform stats ~/Downloads/tr/model/ evrimleri

This outputs statistics for the surface form evrimleri.

Getting the Candidate Topics of a Surface Form

example:

sh target/bin/model-editor surfaceform candidates ~/Downloads/tr/model/ evrimleri

This lists the candidate topics for the surface form evrimleri.

Making a list of Surface Forms Unspottable

sh target/bin/model-editor surfaceform make-unspottable path/to/model surfaceForm1\|surfaceForm2\|
sh target/bin/model-editor surfaceform make-unspottable path/to/model pathTo/File/withSF -f
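When the surface forms come from a list, the pipe-separated argument can be built rather than typed by hand. The sketch below is one way to do that in plain shell; join_with_pipes is a hypothetical helper written for this illustration, and the surface forms are placeholders.

```shell
# Hypothetical helper: join all arguments with a literal "|" between them.
# The function body runs in a subshell, so changing IFS does not leak out.
join_with_pipes() (
  IFS='|'
  printf '%s\n' "$*"
)

sf_arg=$(join_with_pipes "surfaceForm1" "surfaceForm2" "surfaceForm3")
echo "$sf_arg"

# Quoting the variable passes it as a single argument, so the pipes do not
# need backslashes:
#   sh target/bin/model-editor surfaceform make-unspottable path/to/model "$sf_arg"
```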

Copy Candidates

example:

sh target/bin/model-editor surfaceform copy-candidates path/to/model pathToFile

Making a list of Surface Forms Spottable

example:

sh target/bin/model-editor surfaceform make-spottable path/to/model surfaceForm1\|surfaceForm2\|
sh target/bin/model-editor surfaceform make-spottable path/to/model pathTo/File/withSF -f

Associations

All association related actions are carried out using the association command followed by one of the following subcommands:

Deleting Associations between SF and Topics

Every line in the input file describes one association to delete. Each line should follow the format:

dbpediaURI <tab> Surface Form

example:

sh target/bin/model-editor association remove /mnt/share/spotlight/en/model /path/to/file/file_with_associations

Updating Model From File

When updating the model with lots of SF, topics and context words, it is best to do it from a file. Each line of the file should follow the format:

dbpedia_id <tab> surfaceForm1|surfaceForm2... <tab> contextW1|contextW2... <tab> contextW1Counts|ContextW2Counts
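To make the four-column format concrete, here is a small sketch that writes an update file and prints a one-line summary per entry. The file name and all values are invented for illustration. The "ok" check assumes that the context words and their counts are expected to line up one to one, as this readme requires for the context file format; the summary itself is not a feature of the model editor.

```shell
# Hypothetical file name and contents, for illustration only.
update_file="${TMPDIR:-/tmp}/model_changes.tsv"

# dbpedia_id <tab> surface forms <tab> context words <tab> context word counts
printf '%s\t%s\t%s\t%s\n' \
  "Michael_Schumacher" "Schumacher|Michael Schumacher" "formula|racing|driver" "10|8|6" \
  "Barack_Obama"       "Obama|Barack Obama"            "president|senator"     "12|7" > "$update_file"

# For each entry, report the number of surface forms and context words, and
# flag lines that lack 4 fields or whose words and counts differ in number.
summary=$(awk -F '\t' '
  {
    nsf = split($2, sf, "|"); nw = split($3, w, "|"); nc = split($4, c, "|")
    status = (NF == 4 && nw == nc) ? "ok" : "check this line"
    print $1 ": " nsf " surface forms, " nw " context words, " status
  }
' "$update_file")
echo "$summary"
```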

Insight

Before making actual changes to the model, it might be useful to see how many SF, DBpedia topics and links between the two are missing:

sh target/bin/model-editor file-update check path/to/en/model path_to_file/with/model/changes

Updating a model From File (All in One Go)

Make sure you have enough RAM to hold all the models, which should be around 15g. Run:

sh target/bin/model-editor file-update all path/to/en/model path_to_file/with/model/changes

Updating a model From File (Two Steps)

If you don't have enough RAM, you can update the SF and DBpedia topics in one step and the context words in another; this requires less memory.

  1. Go to the model folder and rename context.mem to context2.mem. This prevents the jar from loading the context store.
  2. Calling the following command will update the surface form store, resource store and candidate store:
    sh target/bin/model-editor file-update all path/to/en/model path_to_file/with/model/changes
  3. Running the previous command generates a new file, path_to_file/with/model/changes_just_context. This file maps dbpediaIds (the model's internal indexes) to contextWords, and it is processed in step 5.
  4. Rename context2.mem to context.mem, and rename every other file in the model folder to something else. (If this is not done, the stores will be loaded and they will consume all your RAM.)
  5. Calling the following will update the context store:
    sh target/bin/model-editor file-update context-only path/to/en/model path_to_file/with/model/changes_just_context
  6. Rename all files back to their usual names and enjoy a freshly baked model
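The renaming in the steps above can be scripted. The following is a sketch only: the function names and the .hidden suffix are arbitrary choices made for this illustration, and the only file name taken from this readme is context.mem. It assumes the model editor ignores store files that are not under their usual names, as the steps imply.

```shell
# Suffix used to hide store files from the model editor (an arbitrary choice).
HIDE_SUFFIX=".hidden"

# Step 1: hide the context store so it is not loaded.
hide_context_store() {
  mv "$1/context.mem" "$1/context2.mem"
}

# Step 4: bring the context store back and hide every other file.
hide_other_stores() {
  mv "$1/context2.mem" "$1/context.mem"
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    case "$f" in
      */context.mem|*"$HIDE_SUFFIX") ;;   # leave these untouched
      *) mv "$f" "$f$HIDE_SUFFIX" ;;
    esac
  done
}

# Step 6: give every hidden file its original name back.
restore_stores() {
  for f in "$1"/*"$HIDE_SUFFIX"; do
    [ -f "$f" ] || continue
    mv "$f" "${f%"$HIDE_SUFFIX"}"
  done
}

# Possible use, with the model-editor calls from steps 2 and 5 in between:
#   model="path/to/en/model"; changes="path_to_file/with/model/changes"
#   hide_context_store "$model"
#   sh target/bin/model-editor file-update all "$model" "$changes"
#   hide_other_stores "$model"
#   sh target/bin/model-editor file-update context-only "$model" "${changes}_just_context"
#   restore_stores "$model"
```

Working on a copy of the model folder is a sensible precaution while trying this out.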

steps 1-4 could be applied while ignoring 5 and 6 when:

steps 5-6 could be applied while ignoring 1-4 when:

Important:

Using the scala console

The best way to play with the models and modify them is to use the Scala console.

Starting a scala console

Playing with the models

Once you start a Scala console, you can use it like IPython to create instances of the Scala classes we have, to load the models, check if DBpedia IDs exist, add new DBpedia IDs, add new surface forms, etc.

Run: :cp pathTo/ModelEditor.jar

This will load the classes inside the model editor. After that you should be able to play with the classes inside the jar.

Example:

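// Load the model from its folder (here, the Turkish model)
val spotlightModel = org.idio.dbpedia.spotlight.Main.getSpotlightModel("/Users/dav009/Downloads/tr/model/")
// Print a sample of surface forms, then the stats for one of them
spotlightModel.showSomeSurfaceForms()
spotlightModel.getStatsForSurfaceForm("evrimleri")
// Look up a DBpedia resource, then add a new surface form for it
spotlightModel.searchForDBpediaResource("ikimono_gakari_dbpedia_uri")
spotlightModel.addNew("ikimono_gakari_sf","ikimono_gakari_dbpedia_uri",1,Array())
// Write the modified model to a new folder
spotlightModel.exportModels("/new/path/of/folder/model/")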

tools/explore.scala contains a script which can be preloaded into the scala terminal. It imports the classes and stores needed to play with the model at a low level. In order to use it:

  1. Run JAVA_OPTS="-Xmx9000M -Xms9000M" scala (note: adjust the Java heap options to your needs; if you are using all the stores, use around 15g)

  2. Once you are in the Scala console, run :load tools/explore.scala to preload the objects:

    • resStore: resource store
    • sfStore: surface form store
    • candidateMap: candidate store
    • tokenStore: token type store
    • contextStore: context token store

Join idio!

If you are interested in Knowledge Mining, NLP or Software Engineering you should take a look at our jobs page. We're always on the lookout for awesome people to join our team.

License

Copyright 2014 Idio

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0