Semanticizer, standalone

Semanticizest is a package for entity linking, also known as semantic linking or semanticizing: you feed it text, and it outputs links to pertinent Wikipedia concepts. You can use these links as a "semantic representation" of the text for NLP or machine learning, or simply to point readers to background information on Wikipedia.

Installation

Usage

To train a semanticizer, download a Wikipedia database dump from https://dumps.wikimedia.org/. Then issue::

    python -m semanticizest.parse_wikidump <dump> <model-filename>

The result will be a semanticizer model (in SQLite 3 format, if you must know).
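
Because the model is an ordinary SQLite 3 file, it can be inspected with Python's standard ``sqlite3`` module. The sketch below only lists the tables it finds and assumes nothing about the schema; the helper name ``list_tables`` and the file name ``sco.model`` are illustrative and not part of semanticizest.

```python
import sqlite3


def list_tables(model_path):
    """Return the sorted names of all tables in the SQLite file at model_path.

    Note: sqlite3.connect() creates an empty database if the file does not
    exist, so check the path first if that matters.
    """
    conn = sqlite3.connect(model_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Hypothetical usage, assuming a model file was built as described above:
#     print(list_tables("sco.model"))
```

This is meant for a quick sanity check that model construction produced a non-empty database, not as a substitute for the package's own API.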

Alternatively, you can use the ``--download`` flag to have semanticizest download the latest Wikipedia dump itself. For example, the following command downloads and processes the `Scottish Wikipedia`_ (which is small and useful for testing)::

    python -m semanticizest.parse_wikidump --download scowiki sco.model

It downloads https://dumps.wikimedia.org/scowiki/latest/scowiki-latest-pages-articles.xml.bz2 to scowiki.xml.bz2 and constructs the model from it.
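
The URL in this example follows a regular pattern based on the wiki name. As a sketch of that pattern only (the function name ``latest_dump_url`` is hypothetical, and this is not necessarily how semanticizest builds the URL internally):

```python
def latest_dump_url(wikiname):
    """Build the URL of the latest pages-articles dump for a wiki such as 'scowiki'."""
    return (
        "https://dumps.wikimedia.org/{0}/latest/"
        "{0}-latest-pages-articles.xml.bz2".format(wikiname)
    )


print(latest_dump_url("scowiki"))
# https://dumps.wikimedia.org/scowiki/latest/scowiki-latest-pages-articles.xml.bz2
```

Substituting another wiki name (for example ``nlwiki``) gives the corresponding dump location, assuming that wiki publishes its dumps under the same layout.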

Documentation

Full documentation can be found at https://semanticize.github.io/semanticizest/

Copyright and license

Copyright 2014 University of Amsterdam/Netherlands eScience Center. The semanticizest package is licensed under the `Apache License, Version 2.0`_. See the file LICENSE for details.

.. _Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0.html
.. _Scottish Wikipedia: https://sco.wikipedia.org