ixa-pipe-pos is a multilingual Part of Speech tagger and Lemmatizer, currently offering pre-trained models for eight languages: Basque, Dutch, English, French, Galician, German, Italian, and Spanish. ixa-pipe-pos is part of IXA pipes, a multilingual set of NLP tools developed by the IXA NLP Group [http://ixa2.si.ehu.es/ixa-pipes]. Current version is 1.5.2.
Please go to [http://ixa2.si.ehu.es/ixa-pipes] for general information about the IXA pipes tools but also for official releases, including source code and binary packages for all the tools in the IXA pipes toolkit.
This document is intended to be the usage guide of ixa-pipe-pos. If you really need to clone and install this repository instead of using the releases provided in [http://ixa2.si.ehu.es/ixa-pipes], please scroll down to the end of the document for the installation instructions.
ixa-pipe-pos provides statistical POS tagging and lemmatization for several languages. We provide Perceptron (Collins 2002) and Maximum Entropy (Ratnaparkhi 1999) POS tagging and lemmatization models trained on the following data for each language:
To avoid duplication of efforts, we use and contribute to the machine learning API provided by the Apache OpenNLP project. Additionally, we have added other features such as dictionary-based lemmatization, multiword and clitic pronoun treatment, post-processing via tag dictionaries, etc., as described below.
ixa-pipe-pos is distributed under Apache License version 2.0 (see LICENSE.txt for details).
Remember that for Galician and Spanish the output of the statistical models can be post-processed using the monosemic dictionaries provided via the --dictag CLI option.
We provide some dictionaries to modify the output of the statistical tagger and lemmatizer. To use them, please get and unpack the contents of this tarball in the src/main/resources/ directory inside ixa-pipe-pos before compilation:
lemmatizer-dicts.tar.gz package. Note that the dictionaries come with their own licences; please comply with them:
Lemmatizer Dictionaries: "word\tablemma\tabpostag" dictionaries binarized as Finite State Automata using the morfologik-stemming project:
english.dict, galician.dict, spanish.dict. Via API you can also pass a plain text dictionary of the same tabulated format.
Multiword Dictionaries: "multi#word\tabmulti#lemma\tabpostag\tabambiguity" dictionaries to detect multiword expressions. Currently available:
es-locutions.dict for Spanish and gl-locutions.dict for Galician.
Monosemic Tag Dictionaries: the monosemic versions of the lemmatizer dictionaries. These are used for post-processing the results of the POS tagger when the option --dictag is activated in the CLI. Currently available:
spanish-monosemic.dict, galician-monosemic.dict.
To use them, download the package and untar it into the src/main/resources directory before compilation.
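The tab-separated layouts described above can be checked before a plain text dictionary is passed in. The following is only a minimal sketch: the file names are temporary, the entries are made up for illustration, and `TAG` and `AMB` are placeholders rather than real values from the distributed dictionaries.

```shell
# Illustrative entries only; they are not taken from the distributed files.
lemma_dict=$(mktemp)
mw_dict=$(mktemp)

# Lemmatizer dictionary layout: word<TAB>lemma<TAB>postag
printf 'houses\thouse\tNNS\n' > "$lemma_dict"
printf 'went\tgo\tVBD\n' >> "$lemma_dict"

# Multiword dictionary layout: multi#word<TAB>multi#lemma<TAB>postag<TAB>ambiguity
# TAG and AMB are placeholders for the real tag and ambiguity values.
printf 'multi#word\tmulti#lemma\tTAG\tAMB\n' > "$mw_dict"

# check_dict FILE N: report every line that does not have exactly N
# tab-separated fields; exit status is non-zero if any line is malformed.
check_dict() {
  awk -F'\t' -v want="$2" '
    NF != want { printf "line %d: expected %d fields, got %d\n", NR, want, NF; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

check_dict "$lemma_dict" 3 && echo "lemmatizer dictionary: format OK"
check_dict "$mw_dict" 4 && echo "multiword dictionary: format OK"
```

This check only looks at the number of columns; it does not validate tags or binarize anything.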
ixa-pipe-pos provides the following functionalities:
Each of these functionalities is accessible by adding (tag|train|eval|cross|server|client) as a subcommand to ixa-pipe-pos-$version.jar. Please read below and check the -help parameter ($version refers to the current ixa-pipe-pos version).
java -jar target/ixa-pipe-pos-1.5.2-exec.jar (tag|train|eval|cross|server|client) -help
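To look at the help of every subcommand in one go, a small loop can be used. This is a sketch that assumes the jar sits at the path used in the command above; `JAR` and `help_cmd` are names introduced here for illustration. If the jar has not been built yet, the loop only prints the commands it would run.

```shell
JAR=target/ixa-pipe-pos-1.5.2-exec.jar

# Build the help command line for one subcommand.
help_cmd() { printf 'java -jar %s %s -help\n' "$JAR" "$1"; }

for sub in tag train eval cross server client; do
  if [ -f "$JAR" ]; then
    # Run the real help; ignore its exit status so the loop continues.
    java -jar "$JAR" "$sub" -help || true
  else
    # Jar not built yet: just show what would be executed.
    help_cmd "$sub"
  fi
done
```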
If you are in a hurry, download or create a plain text file and use it like this:
cat guardian.txt | java -jar ixa-pipe-tok-1.8.5-exec.jar tok -l en | java -jar ixa-pipe-pos-1.5.2-exec.jar tag -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin
If you want to know more, please keep reading.
ixa-pipe-pos reads NAF documents containing wf elements via standard input and outputs NAF through standard output. The NAF format specification is here:
(http://wordpress.let.vupr.nl/naf/)
You can get the necessary input for ixa-pipe-pos by piping the output of ixa-pipe-tok into it.
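Before piping a document into the tagger it can help to confirm that it really contains wf elements. The sketch below uses a hand-written, simplified NAF fragment; the attribute names and the version string are assumptions based on typical tokenizer output, and real documents also carry a header. `count_wf` is a helper introduced here for illustration.

```shell
# Hand-written, simplified NAF fragment for illustration only.
naf=$(mktemp)
cat > "$naf" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<NAF xml:lang="en" version="v1.naf">
  <text>
    <wf id="w1" sent="1" para="1" offset="0" length="5">Hello</wf>
    <wf id="w2" sent="1" para="1" offset="6" length="5">world</wf>
  </text>
</NAF>
EOF

# Count the lines that contain a wf element (one per line in this sample).
count_wf() { grep -c '<wf ' "$1"; }

echo "wf elements: $(count_wf "$naf")"
```

A count of zero means the document has no text layer yet and should go through ixa-pipe-tok first.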
There are several options to tag with ixa-pipe-pos:
Tagging Example:
Download or create a plain text file and use it like this:
cat guardian.txt | java -jar ixa-pipe-tok-1.8.5-exec.jar tok -l en | java -jar ixa-pipe-pos-1.5.2-exec.jar tag -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin
Remember to download the models from the distributed packages first.
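The same pipeline can be applied to a whole directory of plain text files. This is a minimal sketch: the jar and model names are copied from the example above, while `naf_name`, `tag_dir` and the `.naf` output extension are choices made here for illustration.

```shell
TOK_JAR=ixa-pipe-tok-1.8.5-exec.jar
POS_JAR=ixa-pipe-pos-1.5.2-exec.jar

# Map an input file name to an output name: foo.txt becomes foo.naf.
naf_name() { printf '%s.naf\n' "${1%.txt}"; }

# tag_dir DIR: tokenize and tag every .txt file in DIR, one output per input.
tag_dir() {
  for f in "$1"/*.txt; do
    [ -e "$f" ] || continue   # the glob matched nothing
    cat "$f" \
      | java -jar "$TOK_JAR" tok -l en \
      | java -jar "$POS_JAR" tag -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin \
      > "$(naf_name "$f")"
  done
}

# Usage (assumes both jars and both models are in the current directory):
#   tag_dir ./articles
```

Each call starts a new JVM and reloads the models, so for many small files the server mode may be a better fit.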
We can start the TCP server as follows:
java -jar target/ixa-pipe-pos-1.5.2-exec.jar server -l en --port 2040 -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin
Once the server is running we can send NAF documents containing (at least) the text layer like this:
cat guardian.txt | java -jar ixa-pipe-tok-1.8.5-exec.jar tok -l en | java -jar ixa-pipe-pos-1.5.2-exec.jar client -p 2040
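The two commands above can be combined into a single round trip for quick testing. This is a sketch under stated assumptions: the fixed pause before contacting the server is a guess at how long the models take to load, and `run_roundtrip`, `POS_JAR`, `TOK_JAR` and `PORT` are names introduced here.

```shell
POS_JAR=target/ixa-pipe-pos-1.5.2-exec.jar
TOK_JAR=ixa-pipe-tok-1.8.5-exec.jar
PORT=2040

# run_roundtrip FILE: start the server, send FILE through the client, stop the server.
run_roundtrip() {
  input=$1
  if [ ! -f "$POS_JAR" ]; then
    echo "jar not found: $POS_JAR" >&2
    return 1
  fi
  java -jar "$POS_JAR" server -l en --port "$PORT" \
    -m en-pos-perceptron-autodict01-conll09.bin \
    -lm en-lemma-perceptron-conll09.bin &
  server_pid=$!
  # Assumption: a fixed pause is enough for the models to load.
  sleep 10
  cat "$input" | java -jar "$TOK_JAR" tok -l en | java -jar "$POS_JAR" client -p "$PORT"
  status=$?
  kill "$server_pid"
  return "$status"
}

# Usage:
#   run_roundtrip guardian.txt > guardian.naf
```

In regular use the server would normally be left running and only the client command repeated.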
To train a new model, you just need to pass a training parameters file as an argument. Every training option is documented in the template trainParams.properties file.
Example:
java -jar target/ixa-pipe-pos-$version-exec.jar train -p trainParams.properties
To evaluate a trained model, the eval subcommand provides the following options:
Example:
java -jar target/ixa-pipe-pos-$version-exec.jar eval -c pos -m test-pos.bin -l en -t test.data
The easiest way to use ixa-pipe-pos programmatically is via Apache Maven. Add this dependency to your pom.xml:
<dependency>
<groupId>eus.ixa</groupId>
<artifactId>ixa-pipe-pos</artifactId>
<version>1.5.2</version>
</dependency>
The javadoc of the module is located here:
ixa-pipe-pos/target/ixa-pipe-pos-$version-javadoc.jar
The contents of the module are the following:
+ formatter.xml Apache OpenNLP code formatter for Eclipse SDK
+ pom.xml maven pom file which deals with everything related to compilation and execution of the module
+ src/ java source code of the module and required resources
+ trainParams.properties A template properties file containing documentation
Furthermore, the installation process, as described in the README.md, will generate another directory:
+ target/ contains the binary executable and other directories
Installing the ixa-pipe-pos requires the following steps:
If you already have Java 1.8+ and Maven 3 installed on your machine, please go to step 3 directly. Otherwise, follow these steps:
If you do not install JDK 1.7+ in a default location, you will probably need to configure the PATH in .bashrc or .bash_profile:
export JAVA_HOME=$pwd/java8
export PATH=${JAVA_HOME}/bin:${PATH}
Replace $pwd with the full path printed by running pwd inside the java directory.
If you use tcsh you will need to specify it in your .login as follows:
setenv JAVA_HOME $pwd/java8
setenv PATH ${JAVA_HOME}/bin:${PATH}
If you log in to your shell again and run the command
java -version
you should now see that your JDK is 1.7+.
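The version check above can also be scripted. The sketch below handles both the old "1.x" numbering and the newer scheme in which the major version comes first; `java_major` and `MIN_JAVA` are names introduced here, and the minimum of 7 follows the "1.7+" wording (raise it to 8 if you follow the "1.8+" requirement instead).

```shell
MIN_JAVA=7

# java_major VERSION: print the major version for strings such as
# 1.8.0_292 (old scheme, major is the second component) or 11.0.2.
java_major() {
  v=$1
  case "$v" in
    1.*) v=${v#1.}; echo "${v%%.*}" ;;
    *)   echo "${v%%[.-]*}" ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the first quoted string.
  raw=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
  major=$(java_major "$raw")
  if [ -n "$major" ] && [ "$major" -ge "$MIN_JAVA" ]; then
    echo "Java $raw is recent enough"
  else
    echo "Java ${raw:-unknown} may be too old; check JAVA_HOME and PATH"
  fi
else
  echo "java not found on PATH"
fi
```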
Download Maven 3, for example with:
wget http://apache.rediris.es/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
Now you need to configure the PATH. For Bash Shell:
export MAVEN_HOME=$pwd/apache-maven-3.0.5
export PATH=${MAVEN_HOME}/bin:${PATH}
Replace $pwd with the full path printed by running pwd inside the apache maven directory.
For tcsh shell:
setenv MAVEN3_HOME $pwd/apache-maven-3.0.5
setenv PATH ${MAVEN3_HOME}/bin:${PATH}
If you log in to your shell again and run the command
mvn -version
you should see a reference to the Maven version you have just installed, plus the JDK that it is using.
If you must get the module source code from here do this:
git clone https://github.com/ixa-ehu/ixa-pipe-pos
Download the POS tagging and lemmatization models:
Additionally, we distribute dictionaries to correct the output of the statistical lemmatization. To use them, you will need to download the resources and copy them to ixa-pipe-pos/src/main/resources/ before compilation so that the module can use them:
Download the resources and untar the archive into the src/main/resources directory:
cd ixa-pipe-pos/src/main/resources
wget http://ixa2.si.ehu.es/ixa-pipes/models/lemmatizer-dicts.tar.gz
tar xvzf lemmatizer-dicts.tar.gz
The lemmatizer-dicts archive contains the dictionaries used to help the statistical lemmatization.
cd ixa-pipe-pos
mvn clean package
This step will create a directory called target/ which contains various directories and files. Most importantly, there you will find the module executable:
ixa-pipe-pos-$version-exec.jar
This executable contains every dependency the module needs, so it is completely portable as long as you have a JVM 1.7 or newer installed.
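After the build, a quick check that the executable jar was actually produced can save some confusion. This is a minimal sketch; `find_exec_jar` is a helper introduced here, and it only relies on the ixa-pipe-pos-$version-exec.jar naming described above.

```shell
# find_exec_jar DIR: print the first ixa-pipe-pos-*-exec.jar found in DIR,
# or return non-zero if there is none.
find_exec_jar() {
  for jar in "$1"/ixa-pipe-pos-*-exec.jar; do
    [ -f "$jar" ] && { echo "$jar"; return 0; }
  done
  return 1
}

if jar=$(find_exec_jar target); then
  echo "built: $jar"
else
  echo "no exec jar under target/; run mvn clean package first"
fi
```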
To install the module in the local maven repository, usually located in ~/.m2/, execute:
mvn clean install
To add your language to ixa-pipe-pos the following steps are required:
Rodrigo Agerri
IXA NLP Group
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián
rodrigo.agerri@ehu.eus