ixa-pipe-nerc is a multilingual Named Entity tagger developed within the IXA pipes tools [http://ixa2.si.ehu.es/ixa-pipes]. The current version is 2.0.0.
Please cite this paper if you use the tagger:
R. Agerri, G. Rigau, Robust multilingual Named Entity Recognition with shallow semi-supervised features. Artificial Intelligence, 238 (2016) 63-82. (http://dx.doi.org/10.1016/j.artint.2016.05.003)
Please go to (http://ixa2.si.ehu.es/ixa-pipes) for general information about the IXA pipes tools but also for official releases, including source code and binary packages for all the tools in the IXA pipes toolkit.
This document is intended to be the usage guide of ixa-pipe-nerc. If you really need to clone and install this repository instead of using the releases provided in [http://ixa2.si.ehu.es/ixa-pipes], please scroll down to the end of the document for the installation instructions.
ixa-pipe-nerc is in Maven Central for easy access to its API.
ixa-pipe-nerc provides models for Named Entity Recognition for Basque, Dutch, English, Galician, German, Italian and Spanish. The named entity types are based on:
We provide competitive models based on robust local features and exploiting unlabeled data via clustering features. The clustering features are based on Brown, Clark (2003) and Word2Vec clustering plus some gazetteers in some cases. To avoid duplication of efforts, we use and contribute to the API provided by the Apache OpenNLP project with our own custom developed features for each of the three tasks.
These models are to be used with the official IXA pipes 1.1.1 distribution.
Reproducing results with conlleval: Every result reported in Agerri and Rigau (2016) can be reproduced with the conlleval script using the conlleval-results scripts and the ixa-pipe-nerc contained in the IXA pipes 1.1.1 distribution.
NERC models:
Every model is trained with the averaged Perceptron algorithm as described in (Collins 2002) and as implemented in Apache OpenNLP.
Basque: eu-clusters model, trained on egunkaria dataset, F1 76.72 on 3 class evaluation and F1 75.40 on 4 classes.
English Models:
CoNLL 2003 models: We distribute models trained with local features and with external knowledge. Each successive model improves in F1 (reported on the testb data) but is somewhat slower:
CoNLL 2003 local + brown features: F1 88.50
CoNLL 2003 local + clark features: F1 88.97
CoNLL 2003 clusters + dicts: F1 91.36
Combined models: trained using OntoNotes 4.0, CoNLL 2003 and MUC 7 data; a good choice for out-of-domain usage.
Spanish Models:
Dutch Models:
German Models:
Italian Models:
ixa-pipe-nerc provides a runnable jar with the following basic command-line functionalities:
Each of these functionalities is accessible by adding (server|client|tag) as a subcommand to ixa-pipe-nerc-${version}-exec.jar. Please read below and check the -help parameter:
java -jar target/ixa-pipe-nerc-${version}-exec.jar (tag|server|client) -help
If you are in a hurry, just execute:
cat file.txt | java -jar target/ixa-pipe-tok-${version}-exec.jar tok -l en | java -jar ixa-pipe-pos-1.5.0-exec.jar tag -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin | java -jar target/ixa-pipe-nerc-${version}-exec.jar tag -m model.bin
If you want to know more, please keep reading.
ixa-pipe-nerc reads NAF documents (with wf and term elements) via standard input and outputs NAF through standard output. The NAF format specification is here:
(http://wordpress.let.vupr.nl/naf/)
You can get the necessary input for ixa-pipe-nerc by piping ixa-pipe-tok and ixa-pipe-pos as shown in the example.
There are several options to tag with ixa-pipe-nerc:
Example:
cat file.txt | java -jar target/ixa-pipe-tok-${version}-exec.jar tok -l en | java -jar ixa-pipe-pos-1.5.0-exec.jar tag -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin | java -jar target/ixa-pipe-nerc-${version}-exec.jar tag -m nerc-models-${version}/en/en-local-conll03.bin
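When more than one file has to be tagged, the same tok, pos and nerc pipeline can be wrapped in a small shell function. The following is only a sketch: the variable names TOK_JAR, POS_JAR, POS_MODEL, LEMMA_MODEL, NERC_JAR and NERC_MODEL, the function names and the .naf output naming are assumptions made for illustration, not part of ixa-pipe-nerc. The subcommands and flags are the ones used in the example above.

```shell
# Tokenize, POS-tag and NER-tag plain text read from stdin; NAF goes to stdout.
# TOK_JAR, POS_JAR, POS_MODEL, LEMMA_MODEL, NERC_JAR and NERC_MODEL are
# hypothetical variable names that must point at your own jars and models.
tag_text() {
  java -jar "$TOK_JAR" tok -l en |
    java -jar "$POS_JAR" tag -m "$POS_MODEL" -lm "$LEMMA_MODEL" |
    java -jar "$NERC_JAR" tag -m "$NERC_MODEL"
}

# Tag every .txt file in a directory, writing NAME.naf next to NAME.txt.
tag_dir() {
  local dir=$1 f
  for f in "$dir"/*.txt; do
    [ -f "$f" ] || continue          # skip the unexpanded glob if no file matches
    tag_text < "$f" > "${f%.txt}.naf"
  done
}
```

After exporting the six variables, a call such as tag_dir corpus/ (corpus/ being a hypothetical directory) would tag each text file in turn. Note that every call starts three JVMs and reloads the models, so for many small files the server mode may be the better fit.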
We can start the TCP server as follows:
java -jar target/ixa-pipe-nerc-${version}-exec.jar server -l en --port 2060 -m en-model-conll03.bin
Once the server is running we can send NAF documents containing (at least) the term layer like this:
cat file.pos.naf | java -jar target/ixa-pipe-nerc-${version}-exec.jar client -p 2060
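When the server and the client are started from the same script, the client should not run before the server is accepting connections, since the model has to be loaded first. A minimal sketch of one way to handle this, assuming bash (it relies on bash's /dev/tcp redirection): wait_for_port, NERC_JAR and the output name file.ner.naf are hypothetical names introduced here, while the server and client commands are those shown above.

```shell
# Return 0 once HOST:PORT accepts TCP connections, or 1 after TIMEOUT seconds.
# Uses bash's /dev/tcp pseudo-device, so it needs bash rather than plain sh.
wait_for_port() {
  local host=$1 port=$2 timeout=${3:-60} waited=0
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}

# NERC_JAR is a hypothetical variable holding the path to the exec jar;
# nothing is started unless it points at an existing file.
if [ -f "${NERC_JAR:-}" ]; then
  java -jar "$NERC_JAR" server -l en --port 2060 -m en-model-conll03.bin &
  server_pid=$!
  if wait_for_port localhost 2060 120; then
    java -jar "$NERC_JAR" client -p 2060 < file.pos.naf > file.ner.naf
  else
    echo "server did not come up on port 2060" >&2
  fi
  kill "$server_pid"
fi
```

In a long-running setup the server would normally be left running instead of being killed after a single request.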
The easiest way to use ixa-pipe-nerc programmatically is via Apache Maven. Add this dependency to your pom.xml:
<dependency>
<groupId>eus.ixa</groupId>
<artifactId>ixa-pipe-nerc</artifactId>
<version>1.6.0</version>
</dependency>
The javadoc of the module is located here:
ixa-pipe-nerc/target/ixa-pipe-nerc-$version-javadoc.jar
The contents of the module are the following:
+ formatter.xml Apache OpenNLP code formatter for Eclipse SDK
+ pom.xml maven pom file which deals with everything related to compilation and execution of the module
+ src/ java source code of the module and required resources
Furthermore, the installation process, as described in the README.md, will generate another directory:
+ target/ contains the binary executable and other directories
Installing the ixa-pipe-nerc requires the following steps:
If you already have Java 1.8+ and MAVEN 3 installed on your machine, please go directly to step 3. Otherwise, follow these steps:
If you do not install JDK 1.8+ in a default location, you will probably need to configure the PATH in .bashrc or .bash_profile:
export JAVA_HOME=/yourpath/local/java8
export PATH=${JAVA_HOME}/bin:${PATH}
If you use tcsh you will need to specify it in your .login as follows:
setenv JAVA_HOME /usr/java/java18
setenv PATH ${JAVA_HOME}/bin:${PATH}
If you log in to your shell again and run the command
java -version
you should now see that your JDK is 1.8 or later.
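The version check can also be scripted. Java 8 and earlier report themselves as "1.x" (for example 1.8.0_292), while later releases report the major number first (for example 11.0.2), so a plain string comparison is not enough. The sketch below normalizes both schemes; the function name java_major_version is a hypothetical helper, not something shipped with the IXA pipes.

```shell
# Extract the major Java version from a version string such as
# "1.8.0_292" (legacy "1.x" scheme) or "11.0.2" (current scheme).
java_major_version() {
  local v=$1
  case "$v" in
    1.*) v=${v#1.} ;;              # drop the legacy "1." prefix
  esac
  printf '%s\n' "${v%%[!0-9]*}"    # keep only the leading digits
}

# Usage: `java -version` prints to stderr; take the first quoted version string.
if command -v java >/dev/null 2>&1; then
  raw=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  major=$(java_major_version "$raw")
  if [ -n "$major" ] && [ "$major" -ge 8 ]; then
    echo "Java $raw is recent enough"
  else
    echo "Java 1.8 or later is required (found: ${raw:-unknown})" >&2
  fi
fi
```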
Download MAVEN 3.3.9+ from
https://maven.apache.org/download.cgi
Now you need to configure the PATH. For Bash Shell:
export MAVEN_HOME=/home/ragerri/local/apache-maven-3.3.9
export PATH=${MAVEN_HOME}/bin:${PATH}
For tcsh shell:
setenv MAVEN3_HOME ~/local/apache-maven-3.3.9
setenv PATH ${MAVEN3_HOME}/bin:${PATH}
If you log in to your shell again and run the command
mvn -version
you should see a reference to the MAVEN version you have just installed, plus the JDK it is using.
If you need to get the module source code from here, run:
git clone https://github.com/ixa-ehu/ixa-pipe-nerc
Execute this command to compile ixa-pipe-nerc:
cd ixa-pipe-nerc
mvn clean package
This step will create a directory called target/ which contains various directories and files. Most importantly, there you will find the module executable:
ixa-pipe-nerc-${version}-exec.jar
This executable contains every dependency the module needs, so it is completely portable as long as you have a JVM 1.8 or later installed.
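Because the version number is part of the jar name, scripts that call the executable can locate it with a glob instead of hard-coding the version. A minimal sketch, in which find_exec_jar is a hypothetical helper name:

```shell
# Print the path of the first ixa-pipe-nerc-*-exec.jar found in the given
# build directory (default: target/); return non-zero if there is none.
find_exec_jar() {
  local dir=${1:-target} jar
  for jar in "$dir"/ixa-pipe-nerc-*-exec.jar; do
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

if jar=$(find_exec_jar target); then
  echo "Found $jar"
  # The jar can now be run, for instance: java -jar "$jar" tag -help
else
  echo "No executable jar in target/; run 'mvn clean package' first" >&2
fi
```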
To install the module in the local maven repository, usually located in ~/.m2/, execute:
mvn clean install
Rodrigo Agerri
IXA NLP Group
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián
rodrigo.agerri@ehu.eus