
HIPE-scorer

The HIPE-scorer is a Python module for evaluating Named Entity Recognition and Classification (NERC) and Entity Linking (EL) systems.

It has been developed and used in the context of the HIPE ('Identifying Historical People, Places and other Entities') shared tasks on NE processing in historical documents (see https://hipe-eval.github.io), with two evaluation campaigns:

Campaign          Website           Data              Evaluation Toolkit     Results
HIPE-2022         HIPE-2022         HIPE-2022-data    HIPE-2022-eval         HIPE 2022 results
CLEF-HIPE-2020    CLEF-HIPE-2020    CLEF-HIPE-2020    CLEF-HIPE-2020-eval    CLEF HIPE 2020 results

Release history
Main functionalities
Installation
CLI usage
Forthcoming
License

Main functionalities

The scorer evaluates at the entity level: entities (most often multi-word expressions) are the reference units, each carrying a specific type as well as a token-based onset and offset. In the case of EL, the reference ID of an entity (or link) is considered as the label.

Metrics

For both NERC and EL, the scorer computes Precision, Recall and F1 scores, with both micro and macro aggregation.

Please note that our definition of the macro scheme differs from the usual one: macro measures are computed as aggregates at document level and not at entity-type level. Specifically, the macro measures average the corresponding micro scores across all documents. This accounts for variance in (historical) document length and entity distribution within documents, instead of overall class imbalances.

Measures are calculated separately by entity type, and cumulatively for all types.
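To make the document-level macro aggregation concrete, here is a minimal Python sketch with made-up per-document counts (an illustration of the averaging described above, not the scorer's actual implementation):

from statistics import mean

def micro_prf(tp, fp, fn):
    # Micro precision/recall/F1 from entity-level counts of a single document.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical (TP, FP, FN) counts for one entity type in three documents.
docs = [(10, 2, 3), (4, 1, 0), (7, 5, 2)]
per_doc = [micro_prf(*counts) for counts in docs]

# Macro scores: average the per-document micro scores (not per-type averages).
macro_p = mean(p for p, _, _ in per_doc)
macro_r = mean(r for _, r, _ in per_doc)
macro_f1 = mean(f1 for _, _, f1 in per_doc)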

Evaluation regimes

There are different evaluation regimes depending on how strictly the correctness of entity types and boundaries is judged. The scorer provides strict and fuzzy evaluation regimes for both NERC and EL, as follows:

NERC

Entity Linking

For both the EL fuzzy and relaxed settings, the number of link predictions taken into account can be adapted, i.e. systems can provide multiple links or QIDs (separated by |). The scorer can evaluate with cutoffs @1, @3 and @5.
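As an illustration with made-up QIDs, a prediction cell such as Q654321|Q321765|Q111111 would be counted as correct at cutoff @3 if the gold link appears among its first three candidates, but not at @1 unless it is ranked first. A minimal Python sketch of that check (not the scorer's actual code):

def correct_at_cutoff(gold_qid, prediction_cell, k):
    # A prediction cell may hold several candidate links separated by "|";
    # only the first k candidates count at cutoff @k.
    candidates = prediction_cell.split("|")[:k]
    return gold_qid in candidates

correct_at_cutoff("Q321765", "Q654321|Q321765|Q111111", 1)  # False at @1
correct_at_cutoff("Q321765", "Q654321|Q321765|Q111111", 3)  # True at @3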

Installation

The scorer requires Python 3; its dependencies are listed in requirements.txt and the module itself needs to be installed as an editable package:

$ python3 -m venv venv
$ source venv/bin/activate
$ pip3 install -r requirements.txt
$ # for development
$ pip3 install -e .

CLI Usage

The input data format is similar to CoNLL-U, with multiple columns recording different annotations per token (when appropriate or needed). Supported tagging schemes are IOB and IOBES.

Below is an example, see also the CLEF-HIPE-2020 and the HIPE-2022 participation guidelines for more details.

TOKEN   NE-COARSE-LIT   NE-COARSE-METO  NE-FINE-LIT NE-FINE-METO    NE-FINE-COMP    NE-NESTED   NEL-LIT NEL-METO    MISC
# hipe2022:document_id = NZZ-1798-01-20-a-p0002
# hipe2022:date = 1798-01-20
# ...
berichtet   O   O   O   O   O   O   _   _   _
der O   O   O   O   O   O   _   _   _
General B-pers  O   B-pers.ind  O   B-comp.title    O   Q321765 _   _
Hutchinson  I-pers  O   I-pers.ind  O   B-comp.name O   Q321765 _   EndOfLine
—   O   O   O   O   O   O   _   _   _

Standard evaluation

To evaluate the predictions of your system, run the following command:

python clef_evaluation.py --ref GOLD.tsv --pred PREDICTIONS.tsv --task TASK --outdir RESULT_FOLDER

The main parameters are --ref, --pred, --task and --outdir (run clef_evaluation.py -h for the full description).
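For instance, a coarse NERC evaluation of a German run could look as follows (file names are placeholders and nerc_coarse is assumed to be one of the accepted task names; check clef_evaluation.py -h for the exact values):

$ # hypothetical run: coarse NERC evaluation of a German submission
$ python clef_evaluation.py --ref HIPE-2022-test-de-GOLD.tsv --pred TEAMNAME_bundle1_de_1.tsv --task nerc_coarse --outdir results/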

Format requirements: The script expects both system response and gold standard files to have a similar structure (same number of columns) as well as similar content (same number of token lines, in the exact same order). Any comment lines starting with a # may be omitted. The script will try to reconstruct the segmentation according to the gold standard automatically. In cases of unresolvable mismatches, the evaluation fails and outputs information about the issue.

Advanced Evaluation

The scorer allows for a detailed evaluation of performance on diachronic and noisy data for NERC and EL.

If you provide more than one of these advanced evaluation options, all possible combinations will be computed.

Output

The evaluation script outputs two files in the provided output folder:

System Evaluation Label P R F1 F1_std P_std R_std TP FP FN
TEAMNAME_TASKBUNDLEID_LANG_RUNNUMBER NE-FINE-COMP-micro-fuzzy ALL

Cells may be empty when they are not defined or would only provide redundant information. The Evaluation column indicates which annotation column was evaluated and under which aggregation and regime the measures P, R, F1, etc. were computed. It has the following structure: COL_NAME-{micro/macro_doc}-{fuzzy/strict}. This schema makes it easy to filter for a particular metric with grep.
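For example, assuming a results file named RESULTS.tsv, the micro fuzzy rows for the NE-COARSE-LIT column can be extracted together with the header line:

$ # keep the header plus all micro fuzzy rows for the NE-COARSE-LIT column
$ grep -E '^System|NE-COARSE-LIT-micro-fuzzy' RESULTS.tsv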

Evaluation regimes (according to the script's internal naming):

Forthcoming:

Contributors

The very first version of the HIPE scorer was inspired by David Batista's NER-Evaluation module (see also this blog post).

License

The HIPE-scorer is licensed under the MIT License - see the license file for details.