Dragonfly is a tool for performing named entity annotations. It has been designed to especially support and accelerate annotating low resource languages by non-speakers.
This requires python3 for unicode support. Some operating systems make python and pip available as pip3 and python3.
Copyright 2017-2019 Johns Hopkins University Applied Physics Laboratory
Licensed under the Apache License, Version 2.0
pip3 install -r requirements.txt
To annotate a directory of tsv files:
python3 annotate.py [lang] [data dir]
lang
is the ISO 639 language code (recommend 639-3).
A complete list is available on the SIL website.
By default, annotations are saved in a sub-directory called annotations where the input files are located. To change this:
python3 annotate.py -o [annotation dir] [lang] [data dir]
Once the server is running, direct your browser to http://localhost:5000/ (Additional instances can be run concurrently by selecting a different port with the -p option.)
Dragonfly creates a .dragonfly
directory in the user's home directory.
It saves settings and translation dictionaries to this directory.
The tag types can be specified on the command line using the -t option:
python3 annotate.py -t PER,ORG,LOC,MISC [lang] [data_dir]
The default tag set is PER, ORG, GPE, LOC in support of LoReHLT. O is the tag applied to tokens that are outside of named entities.
Dragonfly expects tab separated value files which we will call CoNLL format. The first column are the tokens to be annotated. Additional columns provide context for the tokens and could include transliterations, translations, or other semantic information. Sentences should be followed by an empty line.
If the first column of the first row has a value of 'TOKEN', the first row is treated as a header.
The output is a two column tsv file of token and tag.
The port of the server is selected with the -p
option:
python3 annotate.py --port 8888 [lang] [data_dir]
To run more than one site on the same port, configure a reverse proxy server with each Dragonfly instance responding to a different URL prefix:
python3 annotate.py --prefix deu [lang] [data_dir]
The cascade will tag all matching tokens in the document unless that token has been previously tagged.
Toggle the cascade through its checkbox in the navigation bar or by pressing 'c'. To persist the cascade setting between documents, use the Settings dialog.
Multi-token tags cannot span sentences.
To copy tokens to your clipboard, press 's' to enter select mode. Then click on the first and the last token to copy in a single sentence. Clicking the same token twice copies that token.
The annotation tool supports user-maintained translation dictionaries. Language-specific files are used to highlight tokens that have been translated. Only single tokens may be translated at a time.
The tools window provides the capability to import and export translation dictionaries. The json object is a dictionary with tokens as the keys and a list of translation and type as the value. Importing a dictionary will not overwrite anything that you have already translated.
Annotation hints can be configured in the tools dialog. Hints are represented as two columns with the first column being regular expressions and the second column is text that explains the hint. The row for the hints is configured in the settings. An example from Amharic would be:
^ba prefix that represents a preposition 'in' or 'on'
Anything that matches the hint is highlighted. The regular expressions are processed from top to bottom and only the first match is used.
When annotating the file x.txt
, if there is a file in the same directory with the name x.txt.eng
,
it is loaded and displayed above the token row. The file should be a sentence segmented file with one
sentence per line that matches the divisions in the CoNLL formatted input file.
The tool will suggest words with a light brown background that you have been frequently tagging. To customize your suggestions, add an additional column to the input files and tell the tool to use a particular column:
python3 annotate.py --suggest ENTROPY [lang] [data_dir]
Settings are persisted between annotation sessions. Most take effect immediately upon a change. The exception is column width which requires a page reload.
The behavior of Dragonfly can be changed by passing mode information.
To view the progress of an annotator, view the URL they are using with the URL parameter
view=true
set. The ability to save is removed and if the site is configured for auto save
you can reload the page to see their annotations.
Passing the --rtl
flag on the command line, changes the display to Right To Left.
Setting --simple
on the command line removes parts of the UI to help prevent distracting new annotators.
Finally, the --debug
flag is useful during development or working through errors.
Once you have annotations from more than one annotator on the same set of files, you may want to compare and adjudicate among them.
python3 adjudicate.py -o [output dir] [lang] [data dir] [annotation dir1] [annotation dir2] ...
The default tags come from the first annotation directory specified on the command line. The adjudicated annotations are stored to the output directory (a required argument). The UI works the same as annotating.
To run the unit tests:
nosetests tests/python