Looking for taxon mentions in text? Ask TaxoNERD
I would be happy to hear about your use of TaxoNERD: what is your use case? How did TaxoNERD help you? What could make TaxoNERD even more helpful? Please feel free to drop me an email (nicolas[dot]leguillarme[at]univ-grenoble-alpes[dot]fr) or to open an issue.
Le Guillarme, N., & Thuiller, W. (2022). TaxoNERD: deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature. Methods in Ecology and Evolution, 13(3), 625-641.
TaxoNERD is a domain-specific tool for recognizing taxon mentions in the biodiversity literature.
:tada: It is now possible to use custom taxonomies for entity linking! Check out our example notebook.
taxonerd ask -m en_ner_eco_biobert_weak -f ./tests/test_data/test_jpg/test.jpg
T0 LIVB 180 192 Harbour seal
T1 LIVB 194 208 Phoca vitulina
T2 LIVB 361 375 Pacific salmon
T3 LIVB 377 394 Oncorhynchus spp.
T4 LIVB 455 467 harbour seal
T5 LIVB 793 805 harbour seal
T6 LIVB 1127 1133 fishes
T7 LIVB 1137 1148 cephalopods
Model | Description | Install URL |
---|---|---|
en_ner_eco_md | A spaCy NER model with 50k word vectors (taken from en_core_sci_md), fine-tuned on an ecological corpus. | Download |
en_ner_eco_biobert | A spaCy NER model with dmis-lab/biobert-v1.1 as the transformer model, fine-tuned on an ecological corpus. | Download |
en_ner_eco_md_weak | A spaCy NER model with 50k word vectors (taken from en_core_sci_md), fine-tuned on a silver standard corpus (for improved performance on vernacular names). | Download |
en_ner_eco_biobert_weak | A spaCy NER model with dmis-lab/biobert-v1.1 as the transformer model, fine-tuned on a silver standard corpus (for improved performance on vernacular names). | Download |
If you have access to a GPU, we recommend using one of the biobert models as they tend to be more accurate than the md models.
The en_ner_eco_md_weak and en_ner_eco_biobert_weak models have been fine-tuned on a silver standard corpus generated using weak supervision, so they have been trained on much more (but noisier) data than their gold standard counterparts. As a result, they tend to have better recall, especially for common names. Their precision also appears to be high, but their performance has not yet been rigorously evaluated.
If you do not trust weakly supervised data and are not particularly interested in detecting common names, en_ner_eco_md and en_ner_eco_biobert are for you. These models have been fine-tuned on a gold standard corpus (a combination of COPIOUS, Bacteria Biotope 2019, and BiodivNERE) and their performance has been benchmarked in our paper.
Installing the package from pip will automatically install all dependencies, including pandas, spaCy, scispaCy and textract. Make sure you install this package before you install the models. Also note that this package requires Python 3.10 and spaCy v3.7.
$ pip install taxonerd
For GPU support, find your CUDA version using nvcc --version and add the version in brackets, e.g. pip install taxonerd[cuda12x] for CUDA 12.1. See setup.cfg for supported CUDA versions.
To download the models:
$ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md-1.1.0.tar.gz
$ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert-1.1.0.tar.gz
$ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md_weak-1.1.0.tar.gz
$ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert_weak-1.1.0.tar.gz
Entity linker files are downloaded and cached the first time the linker is used. This may take some time, but it only needs to be done once. Currently (v1.5.4), three linkers are supported, including gbif_backbone and taxref, which are used in the examples below.
> install.packages("https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/taxonerd_for_R_1.5.4.tar.gz", repos=NULL)
> vignette("taxonerd") # See vignette for more information on how to install and use TaxoNERD for R
TaxoNERD can be used as a command-line tool, as a Python module, or as a spaCy pipeline.
$ taxonerd ask --help
Usage: taxonerd ask [OPTIONS] [INPUT_TEXT]
Options:
-m, --model TEXT A TaxoNERD model [default = en_ner_eco_md]
-i, --input-dir TEXT Input directory
-o, --output-dir TEXT Output directory
-f, --filename TEXT Input text file
-a, --with-abbrev Add abbreviation detector to the pipeline
-s, --with-sentence Add sentence segmenter to the pipeline
-l, --link-to TEXT Add entity linker to the pipeline
-t, --thresh FLOAT Similarity threshold for entity linking [default =
0.7]
--prefer-gpu Use GPU if available
-v, --verbose Verbose mode
--help Show this message and exit.
$ taxonerd ask -m en_ner_eco_biobert "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0 LIVB 0 11 Brown bears
T1 LIVB 13 25 Ursus arctos
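The two numbers after the label appear to be character offsets into the input, with an exclusive end. A minimal sketch that checks this against the output above (the variable names are illustrative and not part of TaxoNERD):

```python
# Input sentence and the (start, end, mention) triples copied from the output above.
text = (
    "Brown bears (Ursus arctos), which are widely distributed throughout "
    "the northern hemisphere, are recognised as opportunistic omnivores"
)
annotations = [(0, 11, "Brown bears"), (13, 25, "Ursus arctos")]

# Slicing the input with the reported offsets should give back each mention.
recovered = [text[start:end] for start, end, _ in annotations]
print(recovered)
```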
$ taxonerd ask -m en_ner_eco_biobert -l gbif_backbone "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0 LIVB 0 11 Brown bears [('GBIF:2433433', 'Brown Bear', 0.8313919901847839)]
T1 LIVB 13 25 Ursus arctos [('GBIF:2433433', 'Ursus arctos', 1.0)]
$ taxonerd ask -m en_ner_eco_biobert -l gbif_backbone -t 0.85 "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0 LIVB 13 25 Ursus arctos [('GBIF:2433433', 'Ursus arctos', 1.0)]
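Judging from the two runs above, raising the threshold to 0.85 drops the "Brown bears" mention altogether, since its only candidate scores about 0.83. The sketch below reproduces that behaviour in plain Python. It is an assumption inferred from the printed output, not TaxoNERD's actual implementation; `apply_threshold` is a hypothetical helper, and whether the real comparison is inclusive is not confirmed here.

```python
def apply_threshold(entities, thresh=0.7):
    """Keep candidates scoring at least `thresh`; drop mentions left with none."""
    kept = []
    for mention, candidates in entities:
        good = [c for c in candidates if c[2] >= thresh]
        if good:  # a mention with no surviving candidate is discarded (assumed)
            kept.append((mention, good))
    return kept


# Mentions and candidates copied from the gbif_backbone output above.
entities = [
    ("Brown bears", [("GBIF:2433433", "Brown Bear", 0.8313919901847839)]),
    ("Ursus arctos", [("GBIF:2433433", "Ursus arctos", 1.0)]),
]

default_run = apply_threshold(entities)        # default threshold of 0.7
strict_run = apply_threshold(entities, 0.85)   # same as -t 0.85
print([m for m, _ in default_run])
print([m for m, _ in strict_run])
```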
$ taxonerd ask -m en_ner_eco_biobert --with-abbrev -f ./tests/test_data/test_txt/test1.txt
T0 LIVB 4 21 pinewood nematode
T1 LIVB 23 26 PWN
T2 LIVB 29 55 Bursaphelenchus xylophilus
T3 LIVB 57 70 B. xylophilus
T4 LIVB 99 108 pine wilt
T5 LIVB 196 204 Serratia
T6 LIVB 257 260 PWN
T7 LIVB 342 364 Serratia grimesii BXF1
T8 LIVB 387 390 PWN
T9 LIVB 440 444 BXF1
$ taxonerd ask -m en_ner_eco_biobert -i ./tests/test_data/test_txt -o test_ann
$ ls test_ann/
test1.ann test2.ann
$ cat test_ann/test2.ann
T0 LIVB 700 711 Brown bears
T1 LIVB 713 725 Ursus arctos
T2 LIVB 1062 1073 brown bears
T3 LIVB 1161 1172 brown bears
T4 LIVB 1339 1350 brown bears
T5 LIVB 1555 1565 brown bear
T6 LIVB 1782 1793 brown bears
T7 LIVB 1863 1874 brown bears
T8 LIVB 1958 1969 brown bears
T9 LIVB 1974 1980 salmon
T10 LIVB 2026 2037 brown bears
T11 LIVB 2219 2230 brown bears
T12 LIVB 2392 2401 Sika deer
T13 LIVB 2403 2416 Cervus nippon
T14 LIVB 2555 2559 deer
T15 LIVB 2594 2604 brown bear
T16 LIVB 2798 2808 brown bear
T17 LIVB 3146 3150 deer
T18 LIVB 3188 3199 chum salmon
T19 LIVB 3201 3218 Oncorhynchus keta
T20 LIVB 3280 3289 Sika deer
T21 LIVB 3350 3361 pink salmon
T22 LIVB 3363 3375 O. gorbuscha
T23 LIVB 3381 3392 chum salmon
T24 LIVB 3518 3528 Brown bear
T25 LIVB 4001 4012 brown bears
T26 LIVB 4071 4082 brown bears
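If you need the annotations back in Python, lines like the ones above can be parsed with a few lines of standard-library code. This is a rough sketch that assumes whitespace-separated fields in the order shown (identifier, label, start, end, mention) and no entity-linking column; `Annotation` and `parse_ann_line` are hypothetical names, not part of TaxoNERD.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann_line(line: str) -> Annotation:
    # Split into at most 5 fields so that multi-word mentions stay intact.
    ann_id, label, start, end, text = line.strip().split(None, 4)
    return Annotation(ann_id, label, int(start), int(end), text)


# Two lines copied from test2.ann above.
lines = [
    "T12 LIVB 2392 2401 Sika deer",
    "T13 LIVB 2403 2416 Cervus nippon",
]
parsed = [parse_ann_line(line) for line in lines]
for ann in parsed:
    print(ann.ann_id, ann.start, ann.end, ann.text)
```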
>>> from taxonerd import TaxoNERD
>>> taxonerd = TaxoNERD(prefer_gpu=False)
>>> nlp = taxonerd.load(model="en_ner_eco_md", exclude=[], linker="taxref", threshold=0.7)
>>> nlp.pipe_names
['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'pysbd_sentencizer', 'parser', 'ner', 'taxo_abbrev_detector', 'taxon_linker']
N.B. By default, all components are included in the pipeline. Use the exclude argument to specify the components to exclude. Excluded components won’t be loaded. This may speed up the detection process. The minimal pipeline for taxonomic NER is ['tok2vec', 'ner'].
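One way to build the exclude list for the minimal pipeline is to start from the component names printed by nlp.pipe_names and remove the two you want to keep. This is only a sketch: the names are copied from the output above and may differ for other models or versions, and the final call is left as a comment because it needs TaxoNERD and a model installed.

```python
# Component names copied from the nlp.pipe_names output shown above.
all_components = [
    "tok2vec", "tagger", "attribute_ruler", "lemmatizer", "pysbd_sentencizer",
    "parser", "ner", "taxo_abbrev_detector", "taxon_linker",
]

# Components required for taxonomic NER only.
minimal = {"tok2vec", "ner"}

# Everything else goes into the exclude list, in pipeline order.
exclude = [name for name in all_components if name not in minimal]
print(exclude)

# The list could then be passed to load(), for example:
# nlp = taxonerd.load(model="en_ner_eco_md", exclude=exclude)
```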
>>> taxonerd.find_in_text("Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivore")
offsets text entity sent
T0 LIVB 13 25 Ursus arctos [(TAXREF:60826, Ursus arctos, 1.0)] 0
>>> taxonerd.find_in_file("./tests/test_data/test_txt/test2.txt", output_dir=None)
offsets text entity sent
T0 LIVB 713 725 Ursus arctos [(TAXREF:60826, Ursus arctos, 1.0)] 4
T1 LIVB 1974 1980 salmon [(TAXREF:730671, Salmonia, 0.85158771276474)] 12
T2 LIVB 2392 2401 Sika deer [(TAXREF:61025, Sika Deer, 1.0)] 14
T3 LIVB 2403 2416 Cervus nippon [(TAXREF:61025, Cervus nippon, 1.0)] 14
T4 LIVB 3135 3141 salmon [(TAXREF:730671, Salmonia, 0.85158771276474)] 18
T5 LIVB 3146 3150 deer [(TAXREF:186210, deer, 1.0)] 18
T6 LIVB 3188 3199 chum salmon [(TAXREF:730671, Salmonia, 0.7018352746963501)] 19
T7 LIVB 3201 3218 Oncorhynchus keta [(TAXREF:195439, Oncorhynchus, 0.8319146037101... 19
T8 LIVB 3280 3289 Sika deer [(TAXREF:61025, Sika Deer, 1.0)] 19
T9 LIVB 3350 3361 pink salmon [(TAXREF:67798, Pink Salmon, 1.0)] 20
T10 LIVB 3381 3392 chum salmon [(TAXREF:730671, Salmonia, 0.7018352746963501)] 20
T11 LIVB 3481 3485 deer [(TAXREF:186210, deer, 1.0)] 20
>>> taxonerd.find_in_corpus("./tests/test_data/test_txt", "./test_ann")
{'test1.txt': './test_ann/test1.ann', 'test2.txt': './test_ann/test2.ann'}
>>> from taxonerd import TaxoNERD
>>> taxonerd = TaxoNERD(prefer_gpu=True)
>>> nlp = taxonerd.load(model="en_ner_eco_biobert")
>>> doc = nlp("Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivore")
>>> doc.ents
(Brown bears, Ursus arctos)
>>> [tok.lemma_ for tok in doc]
['Brown', 'bear', '(', 'ursus', 'arcto', ')', ',', 'which', 'be', 'widely', 'distribute', 'throughout', 'the', 'northern', 'hemisphere', ',', 'be', 'recognise', 'as', 'opportunistic', 'omnivore']
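Because the object returned here is a regular spaCy Doc, the standard Span attributes (label_, start_char, end_char, text) can be used to rebuild rows similar to the command-line output. The sketch below uses a stand-in object so that it runs without a model installed; `ents_to_rows` is a hypothetical helper, and the LIVB label is assumed from the command-line output rather than verified on a real Doc.

```python
from types import SimpleNamespace


def ents_to_rows(doc):
    """Return (label, start_char, end_char, text) for each entity in a Doc."""
    return [(ent.label_, ent.start_char, ent.end_char, ent.text) for ent in doc.ents]


# Stand-in for the Doc returned by nlp(...); with TaxoNERD installed,
# pass the real Doc instead.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(label_="LIVB", start_char=0, end_char=11, text="Brown bears"),
    SimpleNamespace(label_="LIVB", start_char=13, end_char=25, text="Ursus arctos"),
])

rows = ents_to_rows(fake_doc)
for row in rows:
    print(*row)
```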
More examples in our demo Notebook.
License: MIT
TaxoNERD was written by nleguillarme.