opensemanticsearch / open-semantic-entity-search-api

Open Source REST API for named entity extraction, named entity linking, named entity disambiguation, recommendation & reconciliation of entities like persons, organizations and places for (semi)automatic semantic tagging & analysis of documents by linked data knowledge graph like SKOS thesaurus, RDF ontology, database(s) or list(s) of names
https://opensemanticsearch.org/doc/datamanagement/named_entity_recognition
GNU General Public License v3.0
178 stars 32 forks

Stemming for entity extraction #34

Closed by opensemanticsearch 6 years ago

opensemanticsearch commented 6 years ago

(Optional) integration of stemming for dictionary/ontology/thesaurus based entity extraction

Mandalka commented 6 years ago

Added Hunspell rules and dictionaries to the Solr config of the named entities index / core.

Mandalka commented 6 years ago

Implemented export of optional stemming fields from ontologies / SKOS thesauri. The stemming options are configured per ontology in the Ontologies Manager of Open Semantic Search Apps, and the Solr Ontology Tagger exports the corresponding fields to the entities index.

Todo: Configure the stemmer analyzers / field types in the Solr entities core, and set the ETL entity extraction config from that configuration out of the box.

Mandalka commented 6 years ago

Entity extraction / entity linking now support multiple taggers / analyzers / stemmers.

YoannMR commented 6 years ago

That sounds great!

@Mandalka Could you please give an example of how it differs from previous entity linking? What does "multiple taggers" mean? What type of taggers are there?

Mandalka commented 6 years ago

The previous entity extraction shingled the tokens of the full text while temporarily indexing it to a Solr core, and used a KeepWordFilterFactory with a plain-text list of the entity labels (generated for that purpose).
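To illustrate the shingle-and-keep-words idea described above, here is a minimal sketch in plain Python. It is not the project's code and does not use Solr; the function names, the example labels and the whitespace tokenizer are assumptions chosen for illustration only.

```python
def shingles(tokens, max_size=3):
    """Yield every run of 1..max_size consecutive tokens, joined by spaces."""
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size > len(tokens):
                break
            yield " ".join(tokens[start:start + size])


def extract_entities(text, labels, max_size=3):
    """Keep only those shingles that appear in the label list (the "keep words")."""
    keep = {label.lower() for label in labels}
    tokens = text.lower().split()  # assumption: naive whitespace tokenizer
    found = []
    for shingle in shingles(tokens, max_size):
        if shingle in keep and shingle not in found:
            found.append(shingle)
    return found


# Hypothetical label list and text, for illustration only.
labels = ["Angela Merkel", "Berlin", "European Union"]
text = "Angela Merkel met European Union officials in Berlin"
print(extract_entities(text, labels))
# ['angela merkel', 'european union', 'berlin']
```

Because the comparison is an exact string match, an inflected form that is not in the label list would not be found, which is the gap that stemming is meant to close.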

With the new Solr Text Tagger, matching / extracting entities can be done by the new tagger request handler without temporarily indexing the text as a document. It uses optimized data structures in the entities index, which can now be fully configured by REST API / standard Solr data posts instead of using and managing additional plain-text files for matching. This index is more powerful than the KeepWordFilterFactory: for example, matching against the full text can use more Solr analyzers such as stemmers, and the index has optimized data structures for matching (FST).
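As a rough sketch of what a call to a tagger request handler can look like, the snippet below only builds the HTTP request and does not send it. The Solr URL, the core name `entities`, the handler path `/tag` and the field name `label_tag` are assumptions; the real names depend on the Solr configuration in use. The query parameters (`field`, `fl`, `overlaps`, `matchText`, `wt`) are those described in the Solr Reference Guide for the tagger handler.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_tag_request(text,
                      solr_url="http://localhost:8983/solr",  # assumption
                      core="entities",                         # assumption
                      field="label_tag"):                      # assumption
    """Build (but do not send) a POST request for a Solr tagger handler."""
    params = urlencode({
        "field": field,        # indexed field holding the entity labels
        "fl": "id",            # fields to return for each matched entity
        "overlaps": "NO_SUB",  # drop tags fully contained in a longer tag
        "matchText": "true",   # include the matched text in the response
        "wt": "json",
    })
    return Request(
        url=f"{solr_url}/{core}/tag?{params}",
        data=text.encode("utf-8"),  # the full text goes in the request body
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


request = build_tag_request("Angela Merkel met officials in Berlin")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen(request)`) requires a running Solr instance with a tagger handler configured for the chosen field.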

So "tagger" means a Solr Text Tagger, which extracts entities by matching the entities from a field in the entities index against the full text. Multiple taggers are used, for example, if multiple stemmers such as the Porter and Hunspell stemmers are configured for the ontology or thesaurus. Each tagger can then use a different index field with its own analyzer / stemmer settings.
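The idea of running several taggers, one per stemmed field, can be sketched in plain Python as follows. This is an illustration only: the two toy stemmers are stand-ins (they are not Porter or Hunspell), and the field names, labels and helper functions are hypothetical.

```python
def suffix_stem(token):
    """Crude plural stripping; a stand-in for an algorithmic stemmer."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


# Tiny lookup table; a stand-in for a dictionary-based stemmer.
LEMMAS = {"mice": "mouse", "geese": "goose"}


def dictionary_stem(token):
    return LEMMAS.get(token, token)


def analyze(text, stem):
    """Lowercase, split on whitespace and stem every token."""
    return tuple(stem(token) for token in text.lower().split())


def build_index(labels, stem):
    """Map each analyzed label to its original form (one 'index field')."""
    return {analyze(label, stem): label for label in labels}


def tag(text, index, stem, max_size=3):
    """Return the labels whose analyzed form occurs in the analyzed text."""
    tokens = analyze(text, stem)
    found = set()
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            label = index.get(tokens[start:start + size])
            if label:
                found.add(label)
    return found


labels = ["database", "mouse"]
# One tagger per (hypothetical) index field, each with its own stemmer.
taggers = {"label_stem_suffix": suffix_stem, "label_stem_dict": dictionary_stem}
text = "Two mice chewed on the databases"

results = {field: tag(text, build_index(labels, stem), stem)
           for field, stem in taggers.items()}
entities = set().union(*results.values())
print(results)
print(sorted(entities))
```

In this toy example each tagger finds an entity the other misses, so the union of the results covers more inflected forms than either stemmer alone.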

YoannMR commented 6 years ago

@Mandalka thanks a lot for your detailed answer!

Mandalka commented 6 years ago

Implemented automatic configuration of stemming in the ETL plugin for entity extraction, based on the ontology settings in the Ontologies Manager web UI.

Mandalka commented 6 years ago

Preconfigured some stemmers in the Solr core for the entities index. Their use by the ETL plugin for named entity extraction can be activated for a SKOS thesaurus or ontology in the web config UI of the Ontologies Manager.

YoannMR commented 5 years ago

@Mandalka

Has this new feature been documented? It'd be nice to have a simple example to illustrate how this helps with tagging.

I noticed a new configuration when uploading an ontology (Grammar / stemming (optional)). I tried setting "Use Grammar" and "Force Grammar" to English but got an error message (see below).

It looks like the Hunspell stemmer is only available in Hungarian. Can we add English as well?

Thanks for your help!

Request Method: POST
Request URL: http://localhost/search-apps/ontologies/create
Django Version: 1.11.11
Exception Type: ValueError
Exception Value: unknown url type: ''
Exception Location: /usr/lib/python3.6/urllib/request.py in _parse, line 384
Python Executable: /usr/bin/python3
Python Version: 3.6.7
Python Path: ['/var/lib/opensemanticsearch', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/usr/local/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3/dist-packages/opensemanticetl', '/']
Server time: Sat, 5 Jan 2019 22:21:00 +0000
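For context, `ValueError: unknown url type: ''` is the message Python's `urllib.request` raises when it is given an empty string as the URL. The sketch below reproduces only that message; which setting or form field was empty in the ontology upload is not visible from the report, and the helper name `check_url` and the example URL are hypothetical.

```python
from urllib.request import Request


def check_url(url):
    """Return the error message urllib raises for this URL, or None if it parses."""
    try:
        Request(url)  # parsing happens in the constructor; nothing is fetched
    except ValueError as error:
        return str(error)
    return None


print(check_url(""))                                 # unknown url type: ''
print(check_url("http://localhost/ontology.rdf"))   # None
```

This suggests, without confirming it, that some URL value reached `urllib` empty when the form was submitted.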