Closed: opensemanticsearch closed this issue 6 years ago
Added hunspell rules and dictionaries to Solr config for named entities index / core.
Implemented export of optional stemming fields from ontologies / SKOS thesauri: the stemming options are configured for each ontology in the Ontologies Manager of the Open Semantic Search Apps, and the Solr Ontology Tagger exports the resulting fields to the entities index.
Todo: Configure the stemmer analyzers / field types in the Solr entities core and set the ETL entity extraction config from that configuration out of the box.
Entity extraction / entity linking now support multiple taggers / analyzers / stemmers.
That sounds great!
@Mandalka Could you please give an example of how it differs from previous entity linking? What does "multiple taggers" mean? What type of taggers are there?
The previous entity extraction worked by shingling the tokens of the full text while temporarily indexing it to a Solr core, and using a KeepWordFilterFactory with a plain-text list of the entity labels (generated for that purpose).
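To make the old approach concrete, here is a minimal pure-Python sketch of shingling plus a keep-word list. It only imitates the idea outside of Solr; the function names `shingles` and `extract_entities` and the sample labels are made up for illustration and are not part of Open Semantic Search.

```python
def shingles(tokens, max_size=3):
    """Yield every run of 1..max_size consecutive tokens, joined by a space."""
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                yield " ".join(tokens[start:start + size])


def extract_entities(text, labels, max_size=3):
    """Keep only the shingles that appear in the label list (the keep-word step)."""
    keep = {label.lower() for label in labels}
    tokens = text.lower().split()
    found = []
    for shingle in shingles(tokens, max_size):
        if shingle in keep and shingle not in found:
            found.append(shingle)
    return found


if __name__ == "__main__":
    labels = ["Open Semantic Search", "Solr"]  # illustrative labels only
    print(extract_entities("we index with open semantic search and solr", labels))
    # Without stemming, the plural form does not match the label:
    print(extract_entities("two ontologies", ["ontology"]))
```

The second call returns an empty list, which is the kind of miss that stemming is meant to address.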
With the new Solr Text Tagger, matching/extracting entities can be done without temporarily indexing the text as a document: the new tagger request handler matches it against optimized data structures in the entities index, which can now be fully configured by REST API / standard Solr data posts instead of using/managing additional plain-text files for matching. This index is more powerful than KeepWordFilterFactory: for example, matching against the full text can use more Solr analyzers such as stemmers, and it uses data structures optimized for matching (FST).
So "tagger" means a Solr Text Tagger, which extracts entities by matching the entities from a field in the entities index against the full text. Multiple taggers are used, for example, if multiple stemmers such as the Porter and Hunspell stemmers are configured for the ontology or thesaurus, so we can use multiple / different index fields, each with its own analyzer/stemmer settings.
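As a rough sketch of what "multiple taggers" could look like from the client side, the snippet below builds one Solr tagger request per index field. The Solr URL, the core name, the `/tag` handler path and the two field names are assumptions for illustration (the handler path and fields depend on the actual solrconfig.xml and schema); `field`, `fl`, `overlaps`, `matchText` and `wt` are parameters of Solr's tagger request handler. The code only builds the requests and does not send them.

```python
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr"        # assumption: local Solr instance
ENTITIES_CORE = "opensemanticsearch-entities"  # assumption: name of the entities core
# Hypothetical field names: one plain field and one with a stemming analyzer.
TAGGER_FIELDS = ["label_plain", "label_stemmed_en"]


def build_tag_request(text, field):
    """Build a POST request that asks the tagger to match `text` against `field`."""
    params = urllib.parse.urlencode({
        "field": field,        # index field that serves as the dictionary
        "fl": "id",            # fields to return for matched entities
        "overlaps": "NO_SUB",  # drop tags fully contained in a longer tag
        "matchText": "true",   # include the matched text in the response
        "wt": "json",
    })
    url = f"{SOLR_URL}/{ENTITIES_CORE}/tag?{params}"  # "/tag" is an assumed handler path
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def build_all_tag_requests(text, fields=TAGGER_FIELDS):
    """One request per tagger field, so each analyzer/stemmer gets its own pass."""
    return [build_tag_request(text, field) for field in fields]


if __name__ == "__main__":
    for request in build_all_tag_requests("Two ontologies were imported."):
        print(request.get_method(), request.full_url)
```

Against a running Solr with a matching configuration, each request could be sent with `urllib.request.urlopen(request)` and the results of the different fields merged afterwards.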
@Mandalka thanks a lot for your detailed answer!
Implemented automatic configuration of stemming in the ETL plugin for entity extraction, based on the ontology settings in the Ontologies Manager web UI.
Preconfigured some stemmers in the Solr core for the entities index. Their use by the ETL plugin for named entity extraction can be activated for a SKOS thesaurus or ontology in the Ontologies Manager web config UI.
@Mandalka
Has this new feature been documented? It'd be nice to have a simple example to illustrate how this helps with tagging.
I noticed a new configuration when uploading an ontology (Grammar / stemming (optional)). I tried setting "Use Grammar" and "Force Grammar" to English but got an error message (see below).
It looks like the Hunspell stemmer is only available in Hungarian. Can we add English as well?
Thanks for your help!
Request Method: | POST
---|---
Request URL: | http://localhost/search-apps/ontologies/create
Django Version: | 1.11.11
Exception Type: | ValueError
Exception Value: | unknown url type: ''
Exception Location: | /usr/lib/python3.6/urllib/request.py in _parse, line 384
Python Executable: | /usr/bin/python3
Python Version: | 3.6.7
Python Path: | ['/var/lib/opensemanticsearch', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/usr/local/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3/dist-packages/opensemanticetl', '/']
Server time: | Sat, 5 Jan 2019 22:21:00 +0000
(Optional) integration of stemming for dictionary/ontology/thesaurus based entity extraction