opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction & named entity recognition) & data enrichment (annotation) pipelines & ingestor to a Solr or Elasticsearch index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

problem with umlaut open-semantic-search_18.05.07.deb ubuntu #62

Closed: clamor closed this issue 6 years ago

clamor commented 6 years ago

I've entered some concepts into the thesaurus, with and without umlauts in the label. The concepts belong to different facets (persons, locations). All concepts make it correctly into the Solr opensemanticsearch-entities index. Named entity recognition is disabled, and document languages and OCR are set to English and German.

When indexing a new document (opensemantic-index-file) for the first time, the document only gets tagged with the concepts without umlauts in the label, and only these concepts are displayed in the facets. Looking into the Solr opensemanticsearch index, the document is tagged only with the concepts without umlauts.

After hitting /search-apps/thesaurus/apply ("Tag (new) documents"), the document gets tagged with all concepts correctly. But when indexing the document again ("touch" and "opensemantic-index-file"), the concepts with umlauts in the label are gone again.

The concepts with umlauts never make it into the Neo4j DB.

Mandalka commented 6 years ago

What is the exact content type and/or encoding/charset of the affected document (facet "content type" in the metadata, in the sidebar on the preview, or in the Solr field content_type_ss)?

That would make debugging easier, since my first test on Debian with UTF-8 ran without problems. I'll test on Ubuntu again tomorrow, but would like to try with the same content type/encoding/charset.
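
For reference, one way to look up the detected content type of an indexed document is to read the content_type_ss field directly from Solr. Below is a minimal sketch using the requests library, assuming a local Solr on the default port 8983 and the core name opensemanticsearch used in this thread.

```python
import requests

# Sketch: list the detected content types of recently indexed documents.
# Host/port and core name are assumptions (local Solr defaults / names used in this thread).
response = requests.get(
    "http://localhost:8983/solr/opensemanticsearch/select",
    params={"q": "*:*", "fl": "id,content_type_ss", "rows": 5, "wt": "json"},
)
for doc in response.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("content_type_ss"))
```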

clamor commented 6 years ago

solr says: "content_type_ss":["application/pdf"]

I gave Debian a try, but it shows the same behaviour on debian/stretch64. By the way, the OS locale is LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8".

If it helps, I can send the document I used via private mail.

Mandalka commented 6 years ago

That would help a lot. If it is not sensitive, you can attach it in an email to info@opensemanticsearch.org, or send your email address to that address for a reply with a PGP key.

Mandalka commented 6 years ago

Thanks, that helped a lot. It turned out not to be a problem with the umlauts themselves: both entities with umlauts had special characters around them, such as parentheses or a comma, and the whitespace tokenizer keeps those characters attached to the tokens. So a quoted "Firstname Lastname" produced the tokens "Firstname (with the leading quote) and Lastname" (with the trailing quote), and a name followed by a comma produced the token Name, (with the comma) rather than Name, which for the index is a different word/token.

Solved by changing to StandardTokenizerFactory, which filters out special characters like parentheses and other punctuation: https://github.com/opensemanticsearch/open-semantic-entity-search-api/commit/5b230853e31fd4a747489ddc4a873ee33b1eb70a
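
For illustration, here is a minimal Python sketch (plain string handling, not the actual Solr analyzers) of the tokenization difference described above; the name and the surrounding punctuation are made-up examples.

```python
import re

# A label as it might appear in a document: surrounded by parentheses, quotes and a comma.
label_in_text = '("Jürgen Müller",'

# Roughly what a whitespace tokenizer (solr.WhitespaceTokenizerFactory) produces:
# the punctuation stays attached, so neither token equals the entity term "Müller".
whitespace_tokens = label_in_text.split()
print(whitespace_tokens)   # ['("Jürgen', 'Müller",']

# Roughly what a word-boundary tokenizer (solr.StandardTokenizerFactory) produces:
# the punctuation is dropped and the tokens match the entity terms.
standard_tokens = re.findall(r"\w+", label_in_text)
print(standard_tokens)     # ['Jürgen', 'Müller']
```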

I will release new packages tomorrow that rebuild the schema of the entities index to use the new tokenizer.

clamor commented 6 years ago

I've tested the patch. It works fine with Debian, but no luck with Ubuntu. I will stick with the Debian version for now, but I'll take a closer look at the Ubuntu version. Maybe I'll find a hint.

Mandalka commented 6 years ago

The patch works only for dictionaries (i.e. analysis fields) that are newly added to the entities index Solr core.

Maybe the existing installation's dictionaries are still in the managed-schema on Ubuntu?

Today's release will rebuild the Solr schema analysis fields.

If that is not it, I'll try the new release on Ubuntu today.
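
If it helps to check an existing installation, the Solr Schema API can show which tokenizer each analysis field type currently uses. A rough sketch, assuming a local default Solr and the entities core name opensemanticsearch-entities mentioned above:

```python
import requests

# Sketch: print the tokenizer class of each field type in the entities core, to spot
# analysis fields in managed-schema that still use the old WhitespaceTokenizerFactory.
response = requests.get(
    "http://localhost:8983/solr/opensemanticsearch-entities/schema/fieldtypes",
    params={"wt": "json"},
)
for field_type in response.json()["fieldTypes"]:
    analyzer = field_type.get("analyzer") or field_type.get("indexAnalyzer") or {}
    tokenizer = analyzer.get("tokenizer", {}).get("class")
    if tokenizer:
        print(field_type["name"], "->", tokenizer)
```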

clamor commented 6 years ago

I've recreated the Solr indices on Ubuntu - all entities get tagged now. Thanks.