Closed: clamor closed this issue 6 years ago
What is the exact content type and/or encoding/charset of the affected document (see the "content type" facet in the metadata, the sidebar in the preview, or the Solr field content_type_ss)?
That would make debugging easier, since my first test on Debian with UTF-8 ran without problems. I'll test again on Ubuntu tomorrow, but I'd like to try with the same content type/encoding/charset.
solr says: "content_type_ss":["application/pdf"]
I gave Debian a try, but it shows the same behaviour on debian/stretch64. By the way, the OS locale is LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8".
If it helps, I can send the document I used via private mail.
That would help a lot. If the document is not sensitive, you can attach it in an email to info@opensemanticsearch.org, or send your email address there to get a reply with a PGP key.
Thanks, that helped a lot. The problem was not the umlaut itself: both entities containing the umlaut had special characters around them, such as parentheses or commas. So the whitespace tokenizer turned "Firstname Lastname" into the tokens "(Firstname" and "Lastname)", and a name followed by a comma became the token "Name," rather than "Name", which for the index is a different word/token than "Name".
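The tokenization difference can be illustrated with a small simulation (this is not the actual Solr code, just a sketch of the two behaviours; the sample text is made up):

```python
import re

text = "(Firstname Lastname), Name,"

# Whitespace tokenization keeps punctuation attached to the tokens,
# so "(Firstname" and "Lastname)," never match the dictionary entries.
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['(Firstname', 'Lastname),', 'Name,']

# A standard-style tokenizer also splits on punctuation,
# yielding the bare words that match the entity dictionary.
standard_tokens = re.findall(r"\w+", text)
print(standard_tokens)  # ['Firstname', 'Lastname', 'Name']
```

With whitespace tokenization, "Name," and "Name" are distinct index terms, so the entity tagger never sees a match.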
Solved by changing to StandardTokenizerFactory, which strips special characters like parentheses and punctuation: https://github.com/opensemanticsearch/open-semantic-entity-search-api/commit/5b230853e31fd4a747489ddc4a873ee33b1eb70a
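In the Solr schema of the entities index, the fix amounts to swapping the tokenizer in the analysis field type, roughly like this (the field type name and the lowercase filter here are illustrative assumptions; see the linked commit for the exact schema):

```xml
<!-- Before: whitespace tokenization left punctuation attached to tokens -->
<!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> -->

<!-- After: StandardTokenizer also splits on punctuation -->
<fieldType name="text_entity" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```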
Will release new packages tomorrow that rebuild the schema of the entities index to use the new tokenizer.
I've tested the patch. It works fine on Debian, but no luck with Ubuntu. I'll stick with the Debian version for now, but I'll take a closer look at the Ubuntu version. Maybe I'll find a hint.
The patch only applies to dictionaries (i.e. analysis fields) that are new to the entities index Solr core.
Maybe the cause is existing installation/dictionaries in the managed-schema on Ubuntu?
Today's release will rebuild the Solr schema analysis fields.
If not, I'll try the new release on Ubuntu today.
I've recreated the Solr indices on Ubuntu, and all entities get tagged now. Thanks.
I've entered some concepts into the thesaurus, with and without umlauts in the label. The concepts belong to different facets (persons, locations). All concepts make it correctly into the Solr opensemanticsearch-entities index. Named entity recognition is disabled; document languages and OCR are set to English and German.
When indexing a new document (opensemantic-index-file) for the first time, the document only gets tagged with the concepts without umlauts in the label, and only these concepts are displayed in the facets. Looking into the Solr opensemanticsearch index, the document is only tagged with the concepts without umlauts.
After hitting /search-apps/thesaurus/apply ("Tag (new) documents"), the document gets tagged with all concepts correctly. But when indexing the document again ("touch" and "opensemantic-index-file"), the concepts with umlauts in the label are gone again.
The concepts with umlauts never make it into the Neo4j DB.