opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
https://opensemanticsearch.org
GNU General Public License v3.0
957 stars 166 forks

Multilingual search, OCR and document analysis #17

Closed Mandalka closed 6 years ago

Mandalka commented 7 years ago

Automatic language detection by tika-python

Mandalka commented 7 years ago

Split the writing of the Solr synonyms config by language, so that only the synonyms of the respective language are considered and language-specific stemming works for them, too.
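A minimal sketch of what such a split could look like, assuming the synonyms are available as (language, terms) pairs. The function name `split_synonyms_by_language` and the file names `synonyms_<language>.txt` are hypothetical and not taken from the project. Only the line format (equivalent terms separated by commas) follows Solr's synonyms file format.

```python
from collections import defaultdict


def split_synonyms_by_language(entries):
    """Group (language, terms) pairs into one synonyms file body per language.

    Returns a dict mapping a (hypothetical) file name to its text content.
    """
    lines_by_language = defaultdict(list)
    for language, terms in entries:
        # Solr's synonyms format lists equivalent terms on one comma-separated line
        lines_by_language[language].append(", ".join(terms))
    return {
        f"synonyms_{language}.txt": "\n".join(lines) + "\n"
        for language, lines in lines_by_language.items()
    }


# Example data, invented for illustration
entries = [
    ("en", ["car", "automobile"]),
    ("de", ["Auto", "Wagen", "PKW"]),
    ("en", ["bike", "bicycle"]),
]
files = split_synonyms_by_language(entries)
```

Each resulting file could then be referenced by the analyzer of the matching language-specific field type, so the synonyms pass through that language's stemmer only.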

Mandalka commented 7 years ago

Since my projects have domain-specific vocabularies with few false friends in the other languages, splitting the synonyms config files by language is not necessary yet: we can use one and the same synonyms config for all languages and their stemming analysis.

Mandalka commented 6 years ago

Automatic language detection implemented.

Mandalka commented 6 years ago

Named entity recognition by Stanford NER is now parameterized with classifiers specific to the document language.

Mandalka commented 6 years ago

Open Semantic ETL now supports a multilingual index for additional language-specific analysis, i.e. for language-specific grammar/stemming or synonyms.
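As a rough illustration of the idea (not the actual Open Semantic ETL code), a document's text can be copied into an additional field whose name carries the detected language, so that the index can analyze it with a language-specific stemmer. The field names `content_txt` and `language_s`, the `_<language>` suffix convention and the list of supported languages are assumptions made for this sketch.

```python
# Assumed set of languages for which a language-specific field type exists
SUPPORTED_LANGUAGES = {"en", "de", "fr"}


def add_language_specific_fields(doc, text_field="content_txt", language_field="language_s"):
    """Return a copy of doc with the text duplicated into a language-specific field.

    The generic text field is kept, so documents in unsupported or
    undetected languages stay searchable through it.
    """
    enriched = dict(doc)
    language = doc.get(language_field)
    if language in SUPPORTED_LANGUAGES and text_field in doc:
        # e.g. content_txt -> content_txt_de for a German document
        enriched[f"{text_field}_{language}"] = doc[text_field]
    return enriched


german_doc = {"id": "1", "language_s": "de", "content_txt": "Häuser und Gärten"}
unknown_doc = {"id": "2", "language_s": "xx", "content_txt": "lorem ipsum"}

enriched_german = add_language_specific_fields(german_doc)
enriched_unknown = add_language_specific_fields(unknown_doc)
```

Keeping the generic field alongside the language-specific one is a design choice of this sketch: it gives a fallback when language detection fails or the language has no dedicated analyzer.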

Mandalka commented 6 years ago

The search UI (Solr-PHP-UI) supports the new multilingual search index structure: it searches in multiple language-specific fields with their stemmers/grammars and provides language-specific highlighting.
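Solr-PHP-UI itself is written in PHP; the following Python sketch only illustrates the kind of request parameters such a multilingual search might send. It assumes hypothetical field names (`content_txt` plus one `content_txt_<language>` field per language) and uses Solr's standard `edismax` query parser parameters `qf` (fields to search) and `hl.fl` (fields to highlight).

```python
def build_multilingual_query(query, languages, base_field="content_txt"):
    """Build Solr request parameters that search and highlight in the generic
    field and in every language-specific variant of it."""
    fields = [base_field] + [f"{base_field}_{lang}" for lang in languages]
    return {
        "q": query,
        "defType": "edismax",
        # qf takes a space-separated list of fields to query
        "qf": " ".join(fields),
        "hl": "true",
        # hl.fl takes a comma-separated list of fields to highlight
        "hl.fl": ",".join(fields),
    }


params = build_multilingual_query("houses", ["en", "de"])
```

Because each language-specific field is analyzed with its own stemmer, a match can come from any of them, which is why the highlighting has to cover the same set of fields as the query.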

Mandalka commented 6 years ago

Added more out-of-the-box languages/grammars for stemming, which are used by the multilingual research projects.