opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Setting Stemmer for unlisted languages #139

Open deeplearning101 opened 2 years ago

deeplearning101 commented 2 years ago

Hello, I'm interested in using opensemanticsearch to index documents in Norwegian. Norwegian is not listed in the Document Language section of the setup page at http://[yourserver]/search-apps/setup/.

However, opensemanticsearch bundles versions of Solr and Tika that support Norwegian and many other languages that are not among the languages opensemanticsearch officially supports.

Is it possible to edit the configuration files manually to enable at least stemming (or other grammar-related features) for languages that Solr supports but that are not listed in the opensemanticsearch settings?
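To make the question concrete, here is a rough sketch of the kind of change I have in mind, written against Solr's Schema API rather than anything specific to opensemanticsearch. It only builds an `add-field-type` command for a Norwegian analyzer chain (standard tokenizer, lowercasing, Snowball stemmer) and posts it to a core. The field type name `text_no`, the helper names, and the core URL are my own assumptions. I also don't know whether the bundled Solr core uses a mutable managed schema, or which fields opensemanticsearch would need to be switched to the new type.

```python
import json
import urllib.request


def norwegian_field_type(name="text_no"):
    """Build a Solr Schema API 'add-field-type' command for Norwegian text.

    The field type name is a hypothetical choice; the tokenizer and filter
    classes are standard Solr analysis factories.
    """
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "positionIncrementGap": "100",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [
                    # Lowercase first so the stemmer sees normalized tokens.
                    {"class": "solr.LowerCaseFilterFactory"},
                    # Snowball stemmer configured for Norwegian.
                    {
                        "class": "solr.SnowballPorterFilterFactory",
                        "language": "Norwegian",
                    },
                ],
            },
        }
    }


def post_schema_command(solr_core_url, command):
    """POST a Schema API command to <solr_core_url>/schema and return the JSON reply.

    Assumes the core has a mutable managed schema; this is not verified
    for the Solr instance that opensemanticsearch installs.
    """
    request = urllib.request.Request(
        solr_core_url.rstrip("/") + "/schema",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

If this is roughly the right direction, I would call something like `post_schema_command("http://localhost:8983/solr/opensemanticsearch", norwegian_field_type())` (the URL and core name are guesses on my part) and then reindex my PDFs. What I can't tell is which existing fields would have to use the new type for search to pick up the stemming.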

I only need to search PDFs and have NO need for any of the other language-dependent features (e.g. named entity recognition, OCR, etc.).

I think this request may be of general interest, since it would open opensemanticsearch to users working in languages that are not listed on the official webpage.

Thank you in advance!