Open Aculo0815 opened 1 year ago
Editing the file /etc/opensemanticsearch/etl probably solves your problem, especially changing the lines regarding regex and commenting out the lines with: enhance_extract_email, enhance_extract_phone, enhance_extract_law, enhance_extract_money.
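For reference, a minimal sketch of what that change could look like inside the ETL config (the plugin names are the ones listed above; the exact surrounding lines in your copy of /etc/opensemanticsearch/etl may differ):

```python
# In /etc/opensemanticsearch/etl: comment out the extractors whose
# automatic tags you don't want on your documents.
# config['plugins'].append('enhance_extract_email')
# config['plugins'].append('enhance_extract_phone')
# config['plugins'].append('enhance_extract_law')
# config['plugins'].append('enhance_extract_money')
```

After saving, already-indexed documents keep their old tags until they are re-indexed.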
Y'know, I always wondered whether anyone got those to work. I disabled these as well: it's a great concept, particularly when indexing things like the Panama Papers, but it doesn't seem intelligent enough to parse things out properly without a lot of regex testing.
Great, it works. Thanks a lot. Now I'm ready to install it on a production VMware machine for my dev team.
```python
# -*- coding: utf-8 -*-

config['export'] = 'export_solr'
config['solr'] = 'http://localhost:8983/solr/'
config['index'] = 'opensemanticsearch'

config['plugins'].append('enhance_annotations')

config['plugins'].append('enhance_rdf')

config['ocr_cache'] = '/var/cache/tesseract'
config['plugins'].append('enhance_pdf_ocr')
config['ocr_lang'] = 'eng+deu'

config['plugins'].append('enhance_regex')

config['plugins'].append('enhance_entity_linking')
config['plugins'].append('enhance_ner_spacy')

config['spacy_ner_classifier_default'] = None
config['spacy_ner_classifiers'] = {
    'da': 'da_core_news_sm',
    'de': 'de_core_news_sm',
    'en': 'en_core_web_sm',
    'es': 'es_core_news_sm',
    'fr': 'fr_core_news_sm',
    'it': 'it_core_news_sm',
    'lt': 'lt_core_news_sm',
    'nb': 'nb_core_news_sm',
    'nl': 'nl_core_news_sm',
    'pl': 'pl_core_news_sm',
    'pt': 'pt_core_news_sm',
    'ro': 'ro_core_news_sm',
}

config['stanford_ner_classifier_default'] = None
config['stanford_ner_classifiers'] = {
    'en': '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    'es': '/usr/share/java/stanford-ner/classifiers/spanish.ancora.distsim.s512.crf.ser.gz',
    'de': '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz',
}
config['stanford_ner_path_to_jar'] = "/usr/share/java/stanford-ner/stanford-ner.jar"
config['stanford_ner_java_options'] = '-mx1000m'
```
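For illustration, here is how a language-to-model mapping like `spacy_ner_classifiers` above can be used to pick a spaCy NER model per detected document language. This is a standalone sketch of the lookup pattern, not OSS's actual code; in particular, the fallback behaviour is an assumption:

```python
# Subset of the mapping from the config above.
spacy_ner_classifiers = {
    'de': 'de_core_news_sm',
    'en': 'en_core_web_sm',
    'fr': 'fr_core_news_sm',
}
# As in the config: no default model.
spacy_ner_classifier_default = None

def model_for_language(lang):
    # Unknown language codes fall back to the configured default;
    # a default of None would mean: skip NER for that document.
    return spacy_ner_classifiers.get(lang, spacy_ner_classifier_default)

print(model_for_language('de'))  # de_core_news_sm
```

This also shows why only the listed languages get entity extraction: anything else hits the `None` default.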
I just realized that I had to go a bit further and deactivate those as well: you can simply deactivate the Django facets for e.g. "phone", "currency", and so on (the procedure is maybe a bit hacky, but it works):
1) Create a Django superuser account:
   cd /var/lib/opensemanticsearch
   python3 manage.py createsuperuser
2) Access the Django web interface: http://xxx.xxx.xxx.xxx/search-apps/admin/ >> Thesaurus >> Facets
3) Deactivate the facets in the web interface: click each facet you don't need, set Enabled: "No", Snippets enabled: "No", Graph enabled: "No", and SAVE.
Hi, I installed the latest Open Semantic Search version as a deb package on my Ubuntu 22 LTS Hyper-V machine. I'd like to use OSS for our roughly 1700 docx documents describing non-standard features of our software. Indexing the documents worked without any problems.
My problem is: by default, all docx files get multiple default tags; I think they come from Apache Solr!? Here are some examples:
Is it possible to deactivate these tags in Apache Solr? I tried the following, which didn't work:
I have also looked inside the '/var/opensemanticsearch/db' SQLite DB, but didn't find anything useful.
Does anyone have a hint on how to get rid of the default tags?
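For anyone else poking at that SQLite database: a quick way to see what is actually in it is to list its tables read-only. This is just a generic inspection sketch (the path is the one mentioned above; adjust as needed), not an OSS-specific tool:

```python
import sqlite3

def list_tables(db_path):
    # Open the SQLite file read-only (mode=ro) so nothing can be
    # modified, and return the names of all tables in it.
    con = sqlite3.connect(f'file:{db_path}?mode=ro', uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [name for (name,) in rows]
    finally:
        con.close()

# e.g.: print(list_tables('/var/opensemanticsearch/db'))
```

The facet settings from the Django admin steps above live in this database, so the table names can hint at where to look.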