opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
https://opensemanticsearch.org
GNU General Public License v3.0

Is it possible to deactivate the standard Solr tags like currency, phone numbers, money, law clauses, ...? #455

Aculo0815 opened this issue 1 year ago (status: Open)

Aculo0815 commented 1 year ago

Hi, I installed the latest Open Semantic Search version as a deb package on my Ubuntu 22 LTS Hyper-V machine. I'd like to use OSS for our roughly 1,700 docx documents describing non-standard features of our software. The indexing of the docs worked without any problems.

My problem is: by default, all docx files get tagged with multiple default tags, which I think come from Apache Solr!? Here are some examples: [screenshot]

Is it possible to deactivate these tags in Apache Solr? I tried the following, which didn't work:

josefkarlkraus commented 1 year ago

Editing the file /etc/opensemanticsearch/etl probably solves your problem, especially changing the lines regarding regex and commenting out the lines with enhance_extract_email, enhance_extract_phone, enhance_extract_law, and enhance_extract_money.
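Assuming the stock config file shipped with the deb package, that change amounts to commenting those plugin registrations out; a minimal sketch (the plugin names and the regex list path are the ones mentioned in this thread, so adjust to what your file actually contains):

```python
# /etc/opensemanticsearch/etl -- disable the built-in extractors by
# commenting out the lines that register them as ETL plugins.

# Regex based extraction (e.g. IBANs):
#config['plugins'].append('enhance_regex')
#config['regex_lists'].append('/etc/opensemanticsearch/regex/iban.tsv')

# Email address / phone number / law clause / money extraction:
#config['plugins'].append('enhance_extract_email')
#config['plugins'].append('enhance_extract_phone')
#config['plugins'].append('enhance_extract_law')
#config['plugins'].append('enhance_extract_money')
```

Note that documents already in the index keep their old tags until they are re-indexed; the ETL config only affects documents processed after the change.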

feathered-arch commented 1 year ago

Y'know, I always wondered if anyone got those to work for them. I also disabled these because while it's a great concept, particularly when indexing things like the Panama Papers, it doesn't seem intelligent enough to properly parse things out without resorting to a lot of regex testing.

Aculo0815 commented 1 year ago

Great, it works. Thanks a lot. Now I'm ready to install it on a production VMware machine for my dev team.

Aculo0815 commented 1 year ago

For reference, here is my /etc/opensemanticsearch/etl:

```python
#
# ETL config for connector(s)
#

# print debug messages
#config['verbose'] = True

#
# Languages for language specific index
#

# Each document is analyzed without grammar rules in the index fields like content,
# additionally it can be added/copied to language specific index fields/analyzers
# Document language is autodetected by default plugin enhance_detect_language_tika_server
# If the index supports enhanced analytics for specific languages, we can add/copy data
# to language specific fields/analyzers

# Set which languages are configured and shall be used in the index for language
# specific analysis/stemming/synonyms
# Default / if not set: all languages that are supported will additionally be
# analyzed language specific
#config['languages'] = ['en','de','fr','hu','it','pt','nl','cz','ro','ru','ar','fa']

# force language specific analysis additionally in this language(s) grammar & synonyms,
# even if language autodetection detects another language
#config['languages_force'] = ['en','de']

# only use languages for language specific analysis which are added / uncommented later
#config['languages'] = []

# add English
#config['languages'].append('en')

# add German / Deutsch
#config['languages'].append('de')

# add French / Francais
#config['languages'].append('fr')

# add Hungarian
#config['languages'].append('hu')

# add Spanish
#config['languages'].append('es')

# add Portuguese
#config['languages'].append('pt')

# add Italian
#config['languages'].append('it')

# add Czech
#config['languages'].append('cz')

# add Dutch
#config['languages'].append('nl')

# add Romanian
#config['languages'].append('ro')

# add Russian
#config['languages'].append('ru')

#
# Index/storage
#

#
# Solr URL and port
#

config['export'] = 'export_solr'

# Solr server
config['solr'] = 'http://localhost:8983/solr/'

# Solr core
config['index'] = 'opensemanticsearch'

#
# Elastic Search
#

#config['export'] = 'export_elasticsearch'

# Index
#config['index'] = 'opensemanticsearch'

#
# Tika for text and metadata extraction
#

# Tika server (with tesseract-ocr-cache)
# default: http://localhost:9998
#config['tika_server'] = 'http://localhost:9998'

# Tika server with fake OCR cache of tesseract-ocr-cache, used if OCR runs in later ETL tasks
# default: http://localhost:9999
#config['tika_server_fake_ocr'] = 'http://localhost:9999'

#
# Annotations
#

# add plugin for annotation/tagging/enrichment of documents
config['plugins'].append('enhance_annotations')

# set alternate URL of annotation server
#config['metadata_server'] = 'http://localhost/search-apps/annotate/json'

#
# RDF Knowledge Graph
#

# add RDF Metadata Plugin for granular import of RDF file statements to entities of knowledge graphs
#config['plugins'].append('enhance_rdf')

#
# Config for OCR (automatic text recognition of text in images)
#

# Disable OCR for image files (i.e. for more performance and/or because you don't need
# the text within images or have only photos without photographed text)
#config['ocr'] = False

# Option to disable OCR of embedded images in PDF by Tika,
# so (if the alternate plugin is enabled) OCR will be done only by the alternate
# plugin enhance_pdf_ocr (which otherwise works only as a fallback, if Tika throws exceptions)
#config['ocr_pdf_tika'] = False

# Use OCR cache
config['ocr_cache'] = '/var/cache/tesseract'

# Option to disable OCR cache
#config['ocr_cache'] = None

# Do OCR for images embedded in PDF documents (i.e. designed images or scanned or photographed documents)
config['plugins'].append('enhance_pdf_ocr')

# OCR language
# If other than English, you have to install the package tesseract-XXX (tesseract
# language support) for your language and set ocr_lang to this value (be careful:
# the tesseract package for English is "eng" (not "en"), German is named "deu", not "de"!)

# set OCR language to English/default
#config['ocr_lang'] = 'eng'

# set OCR language to German/Deutsch
#config['ocr_lang'] = 'deu'

# set multiple OCR languages
#config['ocr_lang'] = 'eng+deu'

#
# Regex pattern for extraction
#

# Enable Regex plugin
#config['plugins'].append('enhance_regex')

# Regex config for IBAN extraction
#config['regex_lists'].append('/etc/opensemanticsearch/regex/iban.tsv')

#
# Email address and email domain extraction
#

#config['plugins'].append('enhance_extract_email')

#
# Phone number extraction
#

#config['plugins'].append('enhance_extract_phone')

#
# Config for Named Entities Recognition (NER) and Named Entity Linking (NEL)
#

# Enable Entity Linking / Normalization and dictionary based Named Entities Extraction
# from thesaurus and ontologies
config['plugins'].append('enhance_entity_linking')

# Enable SpaCy NER plugin
config['plugins'].append('enhance_ner_spacy')

# Spacy NER Machine learning classifier (for which language and with which/how many classes)

# Default classifier if no classifier for specific language;
# disable NER for languages where no classifier is defined in config['spacy_ner_classifiers']
#config['spacy_ner_classifier_default'] = None

# Set default classifier to English (only if you are sure that all documents you index are English)
#config['spacy_ner_classifier_default'] = 'en_core_web_sm'

# Set default classifier to German (only if you are sure that all documents you index are German)
#config['spacy_ner_classifier_default'] = 'de_core_news_sm'

# Language specific classifiers (mapping of autodetected document language to Spacy classifier / language)
#
# You have to download additional language classifiers, for example English (en) or German (de), by
# python3 -m spacy download en
# python3 -m spacy download de
# ...
#config['spacy_ner_classifiers'] = { 'da': 'da_core_news_sm', 'de': 'de_core_news_sm', 'en': 'en_core_web_sm', 'es': 'es_core_news_sm', 'fr': 'fr_core_news_sm', 'it': 'it_core_news_sm', 'lt': 'lt_core_news_sm', 'nb': 'nb_core_news_sm', 'nl': 'nl_core_news_sm', 'pl': 'pl_core_news_sm', 'pt': 'pt_core_news_sm', 'ro': 'ro_core_news_sm', }

# Enable Stanford NER plugin
#config['plugins'].append('enhance_ner_stanford')

# Stanford NER Machine learning classifier (for which language and with how many classes,
# which need more computing time)

# Default classifier if no classifier for specific language;
# disable NER for languages where no classifier is defined in config['stanford_ner_classifiers']
#config['stanford_ner_classifier_default'] = None

# Set default classifier to English (only if you are sure that all documents you index are English)
#config['stanford_ner_classifier_default'] = '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz'

# Set default classifier to German (only if you are sure that all documents you index are German)
#config['stanford_ner_classifier_default'] = '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz'

# Language specific classifiers (mapping to autodetected document language)
# Beforehand you have to download additional language classifiers to the configured path
#config['stanford_ner_classifiers'] = { 'en': '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'es': '/usr/share/java/stanford-ner/classifiers/spanish.ancora.distsim.s512.crf.ser.gz', 'de': '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz', }

# If the Stanford NER JAR is not in the standard path
#config['stanford_ner_path_to_jar'] = "/usr/share/java/stanford-ner/stanford-ner.jar"

# Stanford NER Java options like RAM settings
#config['stanford_ner_java_options'] = '-mx1000m'

#
# Law clauses extraction
#

#config['plugins'].append('enhance_extract_law')

#
# Money extraction
#

#config['plugins'].append('enhance_extract_money')

#
# Neo4j graph database
#

# exports named entities and relations to Neo4j graph database

# Enable plugin to export entities and connections to Neo4j graph database
#config['plugins'].append('export_neo4j')

# Neo4j server
#config['neo4j_host'] = 'localhost'

# Username & password
#config['neo4j_user'] = 'xxx'
#config['neo4j_password'] = 'xxx'
```
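If you want to double-check which plugins such a config actually leaves enabled, the `config[...]` syntax suggests the file is plain Python evaluated against a pre-populated `config` dict, so something like the following sketch should work (the stub keys are assumptions read off the file itself, not a documented interface):

```python
# Sanity check: evaluate the ETL config file against a stub `config` dict and
# print what it enables. Assumes the file is executable Python that only
# mutates `config`; keys the real ETL presumably pre-populates are stubbed here.
config = {'plugins': [], 'regex_lists': []}

with open('/etc/opensemanticsearch/etl') as f:
    exec(f.read())

print('enabled plugins:', config['plugins'])
print('regex lists:', config['regex_lists'])
```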

josefkarlkraus commented 1 year ago

I just realized that I had to go a bit further to deactivate those as well: you can simply deactivate the Django facets for e.g. "phone", "currency" and so on (the procedure is maybe a bit hacky, but it works):

1) Create a Django account: cd /var/lib/opensemanticsearch, then run python3 manage.py createsuperuser

2) Access the Django web interface: http://xxx.xxx.xxx.xxx/search-apps/admin/ >> Thesaurus >> Facets

3) Deactivate the facets in the web interface: for every facet you don't need, click on it, set Enabled: "No", Snippets enabled: "No", Graph enabled: "No", then SAVE.
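If there are many facets to click through, the same change can presumably be scripted in the Django shell. The sketch below is guesswork assembled from the admin UI described above: the app/model name (thesaurus.Facet) and the field names (enabled, snippets_enabled, graph_enabled) are assumptions read off the labels "Thesaurus >> Facets" and "Enabled"/"Snippets enabled"/"Graph enabled", so verify them against your installation before running anything.

```python
# Hypothetical bulk-disable of unwanted facets via the Django shell; run with:
#   cd /var/lib/opensemanticsearch && python3 manage.py shell
# Model and field names are ASSUMED from the admin UI labels -- inspect them
# first, e.g. with [f.name for f in Facet._meta.get_fields()].
from thesaurus.models import Facet  # assumed app/model name

UNWANTED = {'phone', 'currency', 'money', 'law', 'email'}

for facet in Facet.objects.all():
    if str(facet).lower() in UNWANTED:  # match on the display name; adjust as needed
        facet.enabled = False           # assumed field; may be a yes/no choice instead
        facet.snippets_enabled = False  # assumed field
        facet.graph_enabled = False     # assumed field
        facet.save()
```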