opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Changes in ETL #63

Closed guenterhack closed 6 years ago

guenterhack commented 6 years ago

OK, this is something exotic but maybe easily taken care of. I'm running the server version of opensemanticsearch on an ARM-based machine with Linaro 8.7, a flavour of Debian. After some work, the .deb version of opensemanticsearch server used to work quite fine, but since the latest version (open-semantic-search_18.05.17.deb) indexing, extraction etc. don't work anymore, so it must have something to do with the changes in ETL. All components are in place, but if I want to index a directory from the command line, it just finds .txt files and doesn't add anything to the index. When I try to put a directory on the queue in the browser interface, syslog says that the cron process has the wrong owner. IMHO it could be a rights issue, but being a newbie to Linux administration, it's hard to find out, so I'd be thankful for a tip where I should search for a solution.

Mandalka commented 6 years ago

There was already a known issue when different users (for example you at the command line and the worker, user opensemanticetl) used Open Semantic ETL: the python-tika library kept a system-wide persistent logfile that was created by whichever user first ran ETL on a file, with only that user's rights.

That issue, https://github.com/opensemanticsearch/open-semantic-etl/issues/61, is fixed in the release that will be available today.

In this release the ownership of the ETL crontab is set explicitly after installation, so even if the repository is not owned by root, the installation should now set it with chown root:root /etc/cron.d/open-semantic-search
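
The ownership requirement can be checked without waiting for the next cron run. The following is a minimal sketch, assuming that cron rejects files in /etc/cron.d that are not owned by root or that are writable by group or others (the usual behaviour of Debian's cron). The helper name `cron_file_problems` is made up for this illustration and is not part of Open Semantic ETL.

```python
import os
import stat


def cron_file_problems(path):
    """Return a list of reasons why cron might ignore the file at `path`.

    Hypothetical helper: it only inspects owner and write bits.
    """
    st = os.stat(path)
    problems = []
    if st.st_uid != 0:
        problems.append("owner uid is %d, expected 0 (root)" % st.st_uid)
    if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append("file is writable by group or others")
    return problems


# Usage on the affected machine (path taken from the comment above):
#   print(cron_file_problems("/etc/cron.d/open-semantic-search"))
```

An empty list means neither of the two checked conditions applies; it does not prove that cron accepts the file.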

No idea why it finds only .txt files. Did anything change in the blacklists? Please reopen if today's release does not solve this as well.

Mandalka commented 6 years ago

Another thing about file rights: opensemanticsearch-index-file is not parallel and runs with the rights of the user who starts it, so it can read every file that user has access to. If you use the web config or the parallel opensemanticsearch-index-dir instead, the worker runs as user opensemanticetl, and that user must have the right to read the files you want to index.
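
Whether the worker user can read a given file can be estimated from the file's permission bits. The sketch below is an approximation under stated assumptions: it looks at the file itself only, ignores ACLs, and does not check that every parent directory is searchable (has the execute bit) for that user. The function names `can_read` and `ids_for_user` are hypothetical.

```python
import grp
import os
import pwd
import stat


def can_read(path, uid, gids):
    """Apply the classic owner/group/other read check to `path`."""
    st = os.stat(path)
    if uid == 0:
        return True  # root bypasses the permission bits
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)


def ids_for_user(username):
    """Look up the uid and all group ids of a local user (KeyError if unknown)."""
    user = pwd.getpwnam(username)
    gids = {user.pw_gid}
    gids.update(g.gr_gid for g in grp.getgrall() if username in g.gr_mem)
    return user.pw_uid, gids


# Usage with the worker user named above:
#   uid, gids = ids_for_user("opensemanticetl")
#   print(can_read("/path/to/some/file", uid, gids))
```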

guenterhack commented 6 years ago

Thanks for your suggestions and your help! Unfortunately, the problem isn't solved. I've installed the new version over the old one. It installed just fine but it threw a host of Python errors at the end of the installing process:

Traceback (most recent call last):
  File "/var/lib/opensemanticsearch/manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/lib/python3/dist-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
    utility.execute()
  File "/usr/lib/python3/dist-packages/django/core/management/__init__.py", line 359, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 294, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 342, in execute
    self.check()
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 374, in check
    include_deployment_checks=include_deployment_checks,
  File "/usr/lib/python3/dist-packages/django/core/management/commands/migrate.py", line 62, in _run_checks
    issues.extend(super(Command, self)._run_checks(**kwargs))
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 361, in _run_checks
    return checks.run_checks(**kwargs)
  File "/usr/lib/python3/dist-packages/django/core/checks/registry.py", line 81, in run_checks
    new_errors = check(app_configs=app_configs)
  File "/usr/lib/python3/dist-packages/django/core/checks/urls.py", line 14, in check_url_config
    return check_resolver(resolver)
  File "/usr/lib/python3/dist-packages/django/core/checks/urls.py", line 24, in check_resolver
    for pattern in resolver.url_patterns:
  File "/usr/lib/python3/dist-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/usr/lib/python3/dist-packages/django/urls/resolvers.py", line 313, in url_patterns
    patterns = getattr(self.urlconf_module, "urlpatterns", self.urlconf_module)
  File "/usr/lib/python3/dist-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/usr/lib/python3/dist-packages/django/urls/resolvers.py", line 306, in urlconf_module
    return import_module(self.urlconf_name)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 673, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/var/lib/opensemanticsearch/opensemanticsearch/urls.py", line 10, in <module>
    url(r'^api/', include('api.urls', namespace="api")),
  File "/usr/lib/python3/dist-packages/django/conf/urls/__init__.py", line 50, in include
    urlconf_module = import_module(urlconf_module)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 673, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/var/lib/opensemanticsearch/api/urls.py", line 3, in <module>
    from api import views
  File "/var/lib/opensemanticsearch/api/views.py", line 7, in <module>
    from opensemanticetl.tasks import enrich
  File "/usr/lib/python3/dist-packages/opensemanticetl/tasks.py", line 25, in <module>
    etl_delete = Delete()
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl_delete.py", line 18, in __init__
    self.read_configfiles()
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl_delete.py", line 48, in read_configfiles
    self.read_configfile('/etc/opensemanticsearch/etl')
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 60, in read_configfile
    exec(open(configfile).read(), locals())
  File "<string>", line 7
    print debug messages
    ^
IndentationError: Missing parentheses in call to 'print'
mkdir: cannot create directory ‘/var/opensemanticsearch’: File exists
Module wsgi already enabled
Traceback (most recent call last):
  File "/var/lib/opensemanticsearch/manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/lib/python3/dist-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
    utility.execute()
  File "/usr/lib/python3/dist-packages/django/core/management/__init__.py", line 359, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 294, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 342, in execute
    self.check()
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 374, in check
    include_deployment_checks=include_deployment_checks,
  File "/usr/lib/python3/dist-packages/django/core/management/base.py", line 361, in _run_checks
    return checks.run_checks(**kwargs)
  File "/usr/lib/python3/dist-packages/django/core/checks/registry.py", line 81, in run_checks
    new_errors = check(app_configs=app_configs)
  File "/usr/lib/python3/dist-packages/django/core/checks/urls.py", line 14, in check_url_config
    return check_resolver(resolver)
  File "/usr/lib/python3/dist-packages/django/core/checks/urls.py", line 24, in check_resolver
    for pattern in resolver.url_patterns:
  File "/usr/lib/python3/dist-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/usr/lib/python3/dist-packages/django/urls/resolvers.py", line 313, in url_patterns
    patterns = getattr(self.urlconf_module, "urlpatterns", self.urlconf_module)
  File "/usr/lib/python3/dist-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/usr/lib/python3/dist-packages/django/urls/resolvers.py", line 306, in urlconf_module
    return import_module(self.urlconf_name)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 673, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/var/lib/opensemanticsearch/opensemanticsearch/urls.py", line 10, in <module>
    url(r'^api/', include('api.urls', namespace="api")),
  File "/usr/lib/python3/dist-packages/django/conf/urls/__init__.py", line 50, in include
    urlconf_module = import_module(urlconf_module)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 673, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/var/lib/opensemanticsearch/api/urls.py", line 3, in <module>
    from api import views
  File "/var/lib/opensemanticsearch/api/views.py", line 7, in <module>
    from opensemanticetl.tasks import enrich
  File "/usr/lib/python3/dist-packages/opensemanticetl/tasks.py", line 25, in <module>
    etl_delete = Delete()
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl_delete.py", line 18, in __init__
    self.read_configfiles()
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl_delete.py", line 48, in read_configfiles
    self.read_configfile('/etc/opensemanticsearch/etl')
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 60, in read_configfile
    exec(open(configfile).read(), locals())
  File "<string>", line 7


Is there a way by which I could uninstall opensemanticsearch and start again from scratch without having to re-install the whole OS?

Mandalka commented 6 years ago

It looks like an old config is still there and is corrupt: the comment "print debug messages", which only describes the option below it, has been uncommented.
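
The traceback shows the config being loaded with exec(open(configfile).read(), locals()), so any line that is not valid Python stops every command that imports the ETL. A small sketch of a syntax check that could be run on the file beforehand follows. The function name `config_syntax_error` and the option `config['verbose']` are hypothetical and only stand in for the real option below the comment.

```python
def config_syntax_error(source, filename="<config>"):
    """Return None if `source` compiles as Python, else a short description.

    compile() only parses the text; nothing in the config is executed.
    """
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return "line %s: %s" % (err.lineno, err.msg)
    return None


# A config fragment with the comment intact, and one where the '#' was lost.
good = "# print debug messages\nconfig['verbose'] = False\n"
bad = "print debug messages\nconfig['verbose'] = False\n"

print(config_syntax_error(good))
print(config_syntax_error(bad))
```

To check the real file, the same function can be called with the contents of /etc/opensemanticsearch/etl, the path named in the traceback.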

You can reinstall the deb package so that the configs are overwritten with default values (this depends on the package installer, which should ask and show a diff against your current configs, since they changed over time). The defaults should work, and you can then set the options in the web config UI.

Mandalka commented 6 years ago

Ah, and sometimes (depending on upgrades of Tika, the ETL workers and so on) a reboot helps to get the services running again, since some init.d scripts on Debian have problems restarting via service name restart.

guenterhack commented 6 years ago

Thanks again! I deleted /etc/opensemanticsearch to get rid of old config files, then re-installed. Installation errors:

chgrp: cannot access '/etc/opensemanticsearch/ocr/dictionary.txt': No such file or directory
chmod: cannot access '/etc/opensemanticsearch/ocr/dictionary.txt': No such file or directory
chmod: cannot access '/etc/opensemanticsearch/ocr/dictionary.txt': No such file or directory
chgrp: cannot access '/etc/opensemanticsearch/etl-webadmin': No such file or directory
chmod: cannot access '/etc/opensemanticsearch/etl-webadmin': No such file or directory
chmod: cannot access '/etc/opensemanticsearch/etl-webadmin': No such file or directory
Module wsgi already enabled
System check identified some issues:

WARNINGS:
thesaurus.Concept.groups: (fields.W340) null has no effect on ManyToManyField.


If I want to index my files, I get the following error:

Exception while data enrichment of (Path to File) with plugin filter_blacklist: 'FileNotFoundError' object has no attribute 'message'

I did not modify the ETL blacklists.

Mandalka commented 6 years ago

If you deleted the old configs, run dpkg --purge open-semantic-search, since by default the package manager will not reinstall or overwrite deleted configs unless told to by an option. These configs (such as the blacklist) are now missing. So purge, reinstall and reboot should help.
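
A side note on the wording of the reported error: "'FileNotFoundError' object has no attribute 'message'" is what Python 3 raises when code reads the `message` attribute of an exception, which Python 3 exceptions do not have. The underlying problem is still the missing file. The sketch below shows a Python 3 compatible way to report such an error; `describe_error` and `read_list_file` are hypothetical names, and this is not the code of the filter_blacklist plugin.

```python
def describe_error(exc):
    """Format an exception without relying on a `.message` attribute."""
    return "%s: %s" % (type(exc).__name__, exc)


def read_list_file(path):
    """Return the non-empty lines of `path`, or an empty list if it is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    except FileNotFoundError as exc:
        # str(exc) contains the errno text and the file name.
        print("Skipping list file,", describe_error(exc))
        return []
```

Treating a missing list as empty is a design choice of this sketch only; a real plugin might prefer to stop with a clear error instead.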

guenterhack commented 6 years ago

Purge doesn't delete everything, maybe that's the problem. Now I get no errors @ installing but indexing doesn't work. The system neither performs ETL nor builds an index.

Mandalka commented 6 years ago

Purge stops the package manager from tracking the (already deleted) configs, so a new installation reinstalls them instead of protecting them.

After reinstall and reboot it still doesn't work? (Solr, for example, is installed by the Solr installer and not by a deb package, so not everything is removed from /var/solr.)

Maybe it is "only" slow because of named entity recognition by spaCy, OCR and Neo4j, which can be disabled in the UI under "Config".

guenterhack commented 6 years ago

No, I know what it looks like on the command line when it's working. Now if I run opensemanticsearch-index-dir (all rights are OK) it just says "indexing [file]" and runs through the directory, but it doesn't build an index and it doesn't perform the usual tasks like ETL or OCR via tesseract etc. That's interesting because it all worked perfectly on the same machine until I installed the "ETL update" last week.

Mandalka commented 6 years ago

opensemanticsearch-index-file processes file after file in one process. The new opensemanticsearch-index-dir adds the filenames to the RabbitMQ message queue to be processed in parallel by etl_tasks workers, which run as user opensemanticetl, so this user must have access to the files. Maybe the files are not readable for this user?
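
The difference between the two commands can be illustrated with a toy model. This is only a sketch: a standard-library queue and threads stand in for RabbitMQ and the ETL workers, all function names are made up, and the real workers are separate processes running as user opensemanticetl, which threads cannot model. It shows why the command that enqueues files only prints names while the actual work happens elsewhere.

```python
import queue
import threading


def index_sequential(paths):
    """One process handles file after file itself."""
    return [path for path in paths]  # placeholder for the real ETL work


def index_queued(paths, workers=2):
    """The caller only enqueues names; worker threads do the processing."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            path = tasks.get()
            if path is None:  # sentinel: no more work for this worker
                break
            with lock:
                results.append(path)  # placeholder for the real ETL work

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for path in paths:
        tasks.put(path)
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()
    return results
```

With queued processing the order of results is not guaranteed, and a failure inside a worker is not visible to the process that filled the queue.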

guenterhack commented 6 years ago

OK, this did the trick: Purge the old install, delete all the directories not removed by purge, install the latest version, reboot. Indexing & ETL via opensemanticsearch-index-file now works and the web interface is populated correctly. Thank you again for your help!
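
For the step "delete all the directories not removed by purge", a small sketch that only lists what is still on disk may help. The candidate paths are the ones that appear in this thread; other installations may use different ones, and the function name `leftovers` is made up. Nothing is deleted by this code, and /var/solr in particular holds the Solr data, so removing it is a separate decision.

```python
import os

# Paths mentioned in this thread; treat the list as an assumption.
CANDIDATES = [
    "/etc/opensemanticsearch",
    "/var/lib/opensemanticsearch",
    "/var/opensemanticsearch",
    "/var/solr",
]


def leftovers(paths):
    """Return the entries of `paths` that still exist on disk."""
    return [path for path in paths if os.path.exists(path)]


# Usage after purging the package:
#   for path in leftovers(CANDIDATES):
#       print("still present:", path)
```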