Closed guenterhack closed 6 years ago
There was still an issue when different users ran Open Semantic ETL (for example, you at the command line and the worker running as user opensemanticetl): the python-tika library kept a system-wide persistent logfile, created by whichever user first started an ETL run, so it carried only that user's rights.
That issue, https://github.com/opensemanticsearch/open-semantic-etl/issues/61, is solved in the release that will be available today.
In this release the rights of the ETL crontab are set explicitly after installation, so even if the repository is not owned by root, the installation should now set them with chown root:root /etc/cron.d/open-semantic-search
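To check whether the ownership fix took effect, a small script can compare the crontab's owner with root. This is only a minimal sketch and not part of Open Semantic Search: the helper name `is_owned_by` is made up for illustration, and only the crontab path comes from the comment above.

```python
import os


def is_owned_by(path, uid=0, gid=0):
    """Return True if `path` belongs to the given user and group IDs.

    The defaults (0, 0) correspond to root:root.
    """
    info = os.stat(path)
    return info.st_uid == uid and info.st_gid == gid


if __name__ == "__main__":
    # Path taken from the release note above.
    crontab = "/etc/cron.d/open-semantic-search"
    if os.path.exists(crontab):
        print(crontab, "owned by root:root:", is_owned_by(crontab))
    else:
        print(crontab, "not found")
```

If this prints False, running the chown command from the comment above by hand should have the same effect as the installer's fix.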
No idea why it finds only txt files. Any changes to the blacklists? Please reopen if today's release does not solve this as well.
Another thing about file rights: if you use the web config or parallel indexing with opensemanticsearch-index-dir instead of the non-parallel opensemanticsearch-index-file (which runs only with the rights of the user who started it, so it can read every file that user has access to), the worker running as user opensemanticetl must have the right to read the files you want to index.
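One way to find files the worker cannot read is to apply the usual Unix owner/group/other check to each file on behalf of the opensemanticetl user. The sketch below is an assumption-laden illustration, not a tool shipped with Open Semantic Search: the function names are hypothetical, and it looks only at the read bit of each file, ignoring ACLs and the search (execute) permission needed on parent directories.

```python
import os
import pwd
import stat


def can_read(mode, file_uid, file_gid, uid, gids):
    """Classic Unix read check: owner class first, then group, then other."""
    if uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


def unreadable_files(top, username="opensemanticetl"):
    """Yield files under `top` whose mode bits deny read access to `username`."""
    user = pwd.getpwnam(username)
    gids = set(os.getgrouplist(username, user.pw_gid))
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                # Broken symlink or vanished file: report it as unreadable.
                yield path
                continue
            if not can_read(info.st_mode, info.st_uid, info.st_gid,
                            user.pw_uid, gids):
                yield path
```

Iterating over `unreadable_files("/path/to/documents")` (a placeholder path) and printing each result lists the candidates that need a chmod or a group change.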
Thanks for your suggestions and your help! Unfortunately, the problem isn't solved. I've installed the new version over the old one. It installed just fine but it threw a host of Python errors at the end of the installation process:
Traceback (most recent call last):
File "/var/lib/opensemanticsearch/manage.py", line 10, in
Is there a way by which I could uninstall opensemanticsearch and start again from scratch without having to re-install the whole OS?
It seems an old config is still there, and it is corrupt because the comment line "print debug messages" was uncommented; that line is only a comment describing the option below it.
You can reinstall the deb package so that the configs are overwritten with default values, which should work (depending on the package installer, it should ask or show a diff against your current configs, which have changed over time). Then set the options in the web config UI.
Oh, and sometimes (depending on upgrades of Tika, the ETL workers and so on) a reboot helps to get the services started again, since some init.d scripts have trouble restarting on Debian via service name restart.
Thanks again! I deleted /etc/opensemanticsearch to get rid of old config files, then re-installed. Installation errors:
chgrp: cannot access '/etc/opensemanticsearch/ocr/dictionary.txt': No such file or directory
chmod: cannot access '/etc/opensemanticsearch/ocr/dictionary.txt': No such file or directory
chmod: cannot access '/etc/opensemanticsearch/ocr/dictionary.txt': No such file or directory
chgrp: cannot access '/etc/opensemanticsearch/etl-webadmin': No such file or directory
chmod: cannot access '/etc/opensemanticsearch/etl-webadmin': No such file or directory
chmod: cannot access '/etc/opensemanticsearch/etl-webadmin': No such file or directory
Module wsgi already enabled
System check identified some issues:
WARNINGS: thesaurus.Concept.groups: (fields.W340) null has no effect on ManyToManyField.
If I want to index my files, I get the following error:
Exception while data enrichment of (Path to File) with plugin filter_blacklist: 'FileNotFoundError' object has no attribute 'message'
I did not modify the ETL blacklists.
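The wording of this error points to a Python 3 detail: exception objects have no `.message` attribute, so an error handler that reads `e.message` fails with an AttributeError and hides the underlying FileNotFoundError (here, presumably a missing blacklist file). The sketch below reproduces that pattern. `read_blacklist` and both handler functions are hypothetical and are not the actual filter_blacklist plugin code.

```python
def read_blacklist(path):
    """Hypothetical loader: return the non-empty lines of a blacklist file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def describe_failure_broken(path):
    """Handler written in Python 2 style; it fails inside the except block."""
    try:
        read_blacklist(path)
        return None
    except OSError as error:
        # Python 3 exceptions have no `.message`, so this line itself
        # raises AttributeError and masks the original error.
        return error.message


def describe_failure(path):
    """Portable handler: str(error) includes the errno text and file name."""
    try:
        read_blacklist(path)
        return None
    except OSError as error:
        return str(error)
```

Calling `describe_failure` on a missing file returns a text that names the file, which is the information the masked error would have shown.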
If you deleted the old configs, run dpkg --purge open-semantic-search, since by default the package manager will not reinstall or overwrite old configs unless told to by an option. These configs (like the blacklist) are now missing. So purge, reinstall and reboot should help.
Purge doesn't delete everything, maybe that's the problem. Now I get no errors during installation, but indexing doesn't work. The system neither performs ETL nor builds an index.
Purge removes the (already deleted) configs from the package manager's bookkeeping, so a new installation will reinstall them instead of protecting them.
After reinstall and reboot it still doesn't work? (Solr, for example, is installed by the Solr installer and not by the deb package, so not everything is removed from /var/solr.)
Maybe it's "only" slow because of named entity recognition by spaCy, OCR and Neo4j, which can be disabled in the UI under "Config".
No, I know what it looks like on the command line when it's working. Now if I perform an opensemanticsearch-index-dir (all rights are ok) it just says "indexing [file]" and runs through the directory, but it doesn't build an index and it doesn't perform the usual tasks like ETL or OCR via tesseract etc. That's interesting because it all worked perfectly on the same machine until I installed the "ETL update" last week.
opensemanticsearch-index-file processes file after file in a single process. The new opensemanticsearch-index-dir adds the filenames to the RabbitMQ message queue to be processed in parallel by etl_tasks workers, which run as user opensemanticetl, so this user must have access to the files. Maybe the files are not readable for this user?
OK, this did the trick: Purge the old install, delete all the directories not removed by purge, install the latest version, reboot. Indexing & ETL via opensemanticsearch-index-file now works and the web interface is populated correctly. Thank you again for your help!
OK, this is something exotic but maybe easily taken care of. I'm running the server version of opensemanticsearch on an ARM-based machine with linaro 8.7, a flavour of Debian. After some work, the .deb version of opensemanticsearch server used to work quite fine, but since the latest version (open-semantic-search_18.05.17.deb) indexing/extraction etc. don't work anymore, so it must have something to do with the changes in ETL. All components are in place, but if I want to index a directory from the command line, it just finds .txt files and doesn't add anything to the index. When I try to put a directory on the queue in the browser interface, syslog says that the cron process has the wrong owner. IMHO it could be a rights issue, but being a newbie to Linux administration, it's hard to find out, so I'd be thankful for a tip where I should search for a solution.