opensemanticsearch / open-semantic-search

Open source research tool to search, browse, analyze and explore large document collections with a semantic search engine and an open source text mining & text analytics platform (integrates ETL for document processing, OCR for images & PDFs, named entity recognition for persons, organizations & locations, metadata management with thesauri & ontologies, and a search user interface & search apps for full-text search, faceted search & knowledge graphs)
https://opensemanticsearch.org
GNU General Public License v3.0

missing files #132

rafael844 opened 5 years ago

rafael844 commented 5 years ago

Hi, I have 1850 files to index, but only 1768 were indexed. The Solr admin core says: numDocs: 10768, maxDocs: 11069, deletedDocs: 301.

Is there a way, or a log file, to see those deleted docs? Or to see which ones didn't index and why? I saw that a few split zips (zip.001, zip.002, ...) weren't indexed, and neither were a few p7s files. Trying to open those p7s files with a viewer gives an error (Could not create a X509Certificate instance from DER data. Note: SUN's JCE provider only accepts a maximum RSA key length of 2048 bits, and the public key of this certificate might be longer. You can solve this by installing another JCE provider that supports larger key sizes. The underlying error message was: no more data allowed for version 1 certificate).

Could OSS index those files (just add the file and its name) to Solr? I'm using the OSS Desktop VM.

Thanks

YoannMR commented 5 years ago

Hi,

You could check the Solr log at http://localhost:8983/solr/#/~logging (scroll down to get the latest entries), but it may not be a Solr issue.
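
If you'd rather pull the log from the command line, the same history should be available through Solr's admin logging API (the endpoint that logging page is built on). A small sketch, assuming the default Solr on port 8983:

import json
import urllib.request

# Fetch the buffered log events that the Solr admin logging page displays
# (assumes the default Open Semantic Search Solr on localhost:8983).
url = "http://localhost:8983/solr/admin/info/logging?since=0&wt=json"
with urllib.request.urlopen(url) as response:
    data = json.load(response)

for event in data.get("history", {}).get("docs", []):
    print(event.get("time"), event.get("level"), event.get("message"))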

You could try re-indexing a file that failed with verbose mode turned on, so that you get more information on which module is failing (posting the error message here may help as well):

opensemanticsearch-index-file your_file.p7s -v
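
As for indexing just the file and its name when parsing fails: I don't know of a built-in option for that, but as a manual fallback you could post a minimal document straight to Solr's update endpoint yourself. This is only a sketch: the core name "opensemanticsearch" and the title_txt field are my assumptions about a default OSS install, and using the absolute path as the document id follows what the indexer does.

import json
import os
import urllib.request

# Add a bare "name only" document for a file that will not parse,
# via Solr's JSON update API (sketch; core and field names assumed).
path = "/media/sf_arquivos/12977/id_11199_AT_0193_2018.pdf.p7s"
doc = {"id": path, "title_txt": os.path.basename(path)}

request = urllib.request.Request(
    "http://localhost:8983/solr/opensemanticsearch/update?commit=true",
    data=json.dumps([doc]).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.status)  # 200 means Solr accepted the document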

rafael844 commented 5 years ago

Here is a p7s, a zip and a pdf that failed. With another indexing tool that uses Elasticsearch, they worked fine.

root@debian:/home/user# opensemanticsearch-index-file /media/sf_arquivos/12977/id_11199_AT_0193_2018.pdf.p7s -v
Starting plugin enhance_mapping_id
Starting plugin filter_blacklist
Starting plugin filter_file_not_modified
Indexing new file: /media/sf_arquivos/12977/id_11199_AT_0193_2018.pdf.p7s
Starting plugin enhance_extract_text_tika_server
Parsing by Tika Server on http://localhost:9998
2018-10-23 17:18:43,428 [MainThread ] [WARNI] Tika server returned status: 422
Exception while data enrichment of /media/sf_arquivos/12977/id_11199_AT_0193_2018.pdf.p7s with plugin enhance_extract_text_tika_server: 'content'
Starting plugin enhance_detect_language_tika_server
Calling Tika from http://localhost:9998/language/string
Detected language:
Starting plugin enhance_contenttype_group
Exception while data enrichment of /media/sf_arquivos/12977/id_11199_AT_0193_2018.pdf.p7s with plugin enhance_contenttype_group: 'content_type_ss'
Starting plugin enhance_pst
Starting plugin enhance_csv
Starting plugin enhance_file_mtime
File modification time: 2018-10-16T19:49:50Z
Starting plugin enhance_path
Starting plugin enhance_extract_hashtags
Starting plugin enhance_warc
Exception while data enrichment of /media/sf_arquivos/12977/id_11199_AT_0193_2018.pdf.p7s with plugin enhance_warc: 'content_type_ss'
Traceback (most recent call last):
  File "/usr/bin/opensemanticsearch-index-file", line 234, in <module>
    connector.index(filename)
  File "/usr/bin/opensemanticsearch-index-file", line 109, in index
    self.index_file(filename=filename)
  File "/usr/bin/opensemanticsearch-index-file", line 182, in index_file
    parameters, data = self.process(parameters=parameters, data=data)
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 176, in process
    if self.is_plugin_blacklisted_for_contenttype(plugin, parameters, data):
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 117, in is_plugin_blacklisted_for_contenttype
    if filter_blacklist.is_in_list(filename=filename, value=content_type, match="prefix"):
  File "/usr/lib/python3/dist-packages/opensemanticetl/filter_blacklist.py", line 41, in is_in_list
    if value.startswith(line):
AttributeError: 'NoneType' object has no attribute 'startswith'

root@debian:/home/user# opensemanticsearch-index-file /media/sf_arquivos/4492/id_4121ZH“QUANDO\ FALARAM\ DOS\ DANOS\ À\ SAÚDE\,\ CAIU\ A\ FICHA”\ \ ENTREVISTA.PDF -v
Starting plugin enhance_mapping_id
Starting plugin filter_blacklist
Starting plugin filter_file_not_modified
Indexing new file: /media/sf_arquivos/4492/id_4121ZH“QUANDO FALARAM DOS DANOS À SAÚDE, CAIU A FICHA” ENTREVISTA.PDF
Starting plugin enhance_extract_text_tika_server
Parsing by Tika Server on http://localhost:9998
Exception while data enrichment of /media/sf_arquivos/4492/id_4121ZH“QUANDO FALARAM DOS DANOS À SAÚDE, CAIU A FICHA” ENTREVISTA.PDF with plugin enhance_extract_text_tika_server: 'latin-1' codec can't encode character '\u201c' in position 31: ordinal not in range(256)
Starting plugin enhance_detect_language_tika_server
Calling Tika from http://localhost:9998/language/string
Detected language:
Starting plugin enhance_contenttype_group
Exception while data enrichment of /media/sf_arquivos/4492/id_4121ZH“QUANDO FALARAM DOS DANOS À SAÚDE, CAIU A FICHA” ENTREVISTA.PDF with plugin enhance_contenttype_group: 'content_type_ss'
Starting plugin enhance_pst
Starting plugin enhance_csv
Starting plugin enhance_file_mtime
File modification time: 2018-10-16T19:46:40Z
Starting plugin enhance_path
Starting plugin enhance_extract_hashtags
Starting plugin enhance_warc
Exception while data enrichment of /media/sf_arquivos/4492/id_4121ZH“QUANDO FALARAM DOS DANOS À SAÚDE, CAIU A FICHA” ENTREVISTA.PDF with plugin enhance_warc: 'content_type_ss'
Traceback (most recent call last):
  File "/usr/bin/opensemanticsearch-index-file", line 234, in <module>
    connector.index(filename)
  File "/usr/bin/opensemanticsearch-index-file", line 109, in index
    self.index_file(filename=filename)
  File "/usr/bin/opensemanticsearch-index-file", line 182, in index_file
    parameters, data = self.process(parameters=parameters, data=data)
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 176, in process
    if self.is_plugin_blacklisted_for_contenttype(plugin, parameters, data):
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 117, in is_plugin_blacklisted_for_contenttype
    if filter_blacklist.is_in_list(filename=filename, value=content_type, match="prefix"):
  File "/usr/lib/python3/dist-packages/opensemanticetl/filter_blacklist.py", line 41, in is_in_list
    if value.startswith(line):
AttributeError: 'NoneType' object has no attribute 'startswith'

root@debian:/home/user# opensemanticsearch-index-file /media/sf_arquivos/154/id_529_2013\ 05\ 10\ -\ Para\ XXX.zip -v
Starting plugin enhance_mapping_id
Starting plugin filter_blacklist
Starting plugin filter_file_not_modified
Indexing new file: /media/sf_arquivos/154/id_529_2013 05 10 - Para XXX.zip
Starting plugin enhance_extract_text_tika_server
Parsing by Tika Server on http://localhost:9998
2018-10-23 17:27:07,844 [MainThread ] [WARNI] Tika server returned status: 422
Exception while data enrichment of /media/sf_arquivos/154/id_529_2013 05 10 - Para XXX.zip with plugin enhance_extract_text_tika_server: 'content'
Starting plugin enhance_detect_language_tika_server
Calling Tika from http://localhost:9998/language/string
Detected language:
Starting plugin enhance_contenttype_group
Exception while data enrichment of /media/sf_arquivos/154/id_529_2013 05 10 - Para XXX.zip with plugin enhance_contenttype_group: 'content_type_ss'
Starting plugin enhance_pst
Starting plugin enhance_csv
Starting plugin enhance_file_mtime
File modification time: 2018-10-16T19:45:23Z
Starting plugin enhance_path
Starting plugin enhance_extract_hashtags
Starting plugin enhance_warc
Exception while data enrichment of /media/sf_arquivos/154/id_529_2013 05 10 - Para XXX.zip with plugin enhance_warc: 'content_type_ss'
Traceback (most recent call last):
  File "/usr/bin/opensemanticsearch-index-file", line 234, in <module>
    connector.index(filename)
  File "/usr/bin/opensemanticsearch-index-file", line 109, in index
    self.index_file(filename=filename)
  File "/usr/bin/opensemanticsearch-index-file", line 182, in index_file
    parameters, data = self.process(parameters=parameters, data=data)
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 176, in process
    if self.is_plugin_blacklisted_for_contenttype(plugin, parameters, data):
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 117, in is_plugin_blacklisted_for_contenttype
    if filter_blacklist.is_in_list(filename=filename, value=content_type, match="prefix"):
  File "/usr/lib/python3/dist-packages/opensemanticetl/filter_blacklist.py", line 41, in is_in_list
    if value.startswith(line):
AttributeError: 'NoneType' object has no attribute 'startswith'

YoannMR commented 5 years ago

I don't have a clear answer for this :( but I believe I ran into a similar issue. The best I can do is point you to what I'd check next.

I don't think the exceptions from enhance_extract_text_tika_server and enhance_contenttype_group are causing the issue (but I may be wrong). You could debug it with software like Visual Studio Code: index your PDF in debug mode and step through line by line to see what's going on.

The code is located at /usr/lib/python3/dist-packages/opensemanticetl/ (on an Ubuntu machine).

The plugin enhance_warc is causing an exception that leads to an error when calling filter_blacklist.py, which complains because the variable "value" seems to be empty.

I had a similar issue. I did not understand what was causing it, but I worked around the error message by modifying filter_blacklist.py (at /usr/lib/python3/dist-packages/opensemanticetl/), adding a safety check for an empty value to the is_in_list() function, as shown below.

You could give it a try and see if your document is indexed. If that works, we could ask for the fix to be pushed to the official code.

def is_in_list(filename, value, match=None):

    result = False
    listfile = open(filename)

    # added by YMR to fix bug
    if not value:
        listfile.close()
        return result
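
The snippet above shows only the top of the function with the added guard. For reference, here is a self-contained sketch of how the whole patched function might look; everything below the guard is my reconstruction based on the match="prefix" call in the traceback, not the project's exact code.

def is_in_list(filename, value, match=None):
    # Guard: a missing content type arrives as None, so treat it as
    # "not in the list" instead of crashing on value.startswith().
    if not value:
        return False

    result = False
    with open(filename) as listfile:
        for line in listfile:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if match == "prefix":
                if value.startswith(line):
                    result = True
                    break
            elif value == line:
                result = True
                break
    return result
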
rafael844 commented 5 years ago

Thank you, YoannMR. It worked, and the files were inserted into the index, but with a few issues.

root@debian:/home/user# opensemanticsearch-index-file /home/user/Documents/sf_arquivos/12141/id_10090_Uma\ luz\ para\ quem\ sonha\ com\ o\ ’Jardins\ do\ pesadelo’.pdf -v
Starting plugin enhance_mapping_id
Starting plugin filter_blacklist
Starting plugin filter_file_not_modified
Indexing new file: /home/user/Documents/sf_arquivos/12141/id_10090_Uma luz para quem sonha com o ’Jardins do pesadelo’.pdf
Starting plugin enhance_extract_text_tika_server
Parsing by Tika Server on http://localhost:9998
Exception while data enrichment of /home/user/Documents/sf_arquivos/12141/id_10090_Uma luz para quem sonha com o ’Jardins do pesadelo’.pdf with plugin enhance_extract_text_tika_server: 'latin-1' codec can't encode character '\u2019' in position 60: ordinal not in range(256)
Starting plugin enhance_detect_language_tika_server
Calling Tika from http://localhost:9998/language/string
Detected language:
Starting plugin enhance_contenttype_group
Exception while data enrichment of /home/user/Documents/sf_arquivos/12141/id_10090_Uma luz para quem sonha com o ’Jardins do pesadelo’.pdf with plugin enhance_contenttype_group: 'content_type_ss'
Starting plugin enhance_pst
Starting plugin enhance_csv
Starting plugin enhance_file_mtime
File modification time: 2018-10-16T19:49:21Z
Starting plugin enhance_path
Starting plugin enhance_extract_hashtags
Starting plugin enhance_warc
Exception while data enrichment of /home/user/Documents/sf_arquivos/12141/id_10090_Uma luz para quem sonha com o ’Jardins do pesadelo’.pdf with plugin enhance_warc: 'content_type_ss'
Starting plugin enhance_zip
Starting plugin clean_title
Starting plugin enhance_multilingual
Multilinguality: Copied path0_s to path0_s_txt_pt
Multilinguality: Copied title_txt to title_txt_txt_pt
Multilinguality: Copied path_basename_s to path_basename_s_txt_pt
Multilinguality: Copied path2_s to path2_s_txt_pt
Multilinguality: Copied path4_s to path4_s_txt_pt
Multilinguality: Copied path3_s to path3_s_txt_pt
Multilinguality: Copied path1_s to path1_s_txt_pt
Starting plugin enhance_rdf_annotations_by_http_request
Getting Meta from http://localhost/search-apps/annotate/rdf?uri=%2Fhome%2Fuser%2FDocuments%2Fsf_arquivos%2F12141%2Fid_10090_Uma+luz+para+quem+sonha+com+o+%E2%80%99Jardins+do+pesadelo%E2%80%99.pdf
Meta graph has 0 statements.
Checking Facet http://schema.org/address
Checking Facet http://schema.org/Comment
Checking Facet http://www.wikidata.org/entity/Q18810687
Checking Facet http://schema.org/Place
Checking Facet http://www.wikidata.org/entity/Q5
Checking Facet http://www.wikidata.org/entity/Q2221906
Checking Facet http://www.wikidata.org/entity/Q178706
Checking Facet http://www.wikidata.org/entity/Q43229
Checking Facet http://schema.org/location
Checking Facet http://schema.org/Organization
Checking Facet http://schema.org/Person
Checking Facet http://semantic-mediawiki.org/swivt/1.0#specialProperty_dat
Checking Facet http://schema.org/keywords
No semantic mediawiki modification date
Starting plugin enhance_rdf
Exception while data enrichment of /home/user/Documents/sf_arquivos/12141/id_10090_Uma luz para quem sonha com o ’Jardins do pesadelo’.pdf with plugin enhance_rdf: 'content_type_ss'
Starting plugin enhance_regex
Checking regex [\w.-]+@[\w.-]+ for facet email_ss
Starting plugin enhance_entity_linking
Dictionary matches: {'tag_ss': []}
Named Entity Linking: {}
Sending update request to http://localhost:8983/solr/opensemanticsearch/update
[{"etl_enhance_detect_language_tika_server_b": {"set": true}, "path2_s_txt_pt": {"set": "Documents"}, "path0_s": {"set": "home"}, "title_txt_txt_pt": {"set": "id_10090_Uma luz para quem sonha com o \u2019Jardins do pesadelo\u2019.pdf"}, "etl_enhance_pst_b": {"set": true}, "title_txt": {"set": "id_10090_Uma luz para quem sonha com o \u2019Jardins do pesadelo\u2019.pdf"}, "etl_filter_file_not_modified_b": {"set": true}, "path_basename_s_txt_pt": {"set": "id_10090_Uma luz para quem sonha com o \u2019Jardins do pesadelo\u2019.pdf"}, "etl_enhance_multilingual_b": {"set": true}, "id": "/home/user/Documents/sf_arquivos/12141/id_10090_Uma luz para quem sonha com o \u2019Jardins do pesadelo\u2019.pdf", "path_basename_s": {"set": "id_10090_Uma luz para quem sonha com o \u2019Jardins do pesadelo\u2019.pdf"}, "path2_s": {"set": "Documents"}, "etl_enhance_warc_b": {"set": true}, "etl_clean_title_b": {"set": true}, "etl_error_enhance_rdf_t": {"set": "'content_type_ss'"}, "etl_enhance_extract_text_tika_server_b": {"set": true}, "etl_enhance_mapping_id_b": {"set": true}, "etl_error_plugins_ss": {"set": ["enhance_extract_text_tika_server", "enhance_contenttype_group", "enhance_warc", "enhance_rdf"]}, "file_modified_dt": {"set": "2018-10-16T19:49:21Z"}, "etl_error_ss": {"set": ["'content_type_ss'"]}, "etl_enhance_file_mtime_b": {"set": true}, "etl_enhance_path_b": {"set": true}, "etl_filter_blacklist_b": {"set": true}, "etl_error_enhance_warc_t": {"set": "'content_type_ss'"}, "etl_error_enhance_contenttype_group_t": {"set": "'content_type_ss'"}, "etl_error_enhance_extract_text_tika_server_t": {"set": "'latin-1' codec can't encode character '\u2019' in position 60: ordinal not in range(256)"}, "etl_enhance_regex_b": {"set": true}, "etl_enhance_extract_hashtags_b": {"set": true}, "path0_s_txt_pt": {"set": "home"}, "etl_enhance_csv_b": {"set": true}, "path1_s_txt_pt": {"set": "user"}, "etl_enhance_contenttype_group_b": {"set": true}, "path3_s": {"set": "sf_arquivos"}, "language_s": {"set": ""}, "path4_s": {"set": "12141"}, "etl_enhance_entity_linking_b": {"set": true}, "etl_enhance_zip_b": {"set": true}, "path3_s_txt_pt": {"set": "sf_arquivos"}, "etl_enhance_rdf_b": {"set": true}, "etl_enhance_rdf_annotations_by_http_request_b": {"set": true}, "enhance_entity_linking_b": {"set": "true"}, "path4_s_txt_pt": {"set": "12141"}, "path1_s": {"set": "user"}}]
Commiting cached or open transactions to index
Committing to http://localhost:8983/solr/opensemanticsearch/update?commit=true
root@debian:/home/user#

  • zip files were indexed, but it seems one index entry was created for each file inside. When I search for the file inside the zip, or for a word that appears both in the zip file name and in the text inside, it shows two indexed files:

id_529_2013 05 10 - Para Dr. XXXX.zip 2018-10-16T19:45:23Z id_529_2013 05 10 - Para Dr. XXXX.zip

Re: RPPS 2018-10-25T12:26:18Z myemail@domain.com - (Re RPPS).eml in id_529_2013 05 10 - Para Dr. XXXX.zip

The content of the zip file's document is blank; the eml (inside the zip) is fine.

I have limited programming skills, so I can't figure this out myself. I hope the OSS team can fix it.

YoannMR commented 5 years ago

It seems like the fix I proposed bypassed the error message but did not fix content extraction from the documents.

There may be something wrong with the document's content type (as the error log reports for the four plugins below).

"etl_error_plugins_ss": {"set": ["enhance_extract_text_tika_server", "enhance_contenttype_group", "enhance_warc", "enhance_rdf"]})

I do not know how to fix this. You'd need to run the indexing in debug mode on one of the failing files, with software like Visual Studio Code, to understand what's going on.
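
In the meantime, the indexer stores those error fields in Solr itself, so you can at least list every document where a plugin failed; that also answers your earlier question about finding which files did not index cleanly. A sketch, assuming the opensemanticsearch core from your logs:

import json
import urllib.parse
import urllib.request

# List documents whose indexing hit a failing ETL plugin, using the
# etl_error_plugins_ss field visible in the update request above.
params = urllib.parse.urlencode({
    "q": "etl_error_plugins_ss:[* TO *]",  # any document with plugin errors
    "fl": "id,etl_error_plugins_ss,etl_error_ss",
    "rows": "100",
    "wt": "json",
})
url = "http://localhost:8983/solr/opensemanticsearch/select?" + params
with urllib.request.urlopen(url) as response:
    docs = json.load(response)["response"]["docs"]

for doc in docs:
    print(doc["id"], "->", doc.get("etl_error_plugins_ss"))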

@Mandalka may be able to help

rafael844 commented 5 years ago

I deleted the whole index and ran the indexing again. One thing I noticed now: using the default indexing option of the OSS Desktop VM (with YoannMR's fix), it indexed the zip file with blank content. But when I run it from the command line with opensemanticsearch-index-file ......, it indexes the .zip file with blank content and also indexes the file inside (the eml file), as above.

rafael844 commented 5 years ago

Is the opensemanticsearch-index-dir command different from opensemanticsearch-index-file? After making the is_in_list(filename, value, match=None) modification as above, I noticed that opensemanticsearch-index-file /home/documents/myfiles/* indexes many more files than opensemanticsearch-index-dir /home/documents/myfiles/.
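
To see exactly which files the two commands might be treating differently, one could compare the files matched by the shell glob (which only expands the top level of the directory) against a recursive walk. A diagnostic sketch, using the path from the example above:

import glob
import os

base = "/home/documents/myfiles"

# Files the shell would pass to opensemanticsearch-index-file via the glob
top_level = {p for p in glob.glob(os.path.join(base, "*")) if os.path.isfile(p)}

# Files a recursive walk of the same directory finds
walked = set()
for root, _dirs, files in os.walk(base):
    for name in files:
        walked.add(os.path.join(root, name))

print(len(top_level), "top-level files matched by the glob")
print(len(walked), "files found by the recursive walk")
for path in sorted(walked - top_level):
    print("only seen when walking recursively:", path)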

How can I make it the default command in the OSS Desktop VM?

Thanks.