opensemanticsearch / open-semantic-etl

Python-based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elasticsearch index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Office documents get unzipped and indexed as multiple files #66

Closed · bhelou closed this 6 years ago

bhelou commented 6 years ago

Hi,

When OSS processes a file, it checks whether the file is a ZIP archive (via the enhance_zip plugin). If it is, the file gets unzipped and all contained files get indexed.
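
For illustration, a minimal sketch of that behavior (the function name unzip_and_list is hypothetical; the real enhance_zip plugin does more than this):

import zipfile

def unzip_and_list(filename):
    # Hedged sketch of what enhance_zip effectively does: detect a ZIP
    # container and enumerate its members, each of which would then be
    # indexed as its own document.
    if not zipfile.is_zipfile(filename):
        return []
    with zipfile.ZipFile(filename) as archive:
        return archive.namelist()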

Office documents are themselves ZIP archives, but unzipping them is undesirable: each document could expand into dozens of useless files that clutter up the index. OSS accounts for this with a safeguard in /etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype-prefix: if a file is an Office document, it should not be unzipped.
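
For illustration only, such a prefix blacklist could contain Office MIME type prefixes like the following (these exact entries are an assumption, not a quote of the shipped file):

application/vnd.openxmlformats-officedocument
application/vnd.oasis.opendocument
application/msword
application/vnd.ms-excel
application/vnd.ms-powerpoint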

However, the safeguard fails. There is a bug in the is_plugin_blacklisted_for_contenttype function in etl.py. The function checks the content type:

if 'content_type' in data:
    content_type = data['content_type']

This doesn't work, because when the content type is extracted by Tika (in enhance_extract_text_tika_server.py), data is updated in the following way:

# copy Tika fields to (mapped) data fields
for tika_field in parsed["metadata"]:
    if tika_field in self.mapping:
        data[self.mapping[tika_field]] = parsed['metadata'][tika_field]
    else:
        data[tika_field + '_ss'] = parsed['metadata'][tika_field]
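
To make the mismatch concrete, here is a hedged sketch of what the data dict looks like after that mapping (the value is invented for illustration): the content type lands under the key content_type_ss, so a lookup of content_type finds nothing.

# After the Tika mapping above, the content type sits under
# 'content_type_ss' (value invented for illustration).
data = {
    'content_type_ss': 'application/vnd.oasis.opendocument.text',
}

print('content_type' in data)     # False: the blacklist check is skipped
print('content_type_ss' in data)  # True: the key the check should test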

So the bug fix is to change

if 'content_type' in data:
    content_type = data['content_type']

to

if 'content_type_ss' in data:
    content_type = data['content_type_ss']
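
With that change, and the example data dict sketched above, the lookup finds the mapped field again (a quick hedged check; note that a Solr *_ss field may in fact carry a list of values, which the reply below addresses):

data = {'content_type_ss': 'application/vnd.oasis.opendocument.text'}

if 'content_type_ss' in data:
    content_type = data['content_type_ss']

# The Office prefix from the blacklist file matches again.
print(content_type.startswith('application/vnd.oasis.opendocument'))  # True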

Thank you for the OSS software! Bassam

Mandalka commented 6 years ago

Thank you for the bug report!

The Tika content type field is mapped to content_type_ss, not to the former field content_type, which was a single-value field; I forgot to change this in the blacklist function in etl.py.

Blacklisting of plugins should now work with multi-valued content types too, which will be part of new packages tomorrow or Friday.
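
A minimal sketch of such a multi-value-aware check (the function name comes from this issue, but the signature and body here are assumptions; the released implementation may differ):

def is_plugin_blacklisted_for_contenttype(data, blacklisted_prefixes):
    # Sketch only: normalize the mapped Tika field to a list, since a
    # Solr *_ss field can carry multiple values.
    content_types = data.get('content_type_ss', [])
    if not isinstance(content_types, list):
        content_types = [content_types]
    # Blacklist the plugin if any content type starts with one of the
    # blacklisted prefixes (e.g. an Office MIME type prefix).
    for content_type in content_types:
        for prefix in blacklisted_prefixes:
            if content_type.startswith(prefix):
                return True
    return False

office_prefixes = ['application/vnd.oasis.opendocument', 'application/msword']
print(is_plugin_blacklisted_for_contenttype(
    {'content_type_ss': ['application/msword']}, office_prefixes))  # True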

bhelou commented 6 years ago

Thanks! Your fix worked

yoann-mr commented 5 years ago

Hi, is there a way to prevent text files extracted from an HTML document from being indexed as separate files?

The unwanted file id looks like this:

'/good_file.html/unwanted.txt', where '/good_file.html' is the id of the file we want to index

Thanks! Yoann