opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Tika bug that affects enhance_contenttype_group plugin #67

Closed bhelou closed 6 years ago

bhelou commented 6 years ago

Hi,

Some of my PDF files are corrupted in a weird way. When Tika indexes them, instead of classifying these files as MIME type application/pdf (as read from parsed["metadata"]["Content-Type"] in enhance_extract_text_tika_server.py), it classifies them as multiple types (for example, ['application/pdf', 'application/pdf', 'image/png', 'image/png']). Unfortunately, I can't post the PDF files because they are confidential :( I am also not sure what causes them to have multiple MIME types.

This then causes problems downstream in the enhance_contenttype_group plugin. Particularly, at

if data['content_type_ss'].startswith(contenttype):

It throws an error because data['content_type_ss'] is a list (and so doesn't have the startswith method). I've fixed this by checking in enhance_extract_text_tika_server.py whether parsed["metadata"]["Content-Type"] is a list:

if isinstance(parsed["metadata"]["Content-Type"], list):
    # tika.detector doesn't return a list
    from tika import detector
    parsed["metadata"]["Content-Type"] = detector.from_file(filename)
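An equivalent defensive check could also live in the enhance_contenttype_group plugin itself, so the comparison tolerates both shapes of the metadata. A minimal sketch of that idea (the helper name and the example values are mine, not from the actual plugin code):

```python
def content_type_matches(value, prefix):
    # Tika sometimes yields a list of MIME types (one per embedded
    # resource); treat the first entry as the primary content type.
    if isinstance(value, list):
        value = value[0] if value else ""
    return value.startswith(prefix)

# Values mirroring the bug report:
print(content_type_matches("application/pdf", "application/pdf"))                 # True
print(content_type_matches(["application/pdf", "image/png"], "application/pdf"))  # True
```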

The above fixes the error in the enhance_contenttype_group plugin, but the PDF file still doesn't get indexed: Solr throws an error because the Content-Length field is also a list. This leads to the following extended fix

# sometimes parsed["metadata"]["Content-Type"] is a list, which causes problems downstream
if isinstance(parsed["metadata"]["Content-Type"], list):
    # tika.detector doesn't return a list
    from tika import detector
    parsed["metadata"]["Content-Type"] = detector.from_file(filename)

# If "Content-Length" is a list, it causes an error with Solr
if "Content-Length" in parsed["metadata"] and isinstance(parsed["metadata"]["Content-Length"], list):
    parsed["metadata"]["Content-Length"] = parsed["metadata"]["Content-Length"][0]

With the above code, Solr indexes the corrupted PDF file.

Would my proposed fix cause any issues with the rest of OSS? For instance, PowerPoint files with multiple slides and images have multiple MIME types (which causes an error in the enhance_contenttype_group plugin, though Solr still indexes these PowerPoint files). I've attached a PowerPoint file as an example: test.pptx.

Do you have any better ideas on how to address this issue?

Thank you for the OSS software! Bassam

Mandalka commented 6 years ago

Thanks for the bug report and analysis, which helps a lot.

I think Tika returns multiple values when a document contains multiple/different embedded files, such as embedded images in a PowerPoint document or PDF, so I'll change the mapping of all Tika fields to multi-valued field types.
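One way to make every Tika field safe for a multi-valued schema is to normalize all metadata values to lists before indexing. A rough sketch of that idea (this is illustrative, not the actual commit):

```python
def normalize_metadata(metadata):
    # Wrap every scalar value in a list so downstream code can rely on a
    # uniform, multi-valued shape for all Tika metadata fields.
    return {
        key: value if isinstance(value, list) else [value]
        for key, value in metadata.items()
    }

meta = {"Content-Type": ["application/pdf", "image/png"], "Content-Length": "12345"}
print(normalize_metadata(meta))
# {'Content-Type': ['application/pdf', 'image/png'], 'Content-Length': ['12345']}
```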

Mandalka commented 6 years ago

I think the error occurs / occurred when a file has embedded resources and more than one of them (or the original file) has the extracted Content-Length field, since the former code mapped it to a field with the suffix _i, which is a single-valued integer.

It is no longer single-valued now.

Please test tomorrow's package/release with the PDF and reopen this issue if the problem persists.

Thank you for all the info!

bhelou commented 6 years ago

Thanks! Your fix worked.