Closed bhelou closed 6 years ago
Thanks for the bug report and analysis, which helps a lot.
I think Tika returns multiple values when a document contains multiple or different embedded files, such as embedded image files in a PowerPoint document or PDF, so I'll change the mapping of all Tika fields to multi-valued field types.
I think the error occurred when there were embedded resources and more than one of them (or the original file) had the extracted field Content-Length, since the former code mapped it to a field with the suffix _i, which was a single-valued integer.
It is no longer single-valued.
Please test tomorrow's package/release with the PDF and reopen the issue if there is still a problem.
Thank you for all the information!
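A minimal sketch of what mapping Tika metadata to multi-valued fields could look like. The function name to_multivalued, the int_fields parameter, and the _is suffix are assumptions for illustration (only _i and content_type_ss appear in this thread); the project's actual mapping may differ.

```python
def to_multivalued(metadata, int_fields=("Content-Length",)):
    """Hypothetical helper: map Tika metadata to multi-valued field names.

    Every value is wrapped in a list, so a field holds one or many values
    without changing its type. The suffixes are assumptions:
    _is for multi-valued integers, _ss for multi-valued strings.
    """
    data = {}
    for name, value in metadata.items():
        # Tika may return a single value or a list of values for one key.
        values = value if isinstance(value, list) else [value]
        key = name.lower().replace("-", "_")
        if name in int_fields:
            data[key + "_is"] = [int(v) for v in values]
        else:
            data[key + "_ss"] = [str(v) for v in values]
    return data


# Metadata shaped like a document with one embedded image.
example = to_multivalued({
    "Content-Type": ["application/pdf", "image/png"],
    "Content-Length": ["1024", "2048"],
})
```

Wrapping single values in a list as well keeps the field type stable, so the same code path handles documents with and without embedded resources.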
Thanks! Your fix worked
Hi,
Some of my PDF files are corrupted in a weird way. When Tika indexes them, instead of classifying them as MIME type application/pdf (as read from parsed["metadata"]["Content-Type"] in enhance_extract_text_tika_server.py), it classifies them as multiple types (for example, ['application/pdf', 'application/pdf', 'image/png', 'image/png']). Unfortunately, I can't post the PDF files because they are confidential :( I am also not sure what causes them to have multiple MIME types.
This then causes problems downstream in the enhance_contenttype_group plugin, which throws an error because data['content_type_ss'] is a list (and so doesn't have the startswith method). I've fixed this error by checking in enhance_extract_text_tika_server.py whether parsed["metadata"]["Content-Type"] is a list.
The above fixes the error in the enhance_contenttype_group plugin, but the PDF file still doesn't get indexed. Solr doesn't throw an error because the Content-Length category is a list. This led to a second fix, and with it Solr indexes the corrupted PDF file.
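A minimal sketch of the kind of check described above, for illustration only; it is not the exact code from this report. The helper names as_list, unique_in_order, and any_startswith are hypothetical, and the sample metadata simply mirrors the example values quoted earlier.

```python
def as_list(value):
    """Return value as a list: [] for None, a copy for list/tuple, else [value]."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def unique_in_order(values):
    """Drop duplicates while keeping the first-seen order."""
    seen = set()
    result = []
    for v in values:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result


def any_startswith(content_types, prefix):
    """True if any content type (a string or a list of strings) starts with prefix."""
    return any(ct.startswith(prefix) for ct in as_list(content_types))


# Metadata shaped like the multi-type result described in this report.
parsed = {
    "metadata": {
        "Content-Type": ["application/pdf", "application/pdf", "image/png", "image/png"]
    }
}
content_types = unique_in_order(as_list(parsed["metadata"]["Content-Type"]))
is_pdf = any_startswith(content_types, "application/pdf")
```

Checking every value rather than calling startswith on the field directly avoids the error for both a plain string and a list, at the cost of treating a document with embedded images as matching several groups.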
Would my proposed fix cause any issues with the rest of OSS? For instance, PowerPoint files with multiple slides and images have multiple MIME types (which causes an error with the enhance_contenttype_group plugin, but Solr still indexes these PowerPoint files). I've attached a PowerPoint file as an example: test.pptx.
Do you have any better ideas on how to address this issue?
Thank you for the OSS software! Bassam