bhelou closed 6 years ago
Thank you for the bug report!
The Tika content type is now mapped to the field content_type_ss and no longer to the former field content_type, which was a single-valued field, and I forgot to change this in the blacklist function in etl.py.
Blacklisting of plugins should now work with multi-valued content types, too. The fix will be part of the new packages tomorrow or Friday.
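For readers following along, here is a minimal sketch of why the field rename matters. It is not the actual code from etl.py: the function name, its signature, and the sample data are assumptions made for illustration. Only the field names content_type and content_type_ss come from this thread.

```python
def is_blacklisted_for_contenttype(data, prefixes, field="content_type_ss"):
    """Return True if any content type stored in data[field] starts with
    one of the blacklisted prefixes. Hypothetical helper, not the OSS code."""
    values = data.get(field, [])
    # Accept both a single string and a list of strings (multi-valued field).
    if isinstance(values, str):
        values = [values]
    return any(value.startswith(prefix)
               for value in values
               for prefix in prefixes)


# Assumed example data: the content type lives in the multi-valued field.
prefixes = ["application/vnd.openxmlformats-officedocument"]
data = {
    "content_type_ss": [
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ]
}

# Looking up the old field name finds nothing, so the safeguard never fires.
old_check = is_blacklisted_for_contenttype(data, prefixes, field="content_type")
# Looking up the new field name matches the prefix as intended.
new_check = is_blacklisted_for_contenttype(data, prefixes)
```

In this sketch the lookup against the old field silently returns no match, which is consistent with the symptom described in the bug report: the plugin is never treated as blacklisted.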
Thanks! Your fix worked
Hi, is there a way to prevent text files extracted from an HTML document from being indexed as separate files?
The unwanted file id looks like this:
'/good_file.html/unwanted.txt', where '/good_file.html' is the id of the file we want to index
Thanks! Yoann
Hi,
When OSS processes a file, it checks whether it is a zip file (via the enhance_zip plugin). If it is, the file gets unzipped and all contained files get indexed.
Office documents are zipped files, but it is undesirable to have them unzipped because then each document could get unzipped to dozens of useless files that clutter up the index. OSS realizes this. A safeguard is put in /etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype-prefix: if a file is an office document, then don't unzip it.
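To illustrate the idea of a content type prefix blacklist, here is a rough sketch. The helper names, the example prefixes, and the handling of blank lines and `#` comments are assumptions of mine. The real contents and parsing of blacklist-contenttype-prefix may differ.

```python
def load_prefixes(lines):
    """Parse blacklist lines into a list of prefixes.
    Assumption: blank lines and lines starting with '#' are ignored."""
    prefixes = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            prefixes.append(line)
    return prefixes


def should_unzip(content_type, prefixes):
    """A file is unzipped only if its content type matches no blacklisted prefix."""
    return not any(content_type.startswith(prefix) for prefix in prefixes)


# Assumed example entries; these are real MIME type prefixes for office
# formats, but whether they match the shipped blacklist file is a guess.
example_lines = [
    "# office documents are zip containers, do not unzip them",
    "application/vnd.openxmlformats-officedocument",
    "application/vnd.oasis.opendocument",
    "",
]
prefixes = load_prefixes(example_lines)

unzip_plain_zip = should_unzip("application/zip", prefixes)
unzip_docx = should_unzip(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    prefixes,
)
```

With these assumed entries, a plain zip archive would still be unzipped, while a .docx file would be skipped.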
However, the safeguard fails. There is a bug in the is_plugin_blacklisted_for_contenttype function in etl.py. The function checks the content type:
This doesn't work because when the content type is extracted in Tika (in enhance_extract_text_tika_server.py), data is updated in the following way:
So the bug fix is to change
to
Thank you for the OSS software! Bassam