Extracted text data must be in utf-8

mjordan commented 2 months ago

Within the create_media() function, extracted text data must be encoded as utf-8:

 5135         # extracted_text media must have their field_edited_text field populated for full text indexing.
 5136         if media_type == "extracted_text":
 5137             if check_file_exists(config, filename):
 5138                 media_json["field_edited_text"] = list()
 5139                 if os.path.isabs(filename) is False:
 5140                     filename = os.path.join(config["input_dir"], filename)
 5141                 extracted_text_file = open(filename, "r", -1, "utf-8")
 5142                 media_json["field_edited_text"].append(
 5143                     {"value": extracted_text_file.read()}
 5144                 )
 5145             else:
 5146                 logging.error("Extracted text file %s not found.", filename)

If it is not, the read() call on line 5143 raises exceptions like UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 922: invalid start byte.

A short-term fix is to catch this error and skip loading the text.
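A minimal sketch of what that short-term fix could look like. The helper name read_extracted_text() is hypothetical and not part of Workbench. It assumes that logging the problem and returning None is acceptable, so the caller can leave field_edited_text unpopulated:

```python
import logging


def read_extracted_text(filename):
    """Return the file's contents, or None if it is not valid UTF-8.

    Hypothetical helper for illustration; not an existing Workbench function.
    """
    try:
        # The context manager also makes sure the file handle is closed.
        with open(filename, "r", encoding="utf-8") as extracted_text_file:
            return extracted_text_file.read()
    except UnicodeDecodeError as e:
        logging.error(
            "Extracted text file %s is not valid UTF-8 (%s); text not loaded.",
            filename,
            e,
        )
        return None
```

In create_media(), the caller would then append to media_json["field_edited_text"] only when the return value is not None.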

mjordan commented 2 months ago

This also applies to media track files.

mjordan commented 3 days ago

It would be good to validate files as UTF-8, both in --check and in non-check runs. Maybe provide a config setting so users can decide which files (based on media use tid?) are validated.
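One way the validation itself might work, as a rough sketch. The function names is_valid_utf8() and find_non_utf8_files() are hypothetical, and how the list of files to validate is chosen (for example by media use tid) is left out because that setting does not exist yet. The file is read in chunks through an incremental decoder so large files are not loaded into memory at once:

```python
import codecs


def is_valid_utf8(path, chunk_size=65536):
    """Return True if the file at `path` decodes cleanly as UTF-8.

    Hypothetical helper for illustration; not an existing Workbench function.
    """
    # An incremental decoder copes with multi-byte characters that
    # happen to be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
            # final=True raises if the file ends in the middle of a character.
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_files(paths):
    """Return the subset of `paths` that are not valid UTF-8."""
    return [p for p in paths if not is_valid_utf8(p)]
```

A --check run could report everything returned by find_non_utf8_files(), while a non-check run could use is_valid_utf8() to decide whether to skip a file.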

mjordan / islandora_workbench

Extracted text data must be in utf-8 #799