The requested feature would work like the existing OCR handling, where the contents of the extracted text file are copied into a text field on the media (original_file) for indexing purposes.
USE CASE: I have an Islandora 7 repository with a very large amount of textual content in TIF file format - each page (TIF) has an associated HOCR file. I want to migrate the pages WITH their HOCR into Islandora.
I'd like to be able to batch in the HOCR files (either as part of the node-creating CSV or as an add_media job) and have them attached to the appropriate file field on the media object.
Ideally I could pull these HOCR files directly from the Islandora 7 datastream by URL, as I already do for the OBJ (TIF) files.
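To make the ask more concrete, here is a rough sketch of the input CSV I have in mind, with one row per page and both the OBJ and HOCR datastream URLs. The `hocr` column name, the example hostname, and the PIDs are hypothetical placeholders, not an existing feature or real data; the URL pattern assumes the usual Islandora 7 `/islandora/object/{PID}/datastream/{DSID}/view` path.

```python
import csv
import io
from urllib.parse import quote


def datastream_url(base_url: str, pid: str, dsid: str) -> str:
    """Build an Islandora 7 datastream URL (assumed pattern) for a PID and DSID."""
    # Percent-encode the PID so the colon in e.g. "islandora:123" is URL-safe.
    return f"{base_url.rstrip('/')}/islandora/object/{quote(pid, safe='')}/datastream/{dsid}/view"


def build_rows(base_url: str, pids: list[str]) -> list[dict[str, str]]:
    """Return one row per page, pairing the OBJ (TIF) URL with the HOCR URL."""
    return [
        {
            "id": pid,
            "file": datastream_url(base_url, pid, "OBJ"),
            # Hypothetical column: the field this issue asks to have supported.
            "hocr": datastream_url(base_url, pid, "HOCR"),
        }
        for pid in pids
    ]


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "file", "hocr"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder host and PIDs, for illustration only.
    print(rows_to_csv(build_rows("https://i7.example.org", ["islandora:101", "islandora:102"])))
```

With something like this, each page row would carry the URL of its HOCR datastream next to the URL of its TIF, and the HOCR file could be fetched and attached to the same media in one pass.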
Hopefully this is a clear definition of the ask - I'm happy to answer questions or add more details if requested.
This work is in support of the plans to provide search term highlighting in Mirador started by @alxp https://github.com/Islandora/islandora/pull/897
And continued by @patdunlavey here: https://github.com/Islandora/islandora_mirador/issues/17#issuecomment-1483402945