mjordan / islandora_workbench

A command-line tool for managing content in an Islandora 2 repository
MIT License

Ingesting HOCR derivatives as a media attachment #592

Open dmer opened 1 year ago

dmer commented 1 year ago

This work supports the plan to provide search term highlighting in Mirador, started by @alxp here: https://github.com/Islandora/islandora/pull/897

And continued by @patdunlavey here: https://github.com/Islandora/islandora_mirador/issues/17#issuecomment-1483402945

The HOCR handling would function like the existing OCR handling, where the contents of the extracted text file are copied into a text field on the media (original_file) for indexing purposes.

USE CASE: I have an Islandora 7 repository with a very large amount of textual content in TIF format. Each page (TIF) has an associated HOCR file. I want to migrate the pages WITH their HOCR into Islandora.

I'd like to be able to batch in the HOCR files (either as part of the node-creating CSV or as an add_media job) and have them attached to the appropriate file field on the media object.

Ideally I could pull these HOCR files directly from the Islandora 7 datastream with a URL, as I do for the OBJ (TIF) files.
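
As a rough illustration of the input such a job might take, the sketch below builds the rows of an add_media-style CSV that points at HOCR datastream URLs. Several things here are assumptions rather than confirmed behaviour: the Islandora 7 URL pattern (/islandora/object/PID/datastream/DSID/view), the "HOCR" datastream ID, the `node_id` and `file` column names, the example host and the PID-to-node-ID mapping. It only shows the shape of the data and does not imply that Workbench currently handles HOCR media.

```python
import csv
import io
from urllib.parse import quote


def datastream_url(base_url, pid, dsid):
    """Build an Islandora 7 datastream URL.

    Assumed pattern: {base}/islandora/object/{pid}/datastream/{dsid}/view
    """
    base = base_url.rstrip("/")
    # Keep the colon in the PID readable (e.g. "islandora:100").
    return f"{base}/islandora/object/{quote(pid, safe=':')}/datastream/{dsid}/view"


def build_add_media_rows(pid_to_node_id, base_url, dsid="HOCR"):
    """Return one row per page, mapping the target node to its HOCR URL.

    pid_to_node_id is a hypothetical mapping from Islandora 7 PIDs to the
    node IDs of the already-migrated pages.
    """
    return [
        {"node_id": node_id, "file": datastream_url(base_url, pid, dsid)}
        for pid, node_id in pid_to_node_id.items()
    ]


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a node_id,file header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["node_id", "file"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical host and mapping, for illustration only.
    rows = build_add_media_rows({"islandora:100": 12}, "https://i7.example.org")
    print(rows_to_csv(rows))
```

Passing dsid="OBJ" instead would produce the equivalent URLs for the TIF files, which is the symmetry the request is after.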

Hopefully this is a clear definition of the ask - I'm happy to answer questions or add more details if requested.

mjordan commented 1 year ago

Related - #572.

dmer commented 1 year ago

Hi Mark - just checking in on this. I'm expecting to need this within a month or so. I can definitely volunteer some testing.