scientist-softserv / britishlibrary

Recombobulator to enable OCR text for full pdf to be downloaded #446

Open grahamjevon opened 1 year ago

grahamjevon commented 1 year ago

Since the IIIF print deployment, the OCR text of a single PDF page can be downloaded as .txt, .json, or .xml.

I expect that most users who want the OCR text would want it for the whole PDF file, and downloading one page at a time would be onerous. If there proves to be user demand for the OCR text, it would be helpful to let users download it as .txt, .json, or .xml for all pages of a PDF or all pages of a work.

In reply to a Slack question about whether this is currently possible, Jeremy explained: 'That is presently not a feature; as we don’t have a “recombobulator” of the constituent parts'.
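
To illustrate what a "recombobulator" might do for the plain-text case, here is a minimal Python sketch. It is not taken from the application: the function name `recombobulate_text`, the `(page_number, text)` input shape, and the form-feed page separator are all assumptions made for illustration. It only shows the idea of putting per-page OCR text back into page order and joining it into one downloadable document.

```python
from typing import Iterable, Tuple

# Assumed separator between pages: a form feed, a common plain-text page break.
PAGE_BREAK = "\f"


def recombobulate_text(pages: Iterable[Tuple[int, str]]) -> str:
    """Join per-page OCR text into one string, ordered by page number.

    `pages` is a hypothetical input: (page_number, ocr_text) pairs, one per
    child page of the PDF, in any order.
    """
    # Sort by page number so the combined text follows the original PDF.
    ordered = sorted(pages, key=lambda page: page[0])
    # Trim surrounding whitespace on each page, then join with the separator.
    return PAGE_BREAK.join(text.strip() for _, text in ordered)


# Example: pages arrive out of order and are stitched back together.
combined = recombobulate_text([(2, "Second page text\n"), (1, "First page text\n")])
```

The .json and .xml outputs would need format-aware merging rather than simple concatenation, so this sketch covers only the .txt download.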

grahamjevon commented 1 year ago

Because #475 hides the PDF page child works from public view, the ability to download the OCR of even an individual page is no longer available. This probably strengthens the case for an option to download the OCR of a whole PDF from the parent page.