sul-dlss / common-accessioning

Suite of robots that handle the tasks of accessioning digital objects

OCR: Test splitting of full doc XML file into page level XML docs #1148

Closed peetucket closed 3 months ago

peetucket commented 4 months ago

Currently, ABBYY produces either a full document XML OCR doc OR page level XML OCR docs. We'd like to produce a full document XML doc, but the current delivery infrastructure in sul-embed, etc. requires page level XML OCR docs.

So we should test to see if we can take a full document XML doc and programmatically split it into page level XML docs that will be correctly understood by our current delivery infrastructure.

So:

  1. Take a full doc XML.
  2. Split into page level XML.
  3. Accession full set of docs.
  4. Test search and hit highlighting to see if split page level docs work as expected.
peetucket commented 4 months ago

Using this as a test: https://argo.stanford.edu/view/druid:ht121fv8052 since I was given a full object XML doc.

peetucket commented 4 months ago

@andrewjbtw do you know if the page level OCR filenames are important,

e.g. should they look like the ones in that object above: ht121fv8052_00_0015.xml etc etc

with a _00_ in the middle and a page number that is always four digits?
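For reference, that observed pattern could be generated with a one-line helper (hypothetical; the druid prefix and four-digit zero padding are inferred from the example filenames, not from any documented convention):

```ruby
# Hypothetical helper matching the observed pattern: druid, a literal
# "_00_", and a zero-padded four-digit page number.
def page_xml_filename(druid, page_number)
  format('%s_00_%04d.xml', druid, page_number)
end

page_xml_filename('ht121fv8052', 15) # => "ht121fv8052_00_0015.xml"
```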

andrewjbtw commented 4 months ago

I suspect they are because the XML will be used to superimpose the highlights on specific images and I don't know how else the connection will be made. But the pattern is not guaranteed to be like what this object has for its filenames. Does the full document XML have the image file names in it or are you having to supply those?

peetucket commented 4 months ago

I don't see any filename references in the full document XML for the pages. Basically, my test ruby script reads the full XML doc, then finds each <page> node to split off and create a new file from; the filename of each is currently based on the full document XML filename plus the page number in some way. I am guessing that the _00_0001 type pattern shown in that object is the DPG style and just matches the filenames of the individual images that are part of the object.
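The split described above could be sketched roughly like this with stdlib REXML. This is a minimal sketch, not the actual test script: the `<page>` element name comes from the thread, but the wrapper structure (copying the root element and its attributes around each page) and the `_00_%04d` filename pattern are assumptions based on the example object.

```ruby
require 'rexml/document'

# Sketch: split a full-document ABBYY XML string into page-level
# (filename, xml_string) pairs. Assumes the document root holds one
# <page> child per image; the "_00_%04d" naming is copied from the
# example object and may not hold for other DPG naming conventions.
def split_pages(xml_string, base_name)
  doc = REXML::Document.new(xml_string)
  pages = doc.root.elements.to_a.select { |el| el.name == 'page' }
  pages.each_with_index.map do |page, i|
    # Rebuild a standalone document around each <page>, copying the
    # root element's name and attributes (including xmlns declarations)
    # so the page-level file parses the same way as the full doc.
    out = REXML::Document.new
    out << REXML::XMLDecl.new('1.0', 'UTF-8')
    root = out.add_element(doc.root.name)
    doc.root.attributes.each { |name, value| root.add_attribute(name, value) }
    root.add_element(page.deep_clone)
    xml = +''
    out.write(xml)
    [format('%s_00_%04d.xml', base_name, i + 1), xml]
  end
end
```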

Which I think raises the question: if the input to an accession is a single PDF, which is then run through OCR to produce a single XML file, then what is producing the page level PDFs we see in that object, and how do they get named?

andrewjbtw commented 4 months ago

In this case, I ran ABBYY on a set of individual images twice:

In theory, ABBYY could read and incorporate the page-level image names because it processes those files. But in practice there may not be a setting where we could get it to insert those into the document-level XML.

DPG has changed naming conventions over the years, plus not everything will be consistently named, so we can't rely on a pattern.

peetucket commented 4 months ago

Ok...I've got the full doc XML split into page level XML files. We can name them whatever we want (and the filename is easy to do as long as we can determine the pattern from the input doc filename). So the next step is to try to accession these? I'm happy to put them somewhere on stage.

But I guess I am also trying to remind myself of the use case for needing to do this in the first place. Is that use case a single PDF input into ABBYY, which then generates a single OCR XML out? And it also generates the page level PDFs, but not the page level OCR XML?

andrewjbtw commented 4 months ago

The use case is to support the text search that we provide for digitized books and images that are processed as individual images. This is the part of the implementation where we're trying to reproduce the Goobi-ABBYY process we already have. In that process, ABBYY is given individual images and then produces individual XML and PDF files for each image. Then a script in Goobi stitches the PDFs into the full document PDF.

We could still do that in the new pipeline, but the PDF we get out of that process seriously lacks accessibility features since there's no view of the structure when stitching together individual files. That led us to look at "individual image --> document-level output," which gets us a better document PDF. The catch is that our text search seems to expect only individual image-level OCR files. If we can split the XML, then we can maintain that functionality. Alternatively, we would need to modify the search implementation, which maybe we should anyway, but which expands the scope of this work.

For PDF input, it's not clear that the XML output will be put to use. Currently, we don't have an interface to interact with it, and we have not yet deployed the native PDF search, which would rely on the text-in-PDF and not the text-in-XML.

peetucket commented 3 months ago

OK, split complete. This google drive folder https://drive.google.com/drive/u/1/folders/1IO643ojB40V_cE9XxFIVBfPBSz6jcwZz has:

  1. The full document PDF
  2. The page level XML OCR files (split from the full document OCR file by my test ruby script).

So should we try to accession these sets of files to see if full text search still works as expected?

andrewjbtw commented 3 months ago

Yes - I'll pick those up and accession today.

andrewjbtw commented 3 months ago

Looks like the highlighting in search results is working:

https://sul-purl-stage.stanford.edu/cs799sz8450 https://argo-stage.stanford.edu/view/druid:cs799sz8450

peetucket commented 3 months ago

well that's promising

peetucket commented 3 months ago

Closing this as complete