Closed peetucket closed 3 months ago
Examining the split XML files from the ocrWF, it appears the <Description> node and its subnodes are only in the first file. Nokogiri appears to be removing them from the main doc after they are added to the first page. Not sure if this is important or not, but fixing by duping the nodes so that they are added to each page-level XML file.
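A minimal sketch of the move-versus-copy behaviour described above. To keep it runnable with only Ruby's bundled libraries, it uses REXML as a stand-in for Nokogiri: both reparent a node that is added elsewhere rather than copying it, and Nokogiri's `Node#dup` plays the role of `deep_clone` here. The sample XML and the `split_pages` and `description_flags` helpers are hypothetical and are not the workflow's actual code.

```ruby
require 'rexml/document'

# Hypothetical miniature of a full-document ALTO file with two pages.
SOURCE = <<~XML
  <alto>
    <Description><MeasurementUnit>pixel</MeasurementUnit></Description>
    <Layout>
      <Page ID="p1"/>
      <Page ID="p2"/>
    </Layout>
  </alto>
XML

# Builds one page-level document per <Page>.
# With copy_shared: false the shared <Description> node itself is added to the
# page document, which removes it from the source document. With
# copy_shared: true a deep copy is added and the source keeps its node.
def split_pages(xml, copy_shared:)
  source = REXML::Document.new(xml)

  source.root.get_elements('Layout/Page').map do |page|
    page_doc = REXML::Document.new
    root = page_doc.add_element('alto')

    # Look the node up again for every page, as a splitter typically would.
    description = source.root.elements['Description']
    root.add(copy_shared ? description.deep_clone : description) if description

    root.add_element('Layout').add(page.deep_clone)
    page_doc
  end
end

# One boolean per page document: does it contain a <Description>?
def description_flags(docs)
  docs.map { |doc| !doc.root.elements['Description'].nil? }
end

moved  = description_flags(split_pages(SOURCE, copy_shared: false))
copied = description_flags(split_pages(SOURCE, copy_shared: true))

puts moved.inspect   # => [true, false]
puts copied.inspect  # => [true, true]
```

Without the copy, only the first page document ends up with a `<Description>`, which matches the symptom seen in the split files.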
It may be unrelated, but I'm also noticing that the workflow-generated ALTO files have the <Page> element at the XPath /alto/Layout/Page, whereas the older ones have it at /alto/Page. Since it seems to be finding words on the first page, maybe this isn't an issue though?
I'm also noticing that the /alto/Styles element is present in the workflow-generated OCR files for the first page, but is missing from subsequent pages.
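The observations above (missing Description, missing Styles, and the differing Page location) could be checked across downloaded page-level files with something like the following. This is only a sketch: the `child_named`, `alto_structure`, and `structure_differences` helpers and the two sample strings are hypothetical, and elements are matched by local name so the check does not depend on how the ALTO namespace is declared.

```ruby
require 'rexml/document'

# Returns the first direct child element with the given local name, or nil.
def child_named(element, name)
  element.children.find { |child| child.is_a?(REXML::Element) && child.name == name }
end

# Summarises the structural points of interest for one ALTO XML string.
def alto_structure(xml)
  root = REXML::Document.new(xml).root
  layout = child_named(root, 'Layout')
  {
    description: !child_named(root, 'Description').nil?,
    styles: !child_named(root, 'Styles').nil?,
    page_under_root: !child_named(root, 'Page').nil?,
    page_under_layout: !layout.nil? && !child_named(layout, 'Page').nil?
  }
end

# Lists the keys on which two ALTO strings differ structurally.
def structure_differences(first_xml, second_xml)
  first = alto_structure(first_xml)
  second = alto_structure(second_xml)
  first.keys.reject { |key| first[key] == second[key] }
end

# Hypothetical stand-ins for the first and second page-level files.
page_one = '<alto><Description/><Styles/><Layout><Page ID="p1"/></Layout></alto>'
page_two = '<alto><Layout><Page ID="p2"/></Layout></alto>'

differences = structure_differences(page_one, page_two)
puts differences.inspect  # => [:description, :styles]
```

In practice the two strings would come from `File.read` on the downloaded page-level XML files, for example a page from the new object and the same page from the old one.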
Note when testing:
There may be an issue with the XML page-splitting. I’m not getting search results beyond the first page.
Object created with new ocrWF: https://argo-qa.stanford.edu/view/druid:hj614hq2225 (not working)
Same object created pre workcycle: https://argo-qa.stanford.edu/view/druid:sk191fb3287 (working)
Go to the PURL page for each and use the search function for 'computer'. New object has results on page 1 only, old object has results on lots of pages:
New: https://sul-purl-stage.stanford.edu/hj614hq2225
Old: https://sul-purl-stage.stanford.edu/sk191fb3287
Download page level XML files and compare to see what might be different.
Possible causes: an issue in how we split the full-document XML into page-level files, or an issue with the sul-embed search (though it works with old objects).
Another example object that had page level XML created by a splitting script during analysis phase that works: https://github.com/sul-dlss/common-accessioning/issues/1148#issuecomment-1989022826
See also #1279 (could be related)