Figure out why page level search results do not work beyond first page

peetucket commented 3 months ago

There may be an issue with the XML page-splitting. I’m not getting search results beyond the first page.

Object created with new ocrWF (not working): https://argo-qa.stanford.edu/view/druid:hj614hq2225
Same object created pre workcycle (working): https://argo-qa.stanford.edu/view/druid:sk191fb3287

Go to the PURL page for each and use the search function for 'computer'. New object has results on page 1 only, old object has results on lots of pages:

New: https://sul-purl-stage.stanford.edu/hj614hq2225
Old: https://sul-purl-stage.stanford.edu/sk191fb3287

Download page level XML files and compare to see what might be different.
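
One way to do that comparison is to reduce each file to the set of element paths it contains and diff the two sets. The following is a minimal Python sketch, not code from the ocrWF robots; the function names are hypothetical, and namespaces are stripped so that files with and without an ALTO namespace can be compared directly.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def element_paths(xml_text):
    """Return the set of slash-separated element paths found in one document."""
    paths = set()

    def walk(element, prefix):
        path = f"{prefix}/{local_name(element.tag)}"
        paths.add(path)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def compare_structure(working_xml, broken_xml):
    """Report element paths that appear in only one of the two documents."""
    working = element_paths(working_xml)
    broken = element_paths(broken_xml)
    return {
        "only_in_working": sorted(working - broken),
        "only_in_broken": sorted(broken - working),
    }
```

Passing the contents of the same page from the old and new objects to `compare_structure` should surface structural differences such as a missing header section or a differently nested page element.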

Possible issue in how we split full document XML into page level. Possible issue with sul-embed search (though it works with old objects).

Another example object that had page level XML created by a splitting script during analysis phase that works: https://github.com/sul-dlss/common-accessioning/issues/1148#issuecomment-1989022826

See also #1279 (could be related)

peetucket commented 3 months ago

Examining the split XML files from the ocrWF, it appears the <Description> node and its subnodes are only in the first file. Nokogiri appears to be removing them from the main doc after they are added to the first page. Not sure if this is important or not, but I'm fixing it by duping the nodes to be sure they are added to each page level XML file.
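
The idea of copying the shared nodes into every page file can be sketched as follows. This is an illustration in Python with `xml.etree.ElementTree`, not the robot's Ruby/Nokogiri code; `split_pages` is a hypothetical name, and the sketch assumes the header sections are direct children of the root and the pages sit under a `Layout` element. The deep copy plays the role of the dup described above.

```python
import copy
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def split_pages(full_xml):
    """Split a multi-page document into one XML string per page.

    Every top-level child other than Layout (Description, Styles, ...) is
    deep-copied into each page document, so later pages carry the same
    header sections as the first page.
    """
    root = ET.fromstring(full_xml)
    header = [child for child in root if local_name(child.tag) != "Layout"]
    layout = next(
        (child for child in root if local_name(child.tag) == "Layout"), None
    )
    if layout is None:
        raise ValueError("no Layout element found under the root")

    pages = []
    for page in layout:
        if local_name(page.tag) != "Page":
            continue  # ignore anything under Layout that is not a page
        page_root = ET.Element(root.tag, root.attrib)
        for node in header:
            # Copy rather than reuse, so each page owns its header nodes.
            page_root.append(copy.deepcopy(node))
        page_layout = ET.SubElement(page_root, layout.tag, layout.attrib)
        page_layout.append(copy.deepcopy(page))
        pages.append(ET.tostring(page_root, encoding="unicode"))
    return pages
```

Each returned string is a standalone document containing the copied header sections plus exactly one page.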

edsu commented 3 months ago

It may be unrelated, but I'm also noticing that the workflow generated ALTO files have the <Page> element at the XPath /alto/Layout/Page whereas the older ones have it at /alto/Page. Since it seems to be finding words on the first page, maybe this isn't an issue though?

I'm also noticing that the /alto/Styles element is present in the workflow generated OCR files for the first page, but is missing from subsequent pages.
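
Both observations can be checked across all page files with a small per-file summary. The sketch below is a hypothetical Python helper (`summarize_alto` is not part of any project named here); it reports whether `Description` and `Styles` are present at the top level and at which path(s) `Page` elements occur, ignoring namespaces.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def summarize_alto(xml_text):
    """Summarize header sections and Page locations for one page-level file."""
    root = ET.fromstring(xml_text)
    top_level = [local_name(child.tag) for child in root]
    page_paths = set()

    def walk(element, path):
        for child in element:
            child_path = f"{path}/{local_name(child.tag)}"
            if local_name(child.tag) == "Page":
                page_paths.add(child_path)
            walk(child, child_path)

    walk(root, "/" + local_name(root.tag))
    return {
        "has_description": "Description" in top_level,
        "has_styles": "Styles" in top_level,
        "page_paths": sorted(page_paths),
    }
```

Running this over every page file of an object and printing the summaries side by side makes it easy to see whether only the first page carries the header sections.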

peetucket commented 3 months ago

Note when testing:

The publish step needs to run to ensure content search gets the latest XML files. So FTPing styles to the stacks is not sufficient; you need to republish too. Hopefully, if we can correct the cocina-updater in ocrWF, we can just re-OCR things over and over again using the new Argo button after adjusting the split-xml robot, instead of doing things manually.
Even then, changes to content search may not occur immediately due to delayed Solr commits. There is a way to manually trigger an immediate commit for a druid in content search, and we may need to do this. See the bottom of the README: https://github.com/sul-dlss/content_search

peetucket commented 3 months ago

Fixed by https://github.com/sul-dlss/common-accessioning/pull/1284 and https://github.com/sul-dlss/content_search/pull/530

sul-dlss / common-accessioning

Figure out why page level search results do not work beyond first page #1278