Archilegt closed this issue 1 year ago
Thank you @Archilegt for finding this. @mlichtenberg, any thoughts on what causes this problem?
What I can say is that the OCR text (which is what is fed into the GN services) was badly out of sync with the page images. What I cannot say with certainty is why.
It appears that the item (images, OCR, PDFs) was added to Internet Archive in 2010, and within a few days the images were replaced with a new set. I think that in the interim between the original upload and the corrected upload, BHL grabbed all of the images and text. And I think that when the updates were made at IA, the updated text was never brought into BHL. That is my best guess.
There are some processes in place today that are intended to keep everything in sync when images are updated. Unfortunately, the problem with this item originated 13 years ago. It was paginated in BHL in 2013, but the person performing the pagination apparently did not notice the problem, and it hadn't been touched since.
I have now ingested the correct set of OCR for this item into BHL, and the names contained within the text will be reindexed overnight tonight.
thanks @mlichtenberg! I guess we can close this one now
https://www.biodiversitylibrary.org/page/30747960#page/865/mode/1up looks correct now, thanks @mlichtenberg
Several names are matched on this blank page: https://www.biodiversitylibrary.org/page/30747960
Names matched on a few other pages seem unrelated to the content.
Please reprocess the full volume.
Most importantly, where do the ghosts come from?