rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library
http://biostor.org

Future interface ideas #63

Open rdmpage opened 7 years ago

rdmpage commented 7 years ago

For a very different interface to historical texts see the UK Medical Heritage Library (https://ukmhl.historicaltexts.jisc.ac.uk/home).

[Screenshot: historical_texts (https://cloud.githubusercontent.com/assets/83306/26043054/4c54cf68-3932-11e7-944a-50d32472ce51.png)]

trosesandler commented 7 years ago

Rod

Thanks for sharing this site! I was familiar with this repository but this must be a new UI? There are some great ideas we could use for BHL including presentation of illustrations, full text search, etc.

Trish


rdmpage commented 7 years ago

@trosesandler Yes, I guess it would also be useful to have the NDSR folks have a play and see what they think.

trosesandler commented 7 years ago

Yep, that's exactly who I forwarded the link to - some great ideas for their work! I also forwarded it to the BHL tech team since we are currently in the process of implementing full-text search and we still need to find a way to search the image metadata we've been gathering via Flickr and Science Gossip.

Trish


rdmpage commented 7 years ago

Also relevant is Making Scanned Content Accessible Using Full-text Search and OCR

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.

The traditional approach to this problem has been to invest in cataloging and transcription but those services are expensive, particularly as flat budgets are devoted to the race to digitize faster than physical media degrades. This is obviously the right call from a preservation perspective but it still leaves us looking for less expensive alternatives.

OCR is the obvious solution for extracting machine-searchable text from an image but the quality rates usually aren’t high enough to offer the text as an alternative to the original item. Fortunately, we can hide OCR errors by using the text to search but displaying the original image to the human reader. This means our search hit rate will be lower than it would with perfect text but since the content in question is otherwise completely unsearchable anything better than no results will be a significant improvement.

Since November 2013, the World Digital Library has offered combined search results similar to what you can see in the screenshot below:

[Image: adams080414image1]

This system is entirely automated, uses only open-source software and existing server capacity, and provides an easy process to improve results for items as resources allow.

How it Works: From Scan to Web Page

Generating OCR Text

As we receive new items, any item which matches our criteria (currently books, journals and newspapers created after 1800) will automatically be placed in a task queue for processing. Each of our existing servers has a worker process which uses idle capacity to perform OCR and other background tasks. We use the Tesseract OCR engine with the generic training data for each of our supported languages to generate an HTML document using hOCR markup.
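As a rough sketch of that step (not the actual WDL worker code; the file layout and language code are assumptions), a worker might invoke Tesseract like this:

```python
# Sketch of a worker task that OCRs one page image to hOCR with Tesseract.
# Paths, the language code and the error handling are illustrative assumptions.
import subprocess
from pathlib import Path

def ocr_page_to_hocr(image_path: Path, out_dir: Path, lang: str = "eng") -> Path:
    out_base = out_dir / image_path.stem   # Tesseract appends the extension itself
    subprocess.run(
        ["tesseract", str(image_path), str(out_base), "-l", lang, "hocr"],
        check=True,
    )
    # Recent Tesseract versions write <out_base>.hocr when the hocr config is used
    return out_base.with_suffix(".hocr")
```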

The hOCR document has HTML markup identifying each detected word and paragraph and its pixel coordinates within the image. We archive this file for future usage but our system also generates two alternative formats for the rest of our system to use:

- a plain-text version of each page, which is what is fed to the search index, and
- a word-coordinate list recording the bounding box of every recognized word, which the front end later uses to highlight matches on the page image.
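For illustration, here is a minimal parser in that spirit; the ocrx_word class and the "bbox x0 y0 x1 y1" entries in the title attribute are standard Tesseract hOCR conventions, while the file names and the JSON layout are assumptions:

```python
# Minimal hOCR parser: extract plain text plus per-word bounding boxes.
# class="ocrx_word" and the "bbox x0 y0 x1 y1" title entries are standard
# Tesseract hOCR output; file names and the JSON layout are assumptions.
import json
import re
from html.parser import HTMLParser
from pathlib import Path

class HocrWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = []      # [{"text": "...", "bbox": [x0, y0, x1, y1]}, ...]
        self._bbox = None    # bbox of the word span we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in attrs.get("class", ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title", ""))
            self._bbox = [int(v) for v in m.groups()] if m else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append({"text": data.strip(), "bbox": self._bbox})
            self._bbox = None

parser = HocrWords()
parser.feed(Path("page-0001.hocr").read_text(encoding="utf-8"))

plain_text = " ".join(w["text"] for w in parser.words)              # fed to the search index
Path("page-0001.words.json").write_text(json.dumps(parser.words))   # used later for highlighting
```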

Indexing the Text for Search

Search has become a commodity service with a number of stable, feature-packed open-source offerings such as Apache Solr, ElasticSearch or Xapian. Conceptually, these work with documents — i.e. complete records — which are used to build an inverted index — essentially a list of words and the documents which contain them. When you search for “whaling” the search engine performs stemming to reduce your term to a base form (e.g. “whale”) so it will match closely related words, finds the term in the index, and retrieves the list of matching documents. The results are typically sorted by calculating a score for each document based on how frequently the terms are used in that document relative to the entire corpus (see the Lucene scoring guide for the exact details about how term frequency–inverse document frequency (TF-IDF) works).
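As a toy illustration of the inverted-index idea described above (not how Solr is implemented; the crude suffix stripping stands in for a real stemmer):

```python
# Toy inverted index: map stemmed terms to the documents containing them.
# The crude suffix stripping below stands in for a real stemmer.
from collections import defaultdict

def stem(word):
    word = word.lower().strip(".,;:")
    for suffix in ("ing", "ers", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

docs = {
    "page-1": "Whaling ships and whalers",
    "page-2": "A history of whales",
    "page-3": "Village life in Foula",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[stem(word)].add(doc_id)

# "whaling", "whalers" and "whales" all reduce to the same base form, so a
# search for "whaling" finds both pages that mention whales.
print(sorted(index[stem("whaling")]))   # ['page-1', 'page-2']
```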

This approach makes traditional metadata-driven search easy: each item has a single document containing all of the available metadata and each search result links to an item-level display. Unfortunately, we need to handle both very large items and page-level results so we can send users directly to the page containing the text they searched for rather than page 1 of a large book. Storing each page as a separate document provides the necessary granularity and avoids document size limits but it breaks the ability to calculate relevancy for the entire item: the score for each page would be calculated separately and it would be impossible to search for multiple words which fall on different pages.
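A sketch of what those page-level documents might look like when indexed; the Solr URL, core name and field names are assumptions rather than WDL's actual schema, but the key point is that every page document carries its parent item's ID, which the grouping described next relies on:

```python
# Sketch: index each page as its own Solr document, keeping the item ID on
# every page so results can later be grouped back to items. Field names and
# the Solr URL are illustrative assumptions, not WDL's actual schema.
import requests

docs = [
    {"id": "item42-p001", "item_id": "item42", "page": 1,
     "ocr_text": "THE VILLAGE OF FOULA ..."},
    {"id": "item42-p002", "item_id": "item42", "page": 2,
     "ocr_text": "... whaling stations in Shetland ..."},
]

resp = requests.post(
    "http://localhost:8983/solr/pages/update?commit=true",
    json=docs,      # Solr's JSON update handler accepts an array of documents
    timeout=30,
)
resp.raise_for_status()
```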

The solution for this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team has recently completed a similar feature referred to as “aggregation”). This allows us to make a query and specify a field which will be used to group documents before determining relevancy. If we tell Solr to group our results by the item ID the search ranking will be calculated across all of the available pages and the results will contain both the item’s metadata record and any matching OCR pages.
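Continuing the sketch above, a grouped query could then look like this; the core and field names remain assumptions, while group, group.field and the highlighting parameters are standard Solr options:

```python
# Sketch: a grouped ("field collapsed") search so relevance is computed per
# item while the individual matching pages are still returned. Core and field
# names are illustrative; group/group.field/hl are standard Solr parameters.
import requests

params = {
    "q": "ocr_text:whaling",
    "group": "true",
    "group.field": "item_id",   # collapse page hits under their parent item
    "group.limit": 3,           # return up to 3 matching pages per item
    "hl": "true",               # ask Solr to highlight the matched terms
    "hl.fl": "ocr_text",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/pages/select", params=params, timeout=30)
for group in resp.json()["grouped"]["item_id"]["groups"]:
    print(group["groupValue"], "->",
          [doc["page"] for doc in group["doclist"]["docs"]])
```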

(The django-haystack Solr grouped search backend with Field Collapsing support used on wdl.org has been released into the public domain.)

Highlighting Results

At this point, we can perform a search and display a nice list of results with a single entry for each item and direct links to specific pages. Unfortunately, the raw OCR text is a simple unstructured stream of text and any OCR glitches will be displayed, as can be seen in this example where the first occurrence of “VILLAGE FOULA” was recognized incorrectly:

[Image: adams080414image2]

The next step is replacing that messy OCR text with a section of the original image. Our search results list includes all of the information we need except for the locations for each word on the page. We can use our list of word coordinates but this is complicated because the search engine’s language analysis and synonym handling mean that we cannot assume that the word on the page is the same word that was typed into the search box (e.g. a search for “runners” might return a page which mentions “running”).

Here’s what the entire process looks like:

1. The server returns an HTML results page containing all of the text returned by Solr with embedded microdata indicating the item, volume and page numbers for results and the highlighted OCR text:

[Image: adams080414image3]

2. JavaScript uses the embedded microdata to determine which search results include page-level hits and an AJAX request is made to retrieve the word coordinate lists for every matching page. The word coordinate list is used to build a list of pixel coordinates for every place where one of our search words occurs on the page:

[Image: adams080414image7]

Now we can find each word highlighted by Solr and locate it in the word coordinates list. Since Solr returned the original word and our word coordinates were generated from the same OCR text which was indexed in Solr, the highlighting code doesn’t need to handle word tenses, capitalization, etc.

3. Since we often find words in multiple places on the same page and we want to display a large, easily readable section of the page rather than just the word, our image slice will always be the full width of the page, starting at the top-most result and extending down to include subsequent matches until there is either a sizable gap or the total height is greater than the first third of the page (a sketch of this logic appears after this list).

Once the image has been loaded, the original text is replaced with the image:

[Image: adams080414image4]

4. Finally, we add a partially transparent overlay over each highlighted word:

[Image: adams080414image5]
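The word-matching and image-slicing logic from steps 2 and 3 above might be sketched as follows; the production code is JavaScript running in the browser, so this Python outline, its data shapes (the word-coordinate entries from the hOCR sketch earlier) and its gap/height thresholds are all illustrative assumptions:

```python
# Sketch of the client-side matching/slicing logic from steps 2-3, written in
# Python for readability (the production code is browser JavaScript). The
# word-coordinate format matches the hOCR sketch above; the gap and height
# thresholds are illustrative assumptions.
def find_highlight_boxes(highlighted_terms, word_coords):
    """Return the bounding box of every coordinate entry whose text matches a
    term Solr highlighted. Because Solr returns the indexed form of the word,
    a simple case-insensitive comparison is enough."""
    wanted = {term.lower() for term in highlighted_terms}
    return [w["bbox"] for w in word_coords if w["text"].lower() in wanted]

def page_slice(boxes, page_height, max_fraction=1 / 3, max_gap=200):
    """Full-width slice: start at the top-most hit and extend downward through
    later hits until there is a sizable gap or the slice exceeds a third of
    the page. Assumes at least one box; bbox order is [x0, y0, x1, y1]."""
    boxes = sorted(boxes, key=lambda b: b[1])   # sort by top edge (y0)
    top = boxes[0][1]
    bottom = boxes[0][3]
    for x0, y0, x1, y1 in boxes[1:]:
        if y0 - bottom > max_gap or (y1 - top) > page_height * max_fraction:
            break
        bottom = max(bottom, y1)
    return top, bottom                          # crop image rows [top, bottom]
```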

Notes

[Image: adams080414image6]

Challenges & Future Directions

This approach works relatively well but there are a number of areas for improvement:

rdmpage commented 7 years ago

@trosesandler I don't know if BHL has seen this post: https://blogs.loc.gov/thesignal/2014/08/making-scanned-content-accessible-using-full-text-search-and-ocr/ I've copied the contents into the previous comment so it's here for reference.