rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library
http://biostor.org

Future interface ideas #63

Open rdmpage opened 7 years ago

rdmpage commented 7 years ago

For a very different interface to historical texts see the UK Medical Heritage Library (https://ukmhl.historicaltexts.jisc.ac.uk/home).

[Screenshot: historical_texts (https://cloud.githubusercontent.com/assets/83306/26043054/4c54cf68-3932-11e7-944a-50d32472ce51.png)]

trosesandler commented 7 years ago

Rod

Thanks for sharing this site! I was familiar with this repository but this must be a new UI? There are some great ideas we could use for BHL including presentation of illustrations, full text search, etc.

Trish


rdmpage commented 7 years ago

@trosesandler Yes, I guess it would also be useful to have the NDSR folks have a play and see what they think.

trosesandler commented 7 years ago

Yep, that's exactly who I forwarded the link to - some great ideas for their work! I also forwarded it to the BHL tech team since we are currently in the process of implementing full-text search and we still need to find a way to search the image metadata we've been gathering via Flickr and Science Gossip.

Trish


rdmpage commented 7 years ago

Also relevant is Making Scanned Content Accessible Using Full-text Search and OCR

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.

The traditional approach to this problem has been to invest in cataloging and transcription but those services are expensive, particularly as flat budgets are devoted to the race to digitize faster than physical media degrades. This is obviously the right call from a preservation perspective but it still leaves us looking for less expensive alternatives.

OCR is the obvious solution for extracting machine-searchable text from an image but the quality rates usually aren’t high enough to offer the text as an alternative to the original item. Fortunately, we can hide OCR errors by using the text to search but displaying the original image to the human reader. This means our search hit rate will be lower than it would with perfect text but since the content in question is otherwise completely unsearchable anything better than no results will be a significant improvement.

Since November 2013, the World Digital Library has offered combined search results similar to what you can see in the screenshot below:

[Image: adams080414image1]

This system is entirely automated, uses only open-source software and existing server capacity, and provides an easy process to improve results for items as resources allow.

How it Works: From Scan to Web Page

Generating OCR Text

As we receive new items, any item which matches our criteria (currently books, journals and newspapers created after 1800) will automatically be placed in a task queue for processing. Each of our existing servers has a worker process which uses idle capacity to perform OCR and other background tasks. We use the Tesseract OCR engine with the generic training data for each of our supported languages to generate an HTML document using hOCR markup.
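As a rough sketch of that step (not the actual WDL worker code; the file layout and language code are assumptions), a worker might invoke Tesseract like this:

```python
# Sketch of a worker task that OCRs one page image to hOCR with Tesseract.
# Paths, the language code and the error handling are illustrative assumptions.
import subprocess
from pathlib import Path

def ocr_page_to_hocr(image_path: Path, out_dir: Path, lang: str = "eng") -> Path:
    out_base = out_dir / image_path.stem   # Tesseract appends the extension itself
    subprocess.run(
        ["tesseract", str(image_path), str(out_base), "-l", lang, "hocr"],
        check=True,
    )
    # Recent Tesseract versions write <out_base>.hocr when the hocr config is used
    return out_base.with_suffix(".hocr")
```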

The hOCR document has HTML markup identifying each detected word and paragraph and its pixel coordinates within the image. We archive this file for future usage but our system also generates two alternative formats for the rest of our system to use:

- a plain-text version of each page, which is what is fed to the search index, and
- a word-coordinate list recording the bounding box of every recognized word, which the front end later uses to highlight matches on the page image.
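For illustration, here is a minimal parser in that spirit; the ocrx_word class and the "bbox x0 y0 x1 y1" entries in the title attribute are standard Tesseract hOCR conventions, while the file names and the JSON layout are assumptions:

```python
# Minimal hOCR parser: extract plain text plus per-word bounding boxes.
# class="ocrx_word" and the "bbox x0 y0 x1 y1" title entries are standard
# Tesseract hOCR output; file names and the JSON layout are assumptions.
import json
import re
from html.parser import HTMLParser
from pathlib import Path

class HocrWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = []      # [{"text": "...", "bbox": [x0, y0, x1, y1]}, ...]
        self._bbox = None    # bbox of the word span we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in attrs.get("class", ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title", ""))
            self._bbox = [int(v) for v in m.groups()] if m else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append({"text": data.strip(), "bbox": self._bbox})
            self._bbox = None

parser = HocrWords()
parser.feed(Path("page-0001.hocr").read_text(encoding="utf-8"))

plain_text = " ".join(w["text"] for w in parser.words)              # fed to the search index
Path("page-0001.words.json").write_text(json.dumps(parser.words))   # used later for highlighting
```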

Indexing the Text for Search

Search has become a commodity service with a number of stable, feature-packed open-source offerings such as Apache Solr, ElasticSearch or Xapian. Conceptually, these work with documents — i.e. complete records — which are used to build an inverted index — essentially a list of words and the documents which contain them. When you search for “whaling” the search engine performs stemming to reduce your term to a base form (e.g. “whale”) so it will match closely related words, finds the term in the index, and retrieves the list of matching documents. The results are typically sorted by calculating a score for each document based on how frequently the terms are used in that document relative to the entire corpus (see the Lucene scoring guide for the exact details about how term frequency–inverse document frequency (TF-IDF) works).
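As a toy illustration of the inverted-index idea described above (not how Solr is implemented; the crude suffix stripping stands in for a real stemmer):

```python
# Toy inverted index: map stemmed terms to the documents containing them.
# The crude suffix stripping below stands in for a real stemmer.
from collections import defaultdict

def stem(word):
    word = word.lower().strip(".,;:")
    for suffix in ("ing", "ers", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

docs = {
    "page-1": "Whaling ships and whalers",
    "page-2": "A history of whales",
    "page-3": "Village life in Foula",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[stem(word)].add(doc_id)

# "whaling", "whalers" and "whales" all reduce to the same base form, so a
# search for "whaling" finds both pages that mention whales.
print(sorted(index[stem("whaling")]))   # ['page-1', 'page-2']
```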

This approach makes traditional metadata-driven search easy: each item has a single document containing all of the available metadata and each search result links to an item-level display. Unfortunately, we need to handle both very large items and page-level results so we can send users directly to the page containing the text they searched for rather than page 1 of a large book. Storing each page as a separate document provides the necessary granularity and avoids document size limits but it breaks the ability to calculate relevancy for the entire item: the score for each page would be calculated separately and it would be impossible to search for multiple words which fall on different pages.
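A sketch of what those page-level documents might look like when indexed; the Solr URL, core name and field names are assumptions rather than WDL's actual schema, but the key point is that every page document carries its parent item's ID, which the grouping described next relies on:

```python
# Sketch: index each page as its own Solr document, keeping the item ID on
# every page so results can later be grouped back to items. Field names and
# the Solr URL are illustrative assumptions, not WDL's actual schema.
import requests

docs = [
    {"id": "item42-p001", "item_id": "item42", "page": 1,
     "ocr_text": "THE VILLAGE OF FOULA ..."},
    {"id": "item42-p002", "item_id": "item42", "page": 2,
     "ocr_text": "... whaling stations in Shetland ..."},
]

resp = requests.post(
    "http://localhost:8983/solr/pages/update?commit=true",
    json=docs,      # Solr's JSON update handler accepts an array of documents
    timeout=30,
)
resp.raise_for_status()
```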

The solution for this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team has recently completed a similar feature referred to as “aggregation”). This allows us to make a query and specify a field which will be used to group documents before determining relevancy. If we tell Solr to group our results by the item ID the search ranking will be calculated across all of the available pages and the results will contain both the item’s metadata record and any matching OCR pages.
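Continuing the sketch above, a grouped query could then look like this; the core and field names remain assumptions, while group, group.field and the highlighting parameters are standard Solr options:

```python
# Sketch: a grouped ("field collapsed") search so relevance is computed per
# item while the individual matching pages are still returned. Core and field
# names are illustrative; group/group.field/hl are standard Solr parameters.
import requests

params = {
    "q": "ocr_text:whaling",
    "group": "true",
    "group.field": "item_id",   # collapse page hits under their parent item
    "group.limit": 3,           # return up to 3 matching pages per item
    "hl": "true",               # ask Solr to highlight the matched terms
    "hl.fl": "ocr_text",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/pages/select", params=params, timeout=30)
for group in resp.json()["grouped"]["item_id"]["groups"]:
    print(group["groupValue"], "->",
          [doc["page"] for doc in group["doclist"]["docs"]])
```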

(The django-haystack Solr grouped search backend with Field Collapsing support used on wdl.org has been released into the public domain.)

Highlighting Results

At this point, we can perform a search and display a nice list of results with a single entry for each item and direct links to specific pages. Unfortunately, the raw OCR text is a simple unstructured stream of text and any OCR glitches will be displayed, as can be seen in this example where the first occurrence of “VILLAGE FOULA” was recognized incorrectly:

[Image: adams080414image2]

The next step is replacing that messy OCR text with a section of the original image. Our search results list includes all of the information we need except for the locations for each word on the page. We can use our list of word coordinates but this is complicated because the search engine’s language analysis and synonym handling mean that we cannot assume that the word on the page is the same word that was typed into the search box (e.g. a search for “runners” might return a page which mentions “running”).

Here’s what the entire process looks like:

1. The server returns an HTML results page containing all of the text returned by Solr with embedded microdata indicating the item, volume and page numbers for results and the highlighted OCR text:

[Image: adams080414image3]

2. JavaScript uses the embedded microdata to determine which search results include page-level hits and an AJAX request is made to retrieve the word coordinate lists for every matching page. The word coordinate list is used to build a list of pixel coordinates for every place where one of our search words occurs on the page:

[Image: adams080414image7]

Now we can find each word highlighted by Solr and locate it in the word coordinates list. Since Solr returned the original word and our word coordinates were generated from the same OCR text which was indexed in Solr, the highlighting code doesn’t need to handle word tenses, capitalization, etc.

3. Since we often find words in multiple places on the same page and we want to display a large, easily readable section of the page rather than just the word, our image slice will always be the full width of the page, starting at the top-most result and extending down to include subsequent matches until there is either a sizable gap or the total height is greater than the first third of the page (a sketch of this logic appears after this list).

Once the image has been loaded, the original text is replaced with the image:

[Image: adams080414image4]

4. Finally, we add a partially transparent overlay over each highlighted word:

[Image: adams080414image5]
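The word-matching and image-slicing logic from steps 2 and 3 above might be sketched as follows; the production code is JavaScript running in the browser, so this Python outline, its data shapes (the word-coordinate entries from the hOCR sketch earlier) and its gap/height thresholds are all illustrative assumptions:

```python
# Sketch of the client-side matching/slicing logic from steps 2-3, written in
# Python for readability (the production code is browser JavaScript). The
# word-coordinate format matches the hOCR sketch above; the gap and height
# thresholds are illustrative assumptions.
def find_highlight_boxes(highlighted_terms, word_coords):
    """Return the bounding box of every coordinate entry whose text matches a
    term Solr highlighted. Because Solr returns the indexed form of the word,
    a simple case-insensitive comparison is enough."""
    wanted = {term.lower() for term in highlighted_terms}
    return [w["bbox"] for w in word_coords if w["text"].lower() in wanted]

def page_slice(boxes, page_height, max_fraction=1 / 3, max_gap=200):
    """Full-width slice: start at the top-most hit and extend downward through
    later hits until there is a sizable gap or the slice exceeds a third of
    the page. Assumes at least one box; bbox order is [x0, y0, x1, y1]."""
    boxes = sorted(boxes, key=lambda b: b[1])   # sort by top edge (y0)
    top = boxes[0][1]
    bottom = boxes[0][3]
    for x0, y0, x1, y1 in boxes[1:]:
        if y0 - bottom > max_gap or (y1 - top) > page_height * max_fraction:
            break
        bottom = max(bottom, y1)
    return top, bottom                          # crop image rows [top, bottom]
```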

Notes

[Image: adams080414image6]

Challenges & Future Directions

This approach works relatively well but there are a number of areas for improvement:

rdmpage commented 7 years ago

@trosesandler I don't know if BHL has seen this post: https://blogs.loc.gov/thesignal/2014/08/making-scanned-content-accessible-using-full-text-search-and-ocr/ I've copied the contents into the previous comment so it's here for reference.