sul-dlss / vt-arclight

An Arclight-based discovery application for materials from the Virtual Tribunals project

Analyze how to implement fulltext search of OCR #146

Closed · marlo-longley closed this issue 2 years ago

marlo-longley commented 2 years ago

Analysis needed -- to discuss as a group

Example data is here: https://argo.stanford.edu/view/vc421jh1418

Is it the XML transcription that we want to index? We need custom indexing to get OCR data into Solr. Do we do it at index time, or do we write a new task for pulling more data? How did we do full-text search in Spotlight?

corylown commented 2 years ago

Some references to similar functionality in exhibits:

This FullTextParser class knows how to extract full text information available at a purl: https://github.com/sul-dlss/exhibits/blob/master/app/models/full_text_parser.rb

The FullTextParser class is used in the traject configuration to extract full text from the purl: https://github.com/sul-dlss/exhibits/blob/master/lib/traject/dor_config.rb#L279
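As a rough sketch of what that extraction involves (the function name and the XML shape below are assumptions modeled on SDR contentMetadata, not the actual FullTextParser code), a small stdlib-only function can pull the OCR file names out of a purl's contentMetadata:

```ruby
require "rexml/document"

# Sketch: collect the names of OCR/transcription files listed in a purl's
# contentMetadata XML. Element and attribute names are assumptions modeled
# on common SDR contentMetadata, not the exhibits FullTextParser itself.
def ocr_file_names(content_metadata_xml)
  doc = REXML::Document.new(content_metadata_xml)
  names = []
  doc.elements.each("//resource/file") do |file|
    id = file.attributes["id"]
    names << id if id&.end_with?(".xml") # OCR pages are ALTO XML files
  end
  names
end
```

The real parser linked above also handles fetching the files over HTTP; this sketch only covers picking them out of the metadata.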

There's a design question too, although maybe we can mimic the approach taken in Spotlight, where documents displayed on the index page show a link to search within the document via the embed viewer, plus a toggleable sample of matches:

[Screenshot taken 2022-11-02 at 5:21:39 PM]

For click-to-search within the document using the embed widget, there's a helper that includes the value of the :search parameter in the data-embed-url attribute of the embed markup: https://github.com/sul-dlss/exhibits/blob/master/app/helpers/application_helper.rb#L57

So requests like this display the embed widget with a search & results: https://exhibits.stanford.edu/virtual-tribunals/catalog/bm424fc5843?search=Timor-Leste
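A minimal sketch of that kind of helper (the function name, query parameter names, and embed host are assumptions for illustration, not the exhibits helper itself):

```ruby
require "uri"

# Sketch: build an embed iframe URL that carries a search term through to
# the viewer. The :url and :search parameter names and the embed host are
# assumptions for illustration only.
def embed_url_with_search(purl_url, search: nil)
  params = { url: purl_url }
  params[:search] = search if search
  "https://embed.stanford.edu/iframe?#{URI.encode_www_form(params)}"
end
```

The view would then place this value in the data-embed-url attribute, as the exhibits helper linked above does.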

[Screenshot taken 2022-11-02 at 5:20:21 PM]

jcoyne commented 2 years ago

@corylown that approach works for English text (putting it in a `*_tesimv` field), but we have documents with mixed languages. The stemming won't be appropriate for non-English text.

We also have the added complexity of building a join across EAD documents and HTTP API calls to PURL. Do we have a good mechanism for doing this? Can it be done with Traject (e.g. with error handling/resume/retry)? Or do we need to build a more complex indexing pipeline?

It might be nice to have a pipeline that:

  1. Scan the doc for purl URLs
  2. Harvest the content metadata for each purl
  3. Harvest all of the page transcripts (for book-type objects)
  4. Extract all the strings from all of the transcripts and write a text file
  5. Run the regular EAD indexing and slurp in the text file if it exists
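Step 1 of the pipeline above could be as simple as a regex scan of the EAD for druids (a sketch; the pattern assumes the standard bc123df4567 druid shape):

```ruby
# Sketch of step 1: find every purl URL in the EAD and extract its druid.
# Druids follow the bc123df4567 shape: 2 letters, 3 digits, 2 letters, 4 digits.
PURL_PATTERN = %r{https?://purl\.stanford\.edu/([a-z]{2}\d{3}[a-z]{2}\d{4})}

def druids_in(ead_xml)
  ead_xml.scan(PURL_PATTERN).flatten.uniq
end
```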
corylown commented 2 years ago

@jcoyne right, this assumes we're supporting full text search for English-only. We need more information from @laurensorensen or others about requirements for this.

marlo-longley commented 2 years ago

There are two types of text content we want to index: full text from scans/OCR of document books, and video transcripts. We will prioritize OCR full-text search over video, and this ticket is about OCR indexing specifically.

Investigation into OCR XML language

Even French content is listed as en_US, so we can't differentiate between languages in the OCR, and that's OK for now. We will apply rules appropriate for English text to everything on the Solr side. If there are mismatches due to language, the text will still be returned in search, and relevancy will be good enough for now.

The Russian OCR is garbage (https://stacks.stanford.edu/file/druid:rt562jb5287/rt562jb5287_0002.xml); for example, `<String CONTENT="TPYBYHAJIA" HEIGHT="48" WIDTH="274" VPOS="1733" HPOS="1502"/>` appears to render Cyrillic as look-alike Latin characters. Let's ask Dinah how to handle this, or at least alert the relevant folks to the bad data.
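For reference, page text can be recovered from ALTO XML like the file above by joining the String CONTENT attributes (a stdlib sketch; real ALTO files often declare a default namespace, which would need extra handling):

```ruby
require "rexml/document"

# Sketch: flatten an ALTO OCR page into a plain-text string by joining every
# <String CONTENT="..."> value in document order. Real ALTO files often carry
# a default namespace that REXML XPath may need to be told about.
def alto_page_text(alto_xml)
  doc = REXML::Document.new(alto_xml)
  words = []
  doc.elements.each("//String") { |s| words << s.attributes["CONTENT"] }
  words.join(" ")
end
```

Note this would faithfully reproduce the bad transliteration above; it doesn't fix the underlying OCR quality problem.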

Steps:

  1. The harvester gets the full text. Question: are we committing this data to the repo? Do we zip the files and commit them so that our indexing runs off static files?

  2. Write our own custom indexer for NTA. How does overriding work? We need to be careful with the nested component indexing. Use a similar pattern to the FOLIO traject config: make a new config that imports all of ArcLight's and changes a few details. The full-text Solr field in Exhibits may be a multi-valued field with one entry per page; we don't need that and will use a single-valued field.

  3. Once we can get the right results in search, we will use the same approach as Exhibits to perform content search.
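Step 2 above could be sketched as a wrapper traject config (not runnable standalone; the ArcLight config path, the field name, and the `FullTextHarvest` lookup are all assumptions for illustration):

```ruby
# Sketch of a wrapper traject config. Assumptions: the ArcLight gem ships
# its EAD config at this path, and a hypothetical FullTextHarvest class
# returns the pre-harvested text for a record. Requires traject + arclight.
require "arclight"

load_config_file(Arclight::Engine.root.join("lib/arclight/traject/ead2_config.rb").to_s)

# Single-valued full-text field: one big string per document, rather than
# the per-page multi-valued approach used in Exhibits. Field name is a
# placeholder, not a confirmed schema decision.
to_field "full_text_tesiv" do |record, accumulator|
  text = FullTextHarvest.text_for(record) # hypothetical harvester lookup
  accumulator << text if text
end
```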

What are the UI implications?

Displaying search terms: the Exhibits solution needs the XML files at runtime. Do we want to support both a sample of matches and in-document search/match in the viewer?

Two pieces of functionality:

marlo-longley commented 2 years ago

Closing for now in favor of smaller chunked tickets for each step

#152, #153, #154, #155