Some references to similar functionality in exhibits:
The FullTextParser class knows how to extract the full-text information available at a PURL:
https://github.com/sul-dlss/exhibits/blob/master/app/models/full_text_parser.rb
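For context, a condensed sketch of that pattern (hypothetical class name; the PURL JSON structure and the 'transcription' use-flag are assumptions -- the real parser handles more resource types and encodings):

```ruby
require 'net/http'
require 'json'

# Hypothetical fetcher, loosely modeled on exhibits' FullTextParser:
# read the PURL metadata, find the OCR files, and pull their text.
class FullTextFetcher
  def initialize(druid)
    @druid = druid
  end

  # Concatenate the text of every OCR file listed in the PURL metadata.
  def to_text
    ocr_filenames.map { |name| fetch("https://stacks.stanford.edu/file/#{@druid}/#{name}") }.join("\n")
  end

  private

  def purl
    @purl ||= JSON.parse(fetch("https://purl.stanford.edu/#{@druid}.json"))
  end

  def ocr_filenames
    purl.dig('structural', 'contains').to_a.flat_map do |file_set|
      file_set.dig('structural', 'contains').to_a
              .select { |f| f['use'] == 'transcription' } # assumed flag
              .map { |f| f['filename'] }
    end
  end

  def fetch(url)
    Net::HTTP.get(URI(url))
  end
end

FullTextFetcher.new('rt562jb5287').to_text
```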
The FullTextParser class is used in the traject configuration to extract full text from the PURL:
https://github.com/sul-dlss/exhibits/blob/master/lib/traject/dor_config.rb#L279
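The traject side is then essentially one step that delegates to the parser -- roughly this (a sketch, assuming to_text returns the page texts as an array):

```ruby
# Roughly what the linked dor_config step does: delegate to the parser
# and store the page texts in a multi-valued field.
to_field 'full_text_tesimv' do |resource, accumulator|
  accumulator.concat FullTextParser.new(resource).to_text
end
```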
There's a design question too, although maybe we can mimic the approach taken in Spotlight, where documents displayed on the index page show a link to search within the document via the embed viewer, plus a toggle-able sample of matches:
For the click-to-search within the document using the embed widget, there's a helper that includes the value of the :search parameter in the data-embed-url attribute of the embed markup: https://github.com/sul-dlss/exhibits/blob/master/app/helpers/application_helper.rb#L57
So requests like this display the embed widget with a search & results: https://exhibits.stanford.edu/virtual-tribunals/catalog/bm424fc5843?search=Timor-Leste
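A hypothetical condensed version of that helper (names here are made up; the real one lives at the link above):

```ruby
# Thread the user's query through to sul-embed so the viewer opens with
# the search already executed.
def embed_url_with_search(purl_url)
  return purl_url if params[:search].blank?

  "#{purl_url}?search=#{CGI.escape(params[:search])}"
end

# Used when rendering the embed markup, e.g.:
#   tag.div(data: { embed_url: embed_url_with_search(purl_url) })
```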
@corylown that approach works for English text (putting it in a *_tesimv field), but we have documents with mixed languages, and the stemming won't be appropriate for the non-English text.
We also have the added complexity of building a join across ETD documents and HTTP API calls to PURL. Do we have a good mechanism for doing this? Can it be done with Traject (e.g. with error handling/resume/retry)? Do we need to build a more complex indexing pipeline?
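Even without a bigger pipeline, a Traject step can at least wrap the PURL call with basic retry/skip behavior -- a sketch, not how exhibits currently does it:

```ruby
# Retry transient PURL/stacks failures a few times, then skip full text
# for this record rather than failing the whole indexing run.
to_field 'full_text_tesimv' do |resource, accumulator, _context|
  attempts = 0
  begin
    accumulator.concat FullTextParser.new(resource).to_text
  rescue StandardError => e
    attempts += 1
    retry if attempts < 3
    warn "full text skipped: #{e.message}"
  end
end
```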
It might be nice to have a pipeline that:
@jcoyne right, this assumes we're supporting full-text search for English only. We need more information from @laurensorensen or others about the requirements for this.
There are 2 types of text content we want to index: full text from scans/OCR of document books, and video transcripts. We will prioritize OCR full-text search over video; this ticket is about OCR indexing specifically.
Even French content is listed as en_US.
We can't differentiate between languages in the OCR, and that's OK for now.
We will apply rules appropriate for English text to everything on the Solr side. If there are mismatches due to language, matching text will still be returned in search results, and relevancy will be good enough for now.
Russian OCR is garbage (https://stacks.stanford.edu/file/druid:rt562jb5287/rt562jb5287_0002.xml): `<String CONTENT="TPYBYHAJIA" HEIGHT="48" WIDTH="274" VPOS="1733" HPOS="1502"/>` -- the Cyrillic has been recognized as Latin lookalike characters.
Let's ask Dinah about how to handle this -- or at least alert the relevant folks to the bad data.
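For reference, pulling the recognized strings out of an ALTO file like that one is straightforward, which also makes it easy to spot-check the garbled Cyrillic (a sketch with Nokogiri):

```ruby
require 'net/http'
require 'nokogiri'

# Reconstruct page text from an ALTO OCR file: each <String> element
# carries one recognized word in its CONTENT attribute.
url = 'https://stacks.stanford.edu/file/druid:rt562jb5287/rt562jb5287_0002.xml'
doc = Nokogiri::XML(Net::HTTP.get(URI(url)))
doc.remove_namespaces!
puts doc.xpath('//String').map { |s| s['CONTENT'] }.join(' ')
# => "... TPYBYHAJIA ..." (Cyrillic read as Latin lookalikes)
```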
Harvester gets the full text -- question: are we committing this data to the repo? Zip the files and commit them so that our indexing happens off static files?
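A sketch of that harvest-and-commit idea (rubyzip; the druid list and page-file naming are placeholders -- a real harvester would enumerate every page file per druid):

```ruby
require 'net/http'
require 'zip' # rubyzip

# Download each druid's OCR file(s) once and zip them, so indexing runs
# off a static archive instead of live PURL/stacks calls.
druids = %w[rt562jb5287 vc421jh1418] # placeholder list

Zip::File.open('data/fulltext.zip', Zip::File::CREATE) do |zip|
  druids.each do |druid|
    xml = Net::HTTP.get(URI("https://stacks.stanford.edu/file/druid:#{druid}/#{druid}_0002.xml"))
    zip.get_output_stream("#{druid}.xml") { |f| f.write(xml) }
  end
end
```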
Write our own custom indexer for NTA -- how does overriding work? We need to be careful with the nested component indexing. Use a similar pattern to the FOLIO traject config: make a new config that imports all of ArcLight's and changes a few details. The full-text Solr field in Exhibits may be a multi-valued field with one entry per page -- we don't need that and will use a single-valued field.
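Structurally, that override could look something like this (the ArcLight config path may differ by version; the single-valued field name and the druid_for helper are assumptions, and FullTextFetcher is the hypothetical fetcher sketched earlier):

```ruby
require 'arclight'

# Reuse all of ArcLight's EAD indexing rules, then layer NTA-specific
# fields on top.
load_config_file(Arclight::Engine.root.join('lib/arclight/traject/ead2_config.rb').to_s)

# Single-valued full text, unlike Exhibits' per-page multi-valued field.
to_field 'full_text_tesi' do |record, accumulator|
  accumulator << FullTextFetcher.new(druid_for(record)).to_text # druid_for is hypothetical
end
```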
Once we can get the right results in search, we will use the same approach as Exhibits to perform content search.
Displaying search terms -- the Exhibits solution needs the XML files to run. Do we want to support both a sample of matches and in-document search/match in the viewer?
2 pieces of functionality:
1. Search for "paper" in document text (a link that knows about content search / the viewer). Uses a different index / sul-embed.
2. Sample matches in document text (normal Solr hit highlighting) -- see the Solr docs on highlighting and the sketch below.
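For the second piece, the sample matches come straight from Solr highlighting; a sketch with RSolr (core URL, field name, and snippet params are assumptions):

```ruby
require 'rsolr'

solr = RSolr.connect(url: 'http://localhost:8983/solr/nta-core') # placeholder core
response = solr.get('select', params: {
  q: 'paper',
  hl: true,
  'hl.fl' => 'full_text_tesimv', # assumed field name
  'hl.snippets' => 3,
  'hl.fragsize' => 120
})

# Highlighted fragments come back keyed by document id.
response['highlighting'].each do |id, fields|
  puts id, fields['full_text_tesimv']
end
```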
Closing for now in favor of smaller chunked tickets for each step
Analysis needed -- to discuss as a group
Example data here: https://argo.stanford.edu/view/vc421jh1418. Is it the XML transcription that we want to index? We need custom indexing to get OCR data into Solr -- do we do it at index time, or do we write a new task for pulling more data? How did we do full-text search in Spotlight?