samvera-labs / ramp

Interactive, IIIF powered audio/video media player React components library. Styleguidist Docs: https://samvera-labs.github.io/ramp/
https://ramp.avalonmediasystem.org/

Untimed Text Transcripts in Transcript Search Component #513

Closed joncameron closed 2 months ago

joncameron commented 4 months ago

Description

Untimed text files should be able to be searched just like timed text transcripts, but the search component developed by Third Wave doesn't account for untimed text material. Ramp should be able to support search and highlighting for these transcripts as well. Currently, results returned from the search service for untimed text aren't set up to be highlighted and loaded into the search results for navigation.

The current JS implementation by Third Wave may also be unable to index and query Word documents, since it was not designed for untimed text.

The tricky part of this implementation is figuring out how to make the search hits in untimed text work with the previous/next buttons in the results navigator, since those hits are not indexed in the JS code. Highlighting the search hits in the transcript display shouldn't be that hard, as we could re-use the highlights in the search response to do this.
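A minimal sketch of one way to give untimed hits a stable index for previous/next navigation. This is not Ramp's implementation: `findHits` and `createNavigator` are hypothetical names, matching is a plain case-insensitive substring search, and wrapping around at the ends is an arbitrary choice.

```javascript
// Hypothetical helpers for illustration only; not part of Ramp's API.

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Find every case-insensitive occurrence of `query` in untimed text and
// return hits with character offsets, so each hit gets a stable index.
function findHits(text, query) {
  if (!query) return [];
  const pattern = new RegExp(escapeRegExp(query), 'gi');
  return Array.from(text.matchAll(pattern), (m, i) => ({
    index: i,
    start: m.index,
    end: m.index + m[0].length,
  }));
}

// Step through hits with previous/next, wrapping around at both ends.
function createNavigator(hits) {
  let current = hits.length > 0 ? 0 : -1;
  const hitAt = () => (current === -1 ? null : hits[current]);
  return {
    current: hitAt,
    next() {
      if (hits.length > 0) current = (current + 1) % hits.length;
      return hitAt();
    },
    previous() {
      if (hits.length > 0) current = (current - 1 + hits.length) % hits.length;
      return hitAt();
    },
  };
}
```

The offsets refer to the raw transcript string, so the same hit list could drive both the highlight spans and the scroll target for the navigator.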

This can most likely wait until after the 7.8 release.

Done Looks Like

elynema commented 3 months ago

Note that the lack of this feature can be somewhat confusing. For instance, a hit may be returned in search results due to a match in the .txt transcript, but the hit cannot be found using the search within feature on the media object page.

For example, search 'basketball' in the repository. One of the hits is: https://avalon-dev.dlib.indiana.edu/media_objects/x920fw86z

It has 3 transcripts. The first 2 webvtts are searchable, but do not contain the word basketball. The last txt transcript does, but it does not show a hit count next to it and there is no search within feature when you select that transcript.

This is potentially confusing for users.

elynema commented 3 months ago

Another example where the hit gets pulled up as a search result for transcript hits if you search 'indianapolis', but you can't actually search the 2 transcripts available: https://avalon-dev.dlib.indiana.edu/media_objects/9k41zd49s

elynema commented 3 months ago

For QA, we should test a variety of formatting:

Check that count and navigation are working properly.

Dananji commented 3 months ago

This can be tested on the Ramp demo site.

elynema commented 3 months ago

@Dananji I took a first pass on the demo site.

When I type in more than one search term, the hits in the transcript text are italicized, but they are not bolded and do not change color.
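For reference when reproducing this, a minimal sketch of treating every term of a multi-term query the same way when marking hits. It is an assumption that each whitespace-separated term should be highlighted on its own (rather than as a phrase), and `splitByTerms` is a hypothetical helper, not Ramp's code.

```javascript
// Hypothetical helper for illustration only; not part of Ramp's API.

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Split `text` into segments, flagging every occurrence of any query term.
// A renderer can then apply one highlight style to all segments with isHit.
function splitByTerms(text, query) {
  const terms = query.trim().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [{ text, isHit: false }];

  const pattern = new RegExp(terms.map(escapeRegExp).join('|'), 'gi');
  const segments = [];
  let last = 0;
  for (const m of text.matchAll(pattern)) {
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), isHit: false });
    }
    segments.push({ text: m[0], isHit: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), isHit: false });
  }
  return segments;
}
```

Returning segments instead of an HTML string keeps the styling decision in one place, so single-term and multi-term hits cannot end up with different markup.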

elynema commented 3 months ago

I also found one rather nasty .txt transcript that is not working well. See manifest: https://avalon-dev.dlib.indiana.edu/media_objects/g158bh28p/manifest.json. Select the 5th section (the .mp3). I've uploaded several non-timed transcripts. The third one, "transcript (1) (1).txt", is one big chunk of text without line breaks. A search for 'whitaker' claims 188 hits next to the transcript name in the drop-down, but then also says 'no results found'.

Presumably the issue is parsing through hits for a solid chunk of text?
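A minimal sketch of building one result per hit from a transcript that has no line breaks, by cutting a context window around each match instead of relying on lines. This only illustrates the idea; `snippetsForHits` and its `radius` parameter are hypothetical and not taken from Ramp.

```javascript
// Hypothetical helper for illustration only; not part of Ramp's API.

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// For each case-insensitive match of `query`, return the match offset and a
// snippet of up to `radius` characters of context on either side.
function snippetsForHits(text, query, radius = 30) {
  if (!query) return [];
  const pattern = new RegExp(escapeRegExp(query), 'gi');
  const results = [];
  for (const m of text.matchAll(pattern)) {
    const start = Math.max(0, m.index - radius);
    const end = Math.min(text.length, m.index + m[0].length + radius);
    results.push({ offset: m.index, snippet: text.slice(start, end) });
  }
  return results;
}
```

With this approach the number of results always equals the number of matches found in the text, regardless of how the text is broken into lines.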

elynema commented 3 months ago

Otherwise, I tested the functionality across Android, iOS, Chrome, and Safari, and it seems to work. This change should be tested as a new Ramp build in Avalon as well.

Dananji commented 2 months ago

For the following search, the content search response gives only 188 hits, while there are 192 hits in the transcript text (from browser search using Cmd + F). Search 'whitaker' in the 5th section of https://avalon-dev.dlib.indiana.edu/media_objects/g158bh28p/

cjcolvar commented 2 months ago

> For the following search, the content search response gives only 188 hits, while there are 192 hits in the transcript text (from browser search using Cmd + F). Search 'whitaker' in the 5th section of https://avalon-dev.dlib.indiana.edu/media_objects/g158bh28p/

This could be the Solr query in Avalon reaching a limit. We may need to increase a threshold, adjust our query, or index differently. Can you write up a ticket in Avalon for this?
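As a hedged illustration of the kind of threshold that might be involved: Solr's highlighter has `hl.snippets` (maximum snippets per field) and `hl.maxAnalyzedChars` (how much of the field is analyzed for highlighting). Whether Avalon's query uses these, and with what values, is an assumption here; `buildHighlightParams` and the field name `transcript_text` are hypothetical.

```javascript
// Hypothetical sketch; Avalon's real query and field names may differ.
// Builds Solr highlighting parameters with explicit, raised limits.
function buildHighlightParams(term, { snippets, maxAnalyzedChars }) {
  const params = new URLSearchParams();
  params.set('q', term);
  params.set('hl', 'true');
  params.set('hl.fl', 'transcript_text'); // hypothetical field name
  params.set('hl.snippets', String(snippets)); // max snippets returned per field
  params.set('hl.maxAnalyzedChars', String(maxAnalyzedChars)); // chars analyzed per field
  return params;
}

// Example: allow more snippets and a longer analyzed span than the defaults.
const params = buildHighlightParams('whitaker', {
  snippets: 500,
  maxAnalyzedChars: 1000000,
});
console.log(params.toString());
```

If the missing hits all fall near the end of the long transcript, a character limit of this kind would be a plausible explanation, but that would need to be confirmed against the actual Avalon query.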

elynema commented 2 months ago

Dananji put in a new PR for this; the latest changes are on the Ramp demo site, not in Avalon yet.

joncameron commented 2 months ago

👍