sul-dlss / speech-to-text

Tools for generating transcript and caption files from media files (e.g. a Docker container for running Whisper on video files in AWS ECS? 🤷🏽)

When using Whisper's auto-detected language, insert that language into the Cocina #45

Open andrewjbtw opened 4 hours ago

andrewjbtw commented 4 hours ago

When testing captioning for a Bengali-language video, Whisper output text in what appears to be Bengali. (After checking with the curator for South Asian materials, it appears that no one on staff can read or speak Bengali.)

However, we are not yet applying a language tag to these caption files, which results in the display showing the language as English (the default). If we can get the language from Whisper, then we have a place to put it in the Cocina for the VTT file, like so:

"type": "https://cocina.sul.stanford.edu/models/file",
"externalIdentifier": "https://cocina.sul.stanford.edu/file/9ae07267-1b89-40c3-a6b2-ad265894ab66",
"label": "qf378nj5000_spa_cap.vtt",
"filename": "qf378nj5000_spa_cap.vtt",
"size": 54775,
"version": 17,
"hasMimeType": "text/vtt",
"languageTag": "es",
"use": "caption",

Note that even though "spa" (for Spanish) is in the filename of the vtt in that example, it's the languageTag field that makes the difference to the display.
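To make the point about languageTag concrete, here is a minimal sketch of setting that field on a file entry. It only assumes the entry is available as a plain dict with the keys shown in the example above; the helper name `with_language_tag` and the filename are hypothetical, and the real update would go through whatever Cocina tooling common-accessioning already uses.

```python
import copy


def with_language_tag(cocina_file: dict, language: str) -> dict:
    """Return a copy of a Cocina file entry with languageTag set.

    Hypothetical helper: the input is treated as a plain dict and is
    not modified in place.
    """
    updated = copy.deepcopy(cocina_file)
    updated["languageTag"] = language
    return updated


# A caption file entry with no languageTag (filename is made up).
caption_file = {
    "type": "https://cocina.sul.stanford.edu/models/file",
    "filename": "example_cap.vtt",
    "hasMimeType": "text/vtt",
    "use": "caption",
}

# "bn" is the code Whisper uses for Bengali.
tagged = with_language_tag(caption_file, "bn")
```

The filename is left untouched here because, as noted, only the languageTag field affects the display.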

Screenshot of current display:

[Image: Screenshot 2024-11-15 at 10 55 08 AM]

jmartin-sul commented 2 hours ago

note: there might be work in two places: in this repo, to return the caption language in the "done" message that's queued when whisper completes, and in common-accessioning, to update the cocina accordingly. totally fine to split the common-accessioning work into a separate ticket that's blocked by this one, for whoever picks this one up.
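A rough sketch of what the work in this repo could look like, in case it helps whoever picks this up. The openai-whisper `transcribe()` call returns a dict that includes a `"language"` key (the detected code when no language was specified). Everything about the message itself is an assumption: `build_done_message`, the `id`, `state`, and `language` fields are hypothetical and would need to match the actual "done" message format.

```python
import json


def build_done_message(job_id, whisper_result, requested_language=None):
    """Build a JSON "done" message that carries the caption language.

    Hypothetical message shape. Prefers a language that was explicitly
    requested for the job, and otherwise falls back to the language
    reported in the Whisper result dict.
    """
    language = requested_language or whisper_result.get("language")
    message = {"id": job_id, "state": "done"}
    if language:
        message["language"] = language
    return json.dumps(message)


# Stand-in for the dict returned by whisper's transcribe(); only the
# "language" key matters for this sketch.
sample_result = {"text": "", "segments": [], "language": "bn"}

done = json.loads(build_done_message("example-job", sample_result))
```

common-accessioning could then read `language` from the message and write it to the languageTag of the VTT file, which is the part that can live in a separate ticket.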