sul-dlss / speech-to-text

Tools for generating transcript and caption files from media files (e.g. a Docker container for running Whisper on video files in AWS ECS? 🤷🏽)

File specific language specification #51

Open edsu opened 4 days ago

edsu commented 4 days ago

speechToTextWF currently sends jobs to the speech-to-text service for an SDR Item, which may include multiple media files, and a set of options to use for all of them:

{
  "id": "gy983cn1444-v2",
  "media": [
    "snl_tomlin_phone_company_en.mp4",
    "snl_tomlin_phone_company_es.mp4"
  ],
  "options": {
    "language": "en"
  }
}

Whisper output can vary depending on the language option. Furthermore, an SDR Item can have files in more than one language. So we want users to be able to specify which language each file should be transcribed in.

One way to communicate that would be to turn the list of strings in media into a list of objects, each with a name property and an optional options property that overrides the options for the job as a whole.

In this example the job includes two files, one processed as English and the other as Spanish:

{
  "id": "gy983cn1444-v2",
  "media": [
    {
      "name": "snl_tomlin_phone_company_en.mp4",
      "options": {
        "language": "en"
      }
    },
    {
      "name": "snl_tomlin_phone_company_es.mp4",
      "options": {
        "language": "es"
      }
    }
  ]
}
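
To make the override behavior concrete, here is a minimal sketch of how the receiving side might resolve options for each file. It is not taken from the speech-to-text code: the function name `resolve_media_options` is hypothetical, and it assumes that per-file options are merged over the job-level options (rather than replacing them wholesale) and that plain-string media entries are still accepted for backwards compatibility.

```python
from typing import Any


def resolve_media_options(job: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one {"name", "options"} dict per media file.

    Hypothetical helper: per-file options are layered on top of the
    job-level options, so a file only needs to list what it overrides.
    """
    job_options = job.get("options", {})
    resolved = []
    for entry in job.get("media", []):
        if isinstance(entry, str):
            # Current format: a bare filename that inherits the job options.
            name, file_options = entry, {}
        else:
            # Proposed format: an object with a name and optional overrides.
            name, file_options = entry["name"], entry.get("options", {})
        # Later keys win, so file-level values override job-level ones.
        resolved.append({"name": name, "options": {**job_options, **file_options}})
    return resolved


if __name__ == "__main__":
    job = {
        "id": "gy983cn1444-v2",
        "media": [
            {"name": "snl_tomlin_phone_company_en.mp4", "options": {"language": "en"}},
            {"name": "snl_tomlin_phone_company_es.mp4", "options": {"language": "es"}},
        ],
    }
    for item in resolve_media_options(job):
        print(item["name"], item["options"])
```

With merging, a job could keep a top-level options block as the default and list only the files that differ. Whether the service should merge or require every file to carry complete options is still an open design question for this issue.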