learningequality / pressurecooker

A library of various media and content processing utilities for use in Ricecooker
MIT License

Subtitles formats and compatible youtube_language code checks #28

Closed: ivanistheone closed 4 years ago

ivanistheone commented 6 years ago

This is sample data from the subtitles info returned by yt_resource.get_resource_subtitles() for this video.

{
  "en": [
    {
      "url": "https://www.youtube.com/api/timedtext?lang=en&v=FN12ty5ztAs&fmt=ttml&name=+via+Dotsub",
      "ext": "ttml"
    },
    {
      "url": "https://www.youtube.com/api/timedtext?lang=en&v=FN12ty5ztAs&fmt=vtt&name=+via+Dotsub",
      "ext": "vtt"
    }
  ],
  "fr": [
    {
      "url": "https://www.youtube.com/api/timedtext?lang=fr&v=FN12ty5ztAs&fmt=ttml&name=+via+Dotsub",
      "ext": "ttml"
    },
    {
      "url": "https://www.youtube.com/api/timedtext?lang=fr&v=FN12ty5ztAs&fmt=vtt&name=+via+Dotsub",
      "ext": "vtt"
    }
  ],
  "zu": [
    {
      "url": "https://www.youtube.com/api/timedtext?lang=zu&v=FN12ty5ztAs&fmt=ttml&name=+via+Dotsub",
      "ext": "ttml"
    },
    {
      "url": "https://www.youtube.com/api/timedtext?lang=zu&v=FN12ty5ztAs&fmt=vtt&name=+via+Dotsub",
      "ext": "vtt"
    }
  ]
}

Upstream code will have to be aware of the following data issues:

  1. Must process only the ext=vtt subs and ignore the ext=ttml ones.
  2. Must check that the YouTube language code is compatible with our internal representation, as defined in le-utils. Ricecooker uses the function is_youtube_subtitle_file_supported_language, which it would make sense to move to pressurecooker so it can be used here too. For example, is_youtube_subtitle_file_supported_language('zu') returns True since zu can be mapped to the internal language code zul.
  3. For incompatible languages: should skip the subtitle file, raise a warning, and send an email to admins@studio so we'll know about it and can add the code to le-utils. In ricecooker someone is checking the logs, so there is a human in the loop who can see when we run into an incompatible language code, but if the YouTube import is running as a background task on Studio we won't know about it. (Maybe also tell the user that we failed to import certain languages, though that's not very useful since there is nothing they can do about it; it's the LE admins' job to add the lang code to le-utils.)
  4. We have to map the YouTube language code to the internal language representation before creating the subtitle files on Studio. Ricecooker provides another helper function, _get_language_with_alpha2_fallback, for this purpose; it returns the language object. For example, calling _get_language_with_alpha2_fallback('zu') returns Language(native_name='isiZulu', primary_code='zul', subcode=None, name='Zulu', ka_name=None).
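
To make the four points concrete, here is a rough sketch of how upstream code might walk the subtitles dict. It is only an illustration: pick_vtt_subtitles is a hypothetical name, and the is_supported and to_internal_code parameters are stand-ins for is_youtube_subtitle_file_supported_language and for reading primary_code off the object returned by _get_language_with_alpha2_fallback. They are passed in as plain callables so the sketch does not assume where those helpers end up living.

```python
import logging

logger = logging.getLogger(__name__)


def pick_vtt_subtitles(subtitles_info, is_supported, to_internal_code):
    """Return ({internal_code: vtt_url}, [skipped YouTube codes]).

    Hypothetical helper: `is_supported` and `to_internal_code` are
    stand-ins for the le-utils based checks discussed in this issue.
    """
    picked = {}
    skipped = []
    for yt_code, tracks in subtitles_info.items():
        # Points 2 and 3: skip codes that cannot be mapped, and log a warning.
        if not is_supported(yt_code):
            logger.warning("Skipping subtitles for unsupported code %r", yt_code)
            skipped.append(yt_code)
            continue
        # Point 1: keep only the vtt track and ignore ttml.
        vtt_urls = [t["url"] for t in tracks if t.get("ext") == "vtt"]
        if not vtt_urls:
            continue
        # Point 4: key the result by the internal language code.
        picked[to_internal_code(yt_code)] = vtt_urls[0]
    return picked, skipped


# Example with a tiny hand-written mapping instead of the real le-utils data.
known = {"zu": "zul"}
sample = {
    "zu": [
        {"url": "https://example.org/zu.ttml", "ext": "ttml"},
        {"url": "https://example.org/zu.vtt", "ext": "vtt"},
    ],
    "xx": [{"url": "https://example.org/xx.vtt", "ext": "vtt"}],
}
picked, skipped = pick_vtt_subtitles(sample, known.__contains__, known.__getitem__)
```

Returning the skipped codes alongside the result leaves it up to the caller (for example a Studio background task) to decide how to notify admins; the sketch itself only logs a warning.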

@kollivier I'm going to open a PR to move these functions to pressurecooker, so you'll have them available once you start working on the Studio youtube import functionality.

ivanistheone commented 4 years ago

The helper method is_youtube_subtitle_file_supported_language is now available to determine whether a YouTube language code can be mapped to one of the internal language codes in le-utils.