hypothesis / via

Proxies third-party PDF files and HTML pages with the Hypothesis client embedded, so you can annotate them
https://via.hypothes.is/
BSD 2-Clause "Simplified" License

Spike: how can we select the transcript for a YouTube video? #1062

Closed seanh closed 1 year ago

seanh commented 1 year ago

A YouTube video can have multiple transcripts. For example, there can be transcripts in different languages such as "English", "English (US)", "Spanish", etc. There can also be multiple transcripts for the same language, e.g. "English", "English - DTVCC1", "English - CC1", etc. Finally, there can be both manually-created and machine-generated transcripts, e.g. "English" and "English (auto-generated)".

The YouTube UI lists all the available transcripts and lets the user select exactly which transcript they want, including selecting between multiple transcripts for the same language.

Ideally we'd like Via (and LMS) to let the user select between the same list of transcripts that YouTube's own UI does, including letting the user select the exact transcript that they want even when there are multiple transcripts for the same language or manually-created and auto-generated transcripts for the same language.
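One way to make that kind of selection unambiguous is to give every transcript its own identifier instead of keying on language alone. The following is only a sketch under assumptions: it supposes each transcript can be described by a language code, a display name and an auto-generated flag. `TranscriptInfo`, `transcript_id` and `group_by_language` are hypothetical names for illustration, not part of Via or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptInfo:
    """Hypothetical description of one transcript as shown in a picker."""

    language_code: str  # e.g. "en"
    name: str  # display name, e.g. "English - DTVCC1"
    is_generated: bool  # True for machine-generated transcripts


def transcript_id(info):
    """Build an ID that distinguishes transcripts sharing a language."""
    kind = "auto" if info.is_generated else "manual"
    return f"{info.language_code}:{kind}:{info.name}"


def group_by_language(transcripts):
    """Group transcripts by language code, keeping their original order."""
    groups = {}
    for info in transcripts:
        groups.setdefault(info.language_code, []).append(info)
    return groups
```

A UI could list the groups by language and submit the chosen `transcript_id` value, so that "English" and "English (auto-generated)" never collide.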

How can we let the user select the transcript to use, given that we're using https://github.com/jdepoix/youtube-transcript-api to get the transcripts (and we probably don't want to replace that library right now as that would take some time)?

Some examples

jon-betts commented 1 year ago

I think this is pretty easy. It's well supported by the API exposed by the library. Here is some example code:

from youtube_transcript_api import YouTubeTranscriptApi

def list_transcripts(video_id):
    for transcript in YouTubeTranscriptApi.list_transcripts(video_id):
        # This is just to demonstrate the interface
        yield {
            "name": transcript.language,
            "language_code": transcript.language_code,
            "is_autogenerated": transcript.is_generated,
            "translatable_to": transcript.translation_languages
        }

def get_transcript(video_id, name, translate_to=None):
    # This is how you'd probably get them
    for transcript in YouTubeTranscriptApi.list_transcripts(video_id):
        if transcript.language != name:
            continue

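        if translate_to:
            # translate() returns a new transcript object rather than
            # modifying this one in place, so keep the result
            transcript = transcript.translate(translate_to)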

        return transcript

    raise ValueError("No such transcript!")

if __name__ == '__main__':
    video_id = 'qQ6a0iOzyHE'
    for transcript in list_transcripts(video_id):
        print(transcript)

    transcript = get_transcript(video_id, name='English (auto-generated)')
    transcript_text = transcript.fetch(preserve_formatting=False)
    print(transcript_text[:3])

In fact, it's more flexible than we'd initially imagined. Many transcripts written in one language can also be downloaded translated into another. We can expose this functionality if we choose to, so that users can switch the transcript to a language they understand when no suitable transcript can be found.
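As a rough illustration of that fallback, here is a sketch that works on plain dicts shaped like the ones `list_transcripts()` yields above. It assumes each entry in `translatable_to` is a dict with a `"language_code"` key; `choose_transcript` is a hypothetical helper, not something the library provides.

```python
def choose_transcript(listing, wanted):
    """Pick a transcript for the language code `wanted`.

    Returns (entry, translate_to). translate_to is None when the entry is
    already in the wanted language. Returns (None, None) if nothing fits.
    """
    same_language = [t for t in listing if t["language_code"] == wanted]
    # Stable sort: manually-created (False) sorts before auto-generated (True)
    same_language.sort(key=lambda t: t["is_autogenerated"])
    if same_language:
        return same_language[0], None

    # Otherwise fall back to the first transcript translatable to `wanted`
    for entry in listing:
        # Assumption: each target is a dict with a "language_code" key
        codes = {lang["language_code"] for lang in entry["translatable_to"]}
        if wanted in codes:
            return entry, wanted

    return None, None
```

The returned `translate_to` value could then be passed on to `get_transcript()`.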

What I don't know yet is:

seanh commented 1 year ago

This spike has been completed; see https://github.com/hypothesis/via/pull/1082. Note that @jon-betts's comment above is incorrect: the youtube-transcript-api library's transcript listing is flawed, and we'll need to replace it with our own code.