hypothesis / via

Proxies third-party PDF files and HTML pages with the Hypothesis client embedded, so you can annotate them
https://via.hypothes.is/
BSD 2-Clause "Simplified" License

Improve Via's automatic YouTube transcript language selection #1013

Closed seanh closed 1 year ago

seanh commented 1 year ago

Improve the algorithm that Via uses to select the transcript language for YouTube videos.

Currently (as of https://github.com/hypothesis/via/pull/1010) Via relies on the default behaviour of https://github.com/jdepoix/youtube-transcript-api, which is to pick the "first" English transcript and raise an error if there are no English transcripts. I don't know exactly what ordering "first" implies, but I think it prefers manually created transcripts over machine-generated ones. For videos with no English transcripts I think we probably want Via to pick one of the non-English transcripts rather than erroring, but if there are several, how should it choose between them?
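To make the question concrete, here is a minimal sketch of one possible selection rule. It is not Via's implementation and it does not call youtube-transcript-api: `TranscriptInfo` and `pick_transcript` are hypothetical names, and the ranking (preferred language first, then manually created before machine-generated, then listing order) is an assumption made for illustration. In particular, falling back to listing order among non-English transcripts is an arbitrary placeholder for the open question above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TranscriptInfo:
    """Hypothetical stand-in for the metadata a transcript library might return."""

    language_code: str  # e.g. "en", "en-GB", "fr"
    is_generated: bool  # True for machine-generated captions


def pick_transcript(
    transcripts: Sequence[TranscriptInfo],
    preferred_language: str = "en",
) -> Optional[TranscriptInfo]:
    """Pick a transcript without user input, or return None if there are none.

    Assumed ranking, best first:
      1. transcripts in the preferred language (regional variants count),
      2. manually created transcripts before machine-generated ones,
      3. the order in which the transcripts were listed.
    """

    def rank(item):
        index, transcript = item
        # "en-GB" -> "en", so regional variants match the preferred language.
        primary = transcript.language_code.split("-")[0].lower()
        # False sorts before True, so matches and manual transcripts win.
        return (primary != preferred_language, transcript.is_generated, index)

    if not transcripts:
        return None
    return min(enumerate(transcripts), key=rank)[1]
```

With this rule a video that only has non-English transcripts still gets one instead of an error. Whether a machine-generated English transcript should beat a manually created non-English one is a judgment call; the sketch assumes it should.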

Note that there's a separate issue (https://github.com/hypothesis/lms/issues/5406) about enabling instructors to see a list of the available languages and pick the one they want. That's not what this issue is about. This issue is about Via automatically choosing a transcript without any user input.

This fully automatic transcript selection is always going to be needed as a fallback: first because the feature has to work before we've implemented manual transcript selection, and second because there are always going to be ways for users to bypass the manual transcript selection UI and view a YouTube video without having specified a transcript language. For example, this will happen when annotating YouTube videos on public Via (if we ever enable this feature there), on QA public Via (which some of us are using to test this feature), when an LMS user creates an assignment by pasting a URL into the raw URL box rather than using the new YouTube picker, etc.

seanh commented 1 year ago

Closing in favour of https://github.com/hypothesis/via/issues/1110