jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike other Selenium-based solutions!
MIT License

language selection is not region agnostic #283

Closed lsloan closed 2 months ago

lsloan commented 2 months ago

DO NOT DELETE THIS! Please take the time to fill this out properly. I am not able to help you if I do not know what you are executing and what error messages you are getting. If you are having problems with a specific video make sure to include the video id.

To Reproduce

Steps to reproduce the behavior:

YouTubeTranscriptApi.list_transcripts('qGulvsKFyvo').find_manually_created_transcript(('en',))

What code / cli command are you executing?

I am running…

YouTubeTranscriptApi.list_transcripts('qGulvsKFyvo').find_manually_created_transcript(('en',))

Which Python version are you using?

Python 3.11

Which version of youtube-transcript-api are you using?

youtube-transcript-api 0.6.2

Expected behavior

Describe what you expected to happen.

I expected the module to find the English manually-created transcript. This video has a transcript labelled en-US (as seen in the error message below), which should be returned regardless of region code (US in this case).

Actual behaviour

Describe what is happening instead of the Expected behavior. Add error messages if there are any.

Instead, I received the following error message…

youtube_transcript_api._errors.NoTranscriptFound: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=qGulvsKFyvo! This is most likely caused by:

No transcripts were found for any of the requested language codes: ('en',)

For this video (qGulvsKFyvo) transcripts are available in the following languages:

(MANUALLY CREATED)
 - en-US ("English (United States)")[TRANSLATABLE]

(GENERATED)
 - en ("English (auto-generated)")[TRANSLATABLE]

(TRANSLATION LANGUAGES)

jdepoix commented 2 months ago

Hi @lsloan, this has already been discussed here, but the TL;DR is:

Since a single video can have multiple transcripts in different English dialects, we cannot simply fall back to any of them when there is no en transcript, as this would require this module to implicitly select some over others. As a user of this module, you can choose to do so yourself by doing something like YouTubeTranscriptApi.get_transcript('9-jIplX6Wjw', languages=['en', 'en-GB']).

I don't want to bake into this library that a certain English dialect should be preferred over others, so I'd rather have the user make that decision based on their requirements.
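A minimal sketch of making that choice on the user side, assuming the available language codes have already been collected (for example from the listing in the error message above). The helper name explicit_codes is hypothetical and not part of youtube-transcript-api. It only expands a base language into the concrete codes present on the video, so the preference order stays under the caller's control.

```python
def explicit_codes(base, available_codes):
    """Return the codes that are `base` itself or a regional variant of it.

    The order of `available_codes` is preserved, so the caller can still
    reorder the result to express a dialect preference.
    Comparison is case-insensitive (an assumption of this sketch).
    """
    base = base.lower()
    return [
        code
        for code in available_codes
        if code.lower() == base or code.lower().startswith(base + "-")
    ]


# Manually created codes as listed in the error message for qGulvsKFyvo.
manual_codes = ["en-US"]
languages = explicit_codes("en", manual_codes)
print(languages)  # ['en-US']
```

The resulting list could then be passed as the language codes to find_manually_created_transcript() or as languages= to get_transcript().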

lsloan commented 2 months ago

Hi, @jdepoix.

First of all, thank you for producing and maintaining this module. It's very helpful.

I appreciate the response and I partly agree with your logic. I think there should be a feature to give dialects of languages in the language_codes precedence over other languages.

For example, let's say I specify the language list ('en', 'fr') and the video that I am working with has transcripts with languages en-bz and fr. If I ask this module to get the transcript for the video, it will return the one for fr. That's valid, because I specified fr as a language I would accept. However, it's not ideal, because I gave en a higher precedence and a caption in that language was present.

This could be resolved by adding an option to find_transcript() (and its variations) to ignore dialect/region subtags when comparing language tags. That is, given the hypothetical example from above, this module would see the en-bz caption of the video as the best match for the en language I prefer. That might be implemented by checking for exact matches as it does now and, if that fails, checking whether any of the video's transcript tags start with one of the languages in the list I specified.

In code, we could use this feature as…

YouTubeTranscriptApi.list_transcripts('qGulvsKFyvo')\
  .find_manually_created_transcript(('en',), ignore_subtags=True)

In this case, it would return the en-US transcript.
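A rough sketch of that matching, working on plain lists of language codes rather than the library's transcript objects. The function find_code is hypothetical, and ignore_subtags is the proposed option, not an existing parameter. The description above leaves open whether the prefix check runs per requested language or only after every exact match has failed; this sketch assumes the per-language reading, since that is what the ('en', 'fr') example needs.

```python
def find_code(requested, available, ignore_subtags=False):
    """Return the first available code matching the requested codes, or None.

    For each requested code, in order: try an exact match, then (only if
    ignore_subtags is set) take the first available code that is a
    regional variant of it, such as 'en-bz' for 'en'.
    """
    for wanted in requested:
        if wanted in available:
            return wanted
        if ignore_subtags:
            prefix = wanted.lower() + "-"
            for code in available:
                if code.lower().startswith(prefix):
                    return code
    return None


print(find_code(("en", "fr"), ["en-bz", "fr"]))                       # fr
print(find_code(("en", "fr"), ["en-bz", "fr"], ignore_subtags=True))  # en-bz
print(find_code(("en",), ["en-US"], ignore_subtags=True))             # en-US
```

With the option off, the behaviour matches the exact-only lookup described in this thread. With it on, the first regional variant in the order of the available list is chosen.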

jdepoix commented 2 months ago

I see your point, but what would you do if there is an en-US and an en-bz transcript? In that case we'd have to implicitly choose one, without the user being aware of it (which I am not too fond of), or return multiple transcripts (which would require changing the interface for a very edge-case thing).

lsloan commented 1 month ago

I would say that whichever of the en-US and en-bz transcripts the YouTube API happens to return first is the one to return.

As a user of the module, if I prefer US English, to ensure a better outcome in that case I should use find_transcript() with the languages ('en', 'en-us', 'fr').

In the other case that I described, I would still expect using that list of three languages to return the en-bz transcript from the video that contains those for en-bz and fr.