Closed lsloan closed 2 months ago
Hi @lsloan, this has already been discussed here, but the TL;DR is:
Since there can be multiple transcripts with different English dialects on a single video, we cannot simply fall back to any of them when there is no en transcript, as this would require this module to implicitly select some over others. As a user of this module, you can choose to do so by doing something like YouTubeTranscriptApi.get_transcript('9-jIplX6Wjw', languages=['en', 'en-GB']).
I don't want to bake into this library that a certain English dialect should be preferred over others, so I'd rather have the user make that decision based on their requirements.
Hi, @jdepoix.
First of all, thank you for producing and maintaining this module. It's very helpful.
I appreciate the response and I partly agree with your logic. I think there should be a feature to give dialects of languages in the language_codes precedence over other languages.
For example, let's say I specify the language list ('en', 'fr') and the video that I am working with has transcripts with languages en-bz and fr. If I ask this module to get the transcript for the video, it will return the one for fr. That's valid, because I specified fr is a language I would accept. However, it's not ideal, because I specified en has a higher precedence and a caption in that language was present.
This could be resolved by adding an option to find_transcript() (and its variations) to ignore dialect/region subtags when comparing language tags. That is, when given the hypothetical example from above, this module would see the en-bz caption of the video as the best match for the en language I prefer. That might be implemented by checking for exact matches as it does now and, if that fails, then checking with startswith() whether the video's transcript tags begin with each of the languages in the list I specified.
In code, we could use this feature as…
YouTubeTranscriptApi.list_transcripts('qGulvsKFyvo')\
.find_manually_created_transcript(('en',), ignore_subtags=True)
In this case, it would return the en-US transcript.
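The matching rule proposed above could be sketched roughly as follows. This is only an illustration and not part of youtube-transcript-api: the function name match_language and its ignore_subtags parameter are hypothetical, and it works on plain language-tag strings rather than on the library's transcript objects. It assumes each preferred language is tried in order (exact match first, then subtag-insensitive match), that comparison is case-insensitive, and that a prefix only counts when it is followed by a hyphen, so that 'fi' does not match 'fil'.

```python
def match_language(preferred, available, ignore_subtags=False):
    """Return the first tag in `available` matching `preferred`, or None.

    Hypothetical helper: `preferred` is an ordered sequence of language
    tags, `available` is the list of tags a video offers.
    """
    for wanted in preferred:
        wanted_lc = wanted.lower()

        # Step 1: exact match (compared case-insensitively; an assumption).
        for tag in available:
            if tag.lower() == wanted_lc:
                return tag

        # Step 2 (optional): accept a dialect of the wanted language,
        # e.g. 'en-bz' for 'en'. The trailing hyphen avoids 'fi' -> 'fil'.
        if ignore_subtags:
            for tag in available:
                if tag.lower().startswith(wanted_lc + "-"):
                    return tag

    return None


# The example from the comment: ('en', 'fr') against en-bz and fr.
print(match_language(("en", "fr"), ["en-bz", "fr"]))                       # fr
print(match_language(("en", "fr"), ["en-bz", "fr"], ignore_subtags=True))  # en-bz
```

Without the option the behaviour stays as it is today (only exact matches), so existing callers would not be affected in this sketch.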
I see your point, but what would you do if there is an en-US and an en-bz transcript? In that case we'd have to implicitly choose one, without the user being aware of it (which I am not too fond of), or return multiple transcripts (which would require changing the interface for a very edge case thing).
I would say that whichever of the en-US and en-bz transcripts the YouTube API happens to return first is the one to return.
As a user of the module, if I prefer US English, to ensure a better outcome in that case I should use find_transcript() with the languages ('en', 'en-us', 'fr').
In the other case that I described, I would still expect using that list of three languages to return the en-bz transcript from the video that contains those for en-bz and fr.
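One possible reading of the tie-breaking behaviour described in this comment, as a rough sketch. Nothing here exists in youtube-transcript-api: match_grouped is a hypothetical name, it operates on plain tag strings, and the order of the available list stands in for the order in which YouTube returns transcripts. The assumption is that preferences are grouped by primary language (the part before the first hyphen); within a group, exact matches win in the order listed, and otherwise the first available dialect of that language is returned.

```python
def match_grouped(preferred, available):
    """Pick a tag from `available`, ignoring subtags but honouring exact hits.

    Hypothetical helper illustrating one interpretation of the proposal.
    """
    def primary(tag):
        # 'en-US' -> 'en'; tags are compared case-insensitively (assumption).
        return tag.lower().split("-")[0]

    handled = []
    for wanted in preferred:
        lang = primary(wanted)
        if lang in handled:
            continue
        handled.append(lang)

        # 1. Exact matches for any preference of this language, in list order.
        for pref in preferred:
            if primary(pref) != lang:
                continue
            for tag in available:
                if tag.lower() == pref.lower():
                    return tag

        # 2. Otherwise the first available dialect of this language wins.
        for tag in available:
            if primary(tag) == lang:
                return tag

    return None


langs = ("en", "en-us", "fr")
print(match_grouped(langs, ["en-bz", "en-US", "fr"]))  # en-US
print(match_grouped(langs, ["en-bz", "fr"]))           # en-bz
print(match_grouped(("en",), ["en-bz", "en-US"]))      # en-bz
```

Under this reading, a caller who only lists 'en' accepts whichever dialect comes first, while a caller who also lists 'en-us' gets that dialect whenever it exists.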
To Reproduce
Steps to reproduce the behavior:
What code / cli command are you executing?
I am running…
Which Python version are you using?
Python 3.11
Which version of youtube-transcript-api are you using?
youtube-transcript-api 0.6.2
Expected behavior
Describe what you expected to happen.
I expected the module to find the English manually-created transcript. This video has a transcript labelled en-US (as seen in the error message below), which should be returned regardless of region code (US in this case).
Actual behaviour
Describe what is happening instead of the Expected behavior. Add error messages if there are any.
Instead, I received the following error message…