jdepoix / youtube-transcript-api

This is a python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require an API key nor a headless browser, like other selenium based solutions do!
MIT License

A video with non-English subtitles, translate(['en']) is null #218

Closed MaxwellEdisons closed 8 months ago

MaxwellEdisons commented 11 months ago


To Reproduce

For a video with non-English subtitles, e.g. https://www.youtube.com/watch?v=aJ957mEAFis, transcript.translate(['en']).fetch() returns an empty list. If I replace 'en' with another language such as 'de', it returns fine.

Which Python version are you using?

Python 3.11

Expected behavior

I'm not sure whether this is simply what the YouTube interface returns or whether there is a bug in the code. If you have time, could you help check the problem?

t-naeem commented 11 months ago

This comment was written on 29 July 2023. It may or may not still be valid.

I have used your youtube video ID that you attached.

from youtube_transcript_api import YouTubeTranscriptApi
youtube_ID = "aJ957mEAFis"
transcript_list = YouTubeTranscriptApi.list_transcripts(youtube_ID)

If you print the "transcript_list" variable with print(transcript_list), it will return this:

For this video (aJ957mEAFis) transcripts are available in the following languages:

(MANUALLY CREATED)
 - zh-Hans ("Chinese (Simplified)")[TRANSLATABLE]
 - en ("English")[TRANSLATABLE]

(GENERATED)
None

(TRANSLATION LANGUAGES)
 - af ("Afrikaans")
 - ak ("Akan")
 - sq ("Albanian")
 - am ("Amharic")
 - ar ("Arabic")
 - hy ("Armenian")
 - as ("Assamese")
 - ay ("Aymara")
 - az ("Azerbaijani")
 - bn ("Bangla")
 - eu ("Basque")
 - be ("Belarusian")
 - bho ("Bhojpuri")
 - bs ("Bosnian")
 - bg ("Bulgarian")
 - my ("Burmese")
 - ca ("Catalan")
 - ceb ("Cebuano")
 - zh-Hans ("Chinese (Simplified)")
 - zh-Hant ("Chinese (Traditional)")
 - co ("Corsican")
 - hr ("Croatian")
 - cs ("Czech")
 - da ("Danish")
 - dv ("Divehi")
 - nl ("Dutch")
 - en ("English")
 - eo ("Esperanto")
 - et ("Estonian")
 - ee ("Ewe")
 - fil ("Filipino")
 - fi ("Finnish")
 - fr ("French")
 - gl ("Galician")
 - lg ("Ganda")
 - ka ("Georgian")
 - de ("German")
 - el ("Greek")
 - gn ("Guarani")
 - gu ("Gujarati")
 - ht ("Haitian Creole")
 - ha ("Hausa")
 - haw ("Hawaiian")
 - iw ("Hebrew")
 - hi ("Hindi")
 - hmn ("Hmong")
 - hu ("Hungarian")
 - is ("Icelandic")
 - ig ("Igbo")
 - id ("Indonesian")
 - ga ("Irish")
 - it ("Italian")
 - ja ("Japanese")
 - jv ("Javanese")
 - kn ("Kannada")
 - kk ("Kazakh")
 - km ("Khmer")
 - rw ("Kinyarwanda")
 - ko ("Korean")
 - kri ("Krio")
 - ku ("Kurdish")
 - ky ("Kyrgyz")
 - lo ("Lao")
 - la ("Latin")
 - lv ("Latvian")
 - ln ("Lingala")
 - lt ("Lithuanian")
 - lb ("Luxembourgish")
 - mk ("Macedonian")
 - mg ("Malagasy")
 - ms ("Malay")
 - ml ("Malayalam")
 - mt ("Maltese")
 - mi ("Māori")
 - mr ("Marathi")
 - mn ("Mongolian")
 - ne ("Nepali")
 - nso ("Northern Sotho")
 - no ("Norwegian")
 - ny ("Nyanja")
 - or ("Odia")
 - om ("Oromo")
 - ps ("Pashto")
 - fa ("Persian")
 - pl ("Polish")
 - pt ("Portuguese")
 - pa ("Punjabi")
 - qu ("Quechua")
 - ro ("Romanian")
 - ru ("Russian")
 - sm ("Samoan")
 - sa ("Sanskrit")
 - gd ("Scottish Gaelic")
 - sr ("Serbian")
 - sn ("Shona")
 - sd ("Sindhi")
 - si ("Sinhala")
 - sk ("Slovak")
 - sl ("Slovenian")
 - so ("Somali")
 - st ("Southern Sotho")
 - es ("Spanish")
 - su ("Sundanese")
 - sw ("Swahili")
 - sv ("Swedish")
 - tg ("Tajik")
 - ta ("Tamil")
 - tt ("Tatar")
 - te ("Telugu")
 - th ("Thai")
 - ti ("Tigrinya")
 - ts ("Tsonga")
 - tr ("Turkish")
 - tk ("Turkmen")
 - uk ("Ukrainian")
 - ur ("Urdu")
 - ug ("Uyghur")
 - uz ("Uzbek")
 - vi ("Vietnamese")
 - cy ("Welsh")
 - fy ("Western Frisian")
 - xh ("Xhosa")
 - yi ("Yiddish")
 - yo ("Yoruba")
 - zu ("Zulu")

Now notice that there are three types of transcripts available:

  1. Manually Created
  2. Generated
  3. Translation Languages

Notice that both English (en) and Chinese (zh-Hans) appear under both Manually Created and Translation Languages. As a result, you can access these two languages only through Manually Created, using the following lines of code:

transcript_obtained_in_english = transcript_list.find_transcript(["en"]).fetch()
print(transcript_obtained_in_english)

transcript_obtained_in_chinese = transcript_list.find_transcript(["zh-Hans"]).fetch()
print(transcript_obtained_in_chinese)

However, you cannot access these two languages through Translation Languages, because the authors of this GitHub repository have made unique language codes for each language.

To access a translated language from the Translation Languages list (e.g. Russian (ru)), you can write the following code:

for transcription in transcript_list:
    # Translate each available transcript to Russian, then fetch and print it.
    transcript_obtained_in_russian = transcription.translate("ru").fetch()
    print(transcript_obtained_in_russian)

Unfortunately, if you pass "en" to translate in the code above, it will give an error. However, if you pass "zh-Hans", it surprisingly works. If you then replace "zh-Hans" with "en", it works this time. This is very strange behavior that I couldn't figure out. Maybe it is because the original audio is in "zh-Hans", so once we translate to that, "en" starts to work as well. It may be a minor issue in the code logic.
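One way to sidestep the problem is to use a transcript that already exists in the target language and to translate only when none does. The following is a minimal sketch, not part of the library: fetch_in_language is a hypothetical helper name, and it assumes each transcript object exposes language_code, is_translatable, translate() and fetch(), as the library's Transcript objects do.

```python
def fetch_in_language(transcripts, target_code):
    """Return transcript entries in target_code, translating only if needed.

    `transcripts` is any iterable of transcript-like objects, for example the
    result of YouTubeTranscriptApi.list_transcripts(video_id).
    """
    transcripts = list(transcripts)

    # Prefer a transcript that is already in the target language.
    for transcript in transcripts:
        if transcript.language_code == target_code:
            return transcript.fetch()

    # Otherwise translate the first transcript that supports translation.
    for transcript in transcripts:
        if transcript.is_translatable:
            return transcript.translate(target_code).fetch()

    raise LookupError(f"no transcript can be provided in {target_code!r}")
```

With the variables from above, fetch_in_language(transcript_list, "en") would fetch the manually created English transcript directly, while fetch_in_language(transcript_list, "ru") would fall back to translation.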

Anyways, hope this solves your issue.

MaxwellEdisons commented 11 months ago

extremely grateful

jdepoix commented 8 months ago

Hi @MaxwellEdisons, sorry for the late reply! @t-naeem already pointed out most things (thank you for that!), but I think the main reason this is happening is that when you call get_transcript without specifying a language code, 'en' will be used as a default. You did not provide the code you use to fetch the transcript, but I would assume that you did not provide a language code (in which case 'en' was used). So by calling transcript.translate("en") you're trying to translate English to English, and it seems that YouTube returns an empty transcript if you do that. It will actually happen on the YouTube page as well if you set the language to English and then try to translate to English. I think I might add a check to the translate method in the future to make sure the target language is different from the source language, as translating from en to en doesn't really make sense, I guess. So it's not too surprising that YouTube returns rubbish.
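Until such a check exists, a similar guard can be applied on the caller's side. This is only a rough sketch of the idea, not the library's implementation: translate_checked is a hypothetical helper name, and it assumes the transcript object exposes language_code, is_translatable and translate(), as the Transcript objects returned by list_transcripts do.

```python
def translate_checked(transcript, target_code):
    """Translate a transcript unless it is already in target_code."""
    if transcript.language_code == target_code:
        # Translating a language into itself is what produces the empty
        # result, so hand back the original transcript unchanged.
        return transcript

    if not transcript.is_translatable:
        raise ValueError(
            f"transcript in {transcript.language_code!r} cannot be translated"
        )

    return transcript.translate(target_code)
```

Calling translate_checked(transcript, "en").fetch() would then return the original English transcript instead of sending an en-to-en translation request to YouTube.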