MaxwellEdisons closed this issue 8 months ago
This comment was written on 29-July-2023. It may or may not still be valid.
I used the YouTube video ID that you attached.
from youtube_transcript_api import YouTubeTranscriptApi
youtube_ID = "aJ957mEAFis"
transcript_list = YouTubeTranscriptApi.list_transcripts(youtube_ID)
When you print the transcript_list variable as shown below:
print(transcript_list)
It will print this:
For this video (aJ957mEAFis) transcripts are available in the following languages:
(MANUALLY CREATED)
- zh-Hans ("Chinese (Simplified)")[TRANSLATABLE]
- en ("English")[TRANSLATABLE]
(GENERATED)
None
(TRANSLATION LANGUAGES)
- af ("Afrikaans")
- ak ("Akan")
- sq ("Albanian")
- am ("Amharic")
- ar ("Arabic")
- hy ("Armenian")
- as ("Assamese")
- ay ("Aymara")
- az ("Azerbaijani")
- bn ("Bangla")
- eu ("Basque")
- be ("Belarusian")
- bho ("Bhojpuri")
- bs ("Bosnian")
- bg ("Bulgarian")
- my ("Burmese")
- ca ("Catalan")
- ceb ("Cebuano")
- zh-Hans ("Chinese (Simplified)")
- zh-Hant ("Chinese (Traditional)")
- co ("Corsican")
- hr ("Croatian")
- cs ("Czech")
- da ("Danish")
- dv ("Divehi")
- nl ("Dutch")
- en ("English")
- eo ("Esperanto")
- et ("Estonian")
- ee ("Ewe")
- fil ("Filipino")
- fi ("Finnish")
- fr ("French")
- gl ("Galician")
- lg ("Ganda")
- ka ("Georgian")
- de ("German")
- el ("Greek")
- gn ("Guarani")
- gu ("Gujarati")
- ht ("Haitian Creole")
- ha ("Hausa")
- haw ("Hawaiian")
- iw ("Hebrew")
- hi ("Hindi")
- hmn ("Hmong")
- hu ("Hungarian")
- is ("Icelandic")
- ig ("Igbo")
- id ("Indonesian")
- ga ("Irish")
- it ("Italian")
- ja ("Japanese")
- jv ("Javanese")
- kn ("Kannada")
- kk ("Kazakh")
- km ("Khmer")
- rw ("Kinyarwanda")
- ko ("Korean")
- kri ("Krio")
- ku ("Kurdish")
- ky ("Kyrgyz")
- lo ("Lao")
- la ("Latin")
- lv ("Latvian")
- ln ("Lingala")
- lt ("Lithuanian")
- lb ("Luxembourgish")
- mk ("Macedonian")
- mg ("Malagasy")
- ms ("Malay")
- ml ("Malayalam")
- mt ("Maltese")
- mi ("Māori")
- mr ("Marathi")
- mn ("Mongolian")
- ne ("Nepali")
- nso ("Northern Sotho")
- no ("Norwegian")
- ny ("Nyanja")
- or ("Odia")
- om ("Oromo")
- ps ("Pashto")
- fa ("Persian")
- pl ("Polish")
- pt ("Portuguese")
- pa ("Punjabi")
- qu ("Quechua")
- ro ("Romanian")
- ru ("Russian")
- sm ("Samoan")
- sa ("Sanskrit")
- gd ("Scottish Gaelic")
- sr ("Serbian")
- sn ("Shona")
- sd ("Sindhi")
- si ("Sinhala")
- sk ("Slovak")
- sl ("Slovenian")
- so ("Somali")
- st ("Southern Sotho")
- es ("Spanish")
- su ("Sundanese")
- sw ("Swahili")
- sv ("Swedish")
- tg ("Tajik")
- ta ("Tamil")
- tt ("Tatar")
- te ("Telugu")
- th ("Thai")
- ti ("Tigrinya")
- ts ("Tsonga")
- tr ("Turkish")
- tk ("Turkmen")
- uk ("Ukrainian")
- ur ("Urdu")
- ug ("Uyghur")
- uz ("Uzbek")
- vi ("Vietnamese")
- cy ("Welsh")
- fy ("Western Frisian")
- xh ("Xhosa")
- yi ("Yiddish")
- yo ("Yoruba")
- zu ("Zulu")
Now notice that the output lists three types of transcripts: Manually Created, Generated, and Translation Languages.
Notice also that both English (en) and Chinese (zh-Hans) appear under Manually Created as well as under Translation Languages. As a result, you can access these two languages only through the manually created transcripts, using the following lines of code:
transcript_obtained_in_english = transcript_list.find_transcript(["en"]).fetch()
print(transcript_obtained_in_english)
transcript_obtained_in_chinese = transcript_list.find_transcript(["zh-Hans"]).fetch()
print(transcript_obtained_in_chinese)
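The manually created versus generated split shown in the listing can also be checked in code. The following is a minimal sketch, assuming each transcript object exposes the language_code and is_generated attributes described in the library's README. FakeTranscript and group_by_origin are hypothetical names introduced here so that the example runs without network access; they are not part of the library.

```python
from dataclasses import dataclass


@dataclass
class FakeTranscript:
    """Hypothetical stand-in for a library transcript object (illustration only)."""
    language_code: str
    is_generated: bool


def group_by_origin(transcripts):
    """Split transcripts into manually created and auto-generated language codes."""
    manual, generated = [], []
    for t in transcripts:
        (generated if t.is_generated else manual).append(t.language_code)
    return {"manual": manual, "generated": generated}


# Mirrors the listing for this video: two manual transcripts, no generated ones.
listing = [FakeTranscript("zh-Hans", False), FakeTranscript("en", False)]
groups = group_by_origin(listing)
print(groups)
```

With the real library, the same function should accept the transcript_list object directly, since iterating over it yields transcript objects.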
However, you cannot access these two languages from Translation Languages, because the authors of this GitHub repository have made the language codes unique for each language.
To access a language from the Translation Languages list (e.g. Russian (ru)), you can write the following code:
for transcription in transcript_list:
    transcript_obtained_in_russian = transcription.translate("ru").fetch()
    print(transcript_obtained_in_russian)
Unfortunately, if you write "en" in place of "ru" in the translate call, it will give an error. However, if you write "zh-Hans", it surprisingly works. If you then replace "zh-Hans" with "en", it works this time. This is very strange behavior that I couldn't figure out. Maybe it is because the original audio is in "zh-Hans", so once we translate that, "en" starts to work as well. It may be a minor issue in the code logic.
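One way to sidestep this behavior is to choose the source transcript explicitly instead of translating every transcript in the loop. The sketch below assumes the explanation given by the maintainer further down in this thread (translating a transcript into its own language yields nothing useful) and assumes transcript objects expose language_code and is_translatable. FakeTranscript and pick_source are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeTranscript:
    """Hypothetical stand-in for a library transcript object (illustration only)."""
    language_code: str
    is_translatable: bool = True


def pick_source(transcripts, target):
    """Return (transcript, needs_translation) for the requested target language."""
    candidates = list(transcripts)
    # Prefer a transcript that is already in the target language.
    for t in candidates:
        if t.language_code == target:
            return t, False
    # Otherwise fall back to the first transcript that can be translated.
    for t in candidates:
        if t.is_translatable:
            return t, True
    raise LookupError(f"no transcript can provide {target!r}")


available = [FakeTranscript("zh-Hans"), FakeTranscript("en")]
source_en, needs_en = pick_source(available, "en")
source_ru, needs_ru = pick_source(available, "ru")
print(source_en.language_code, needs_en)  # en False
print(source_ru.language_code, needs_ru)  # zh-Hans True
```

With a real transcript, you would then call source.translate(target).fetch() when needs_translation is true, and source.fetch() otherwise.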
Anyways, hope this solves your issue.
extremely grateful
Hi @MaxwellEdisons,
Sorry for the late reply!
@t-naeem already pointed out most things (thank you for that!), but I think the main reason this is happening is that when you call get_transcript without specifying a language code, 'en' will be used as a default. You did not provide the code you use to fetch the transcript, but I would assume that you did not provide a language code (in which case 'en' was used). So by calling transcript.translate("en") you're trying to translate English to English, and it seems that YouTube returns an empty transcript if you do that. It will actually happen on the YouTube page as well if you set the language to English and then try to translate to English.
I think I might add a check to the translate method in the future to make sure the target language is different from the source language, as translating from en to en doesn't really make sense, I guess. So it's not too surprising that YouTube returns rubbish.
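Until such a check exists in the library, a caller-side guard along these lines could avoid the en to en case. This is only a sketch of the idea, not the library's implementation: fetch_in_language is a hypothetical helper, and FakeTranscript is a stand-in that imitates the empty result reported in this thread so the example runs offline. It assumes transcript objects expose language_code, fetch() and translate(language_code).

```python
class FakeTranscript:
    """Hypothetical stand-in mimicking fetch() and translate() (illustration only)."""

    def __init__(self, language_code, snippets):
        self.language_code = language_code
        self._snippets = snippets

    def fetch(self):
        return self._snippets

    def translate(self, language_code):
        # Imitates the behavior reported in this thread: translating a
        # transcript into its own language comes back empty.
        if language_code == self.language_code:
            return FakeTranscript(language_code, [])
        translated = [f"[{language_code}] {s}" for s in self._snippets]
        return FakeTranscript(language_code, translated)


def fetch_in_language(transcript, target):
    """Fetch a transcript in the target language, skipping no-op translations."""
    if transcript.language_code == target:
        # Source and target match, so fetch directly instead of translating.
        return transcript.fetch()
    return transcript.translate(target).fetch()


english = FakeTranscript("en", ["hello", "world"])
same = fetch_in_language(english, "en")
german = fetch_in_language(english, "de")
print(same)
print(german)
```

The guard compares language codes only, so it relies on the caller passing the target in the same form the transcript reports (for example "en" rather than "eng").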
DO NOT DELETE THIS! Please take the time to fill this out properly. I am not able to help you if I do not know what you are executing and what error messages you are getting. If you are having problems with a specific video make sure to include the video id.
To Reproduce
A video with non-English subtitles, e.g. https://www.youtube.com/watch?v=aJ957mEAFis, returns an empty list when I use transcript.translate(['en']).fetch(). If I replace 'en' with another language like 'de', it returns fine.
Which Python version are you using?
Python 3.11
Expected behavior
I'm not sure if this is how the YouTube interface behaves or if there is a bug in the code. If you have time, could you help check the problem?