jdepoix / youtube-transcript-api

This is a python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require an API key nor a headless browser, like other selenium based solutions do!
MIT License

Can't read chinese subtitle in R #198

Closed. eyedu closed this issue 1 year ago.

eyedu commented 1 year ago

I am using R to call the function get_caption.

I tried the code on numerous URLs, but it does not work on any video with Chinese subtitles. I am sure that subtitles are available for all of these videos.

https://www.youtube.com/watch?v=4HLSBvlv0Ug
https://www.youtube.com/watch?v=oE0yPwT-c3Q&t=1s
https://www.youtube.com/watch?v=uIqegdIwtW0&t=72s

jdepoix commented 1 year ago

Hi @eyedu,

please elaborate on exactly what you are executing and what the output is. This is a Python module, and it has no get_caption function.

eyedu commented 1 year ago

Hi, thanks for getting back to me.

I was trying to run the package in R on videos with Chinese subtitles, and I always get the same error message.

I am sure that the video has captions, but I don't know why it didn't work. I tried doing the same thing in Python, but that didn't work either.

Please see my R code below.

url <- ("https://www.youtube.com/watch?v=4HLSBvlv0Ug&t=85s")
caption <- get_caption(url)
Error: youtube_transcript_api._errors.NoTranscriptFound: <... omitted ...>omali")

  • st ("Southern Sotho")
  • es ("Spanish")
  • su ("Sundanese")
  • sw ("Swahili")
  • sv ("Swedish")
  • tg ("Tajik")
  • ta ("Tamil")
  • tt ("Tatar")
  • te ("Telugu")
  • th ("Thai")
  • ti ("Tigrinya")
  • ts ("Tsonga")
  • tr ("Turkish")
  • tk ("Turkmen")
  • uk ("Ukrainian")
  • ur ("Urdu")
  • ug ("Uyghur")
  • uz ("Uzbek")
  • vi ("Vietnamese")
  • cy ("Welsh")
  • fy ("Western Frisian")
  • xh ("Xhosa")
  • yi ("Yiddish")
  • yo ("Yoruba")
  • zu ("Zulu")

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
See reticulate::py_last_error() for details

jdepoix commented 1 year ago

I am still not sure what get_caption does, but judging from that code I would assume that you are passing the URL where a video id is expected. The video id for https://www.youtube.com/watch?v=4HLSBvlv0Ug&t=85s is 4HLSBvlv0Ug.

Does that solve your problem?
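
For illustration, the URL-versus-id distinction can be sketched in Python with only the standard library. The helper name extract_video_id is hypothetical and is not part of youtube_transcript_api. The sketch assumes the id sits in the v query parameter of a youtube.com watch URL, or in the path of a youtu.be short link.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url_or_id: str) -> str:
    """Return the video id from a watch URL, a youtu.be link, or a bare id.

    Hypothetical helper for illustration; not part of youtube_transcript_api.
    """
    parsed = urlparse(url_or_id)
    if not parsed.scheme:
        # No scheme: assume the caller already passed a bare video id.
        return url_or_id

    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the id as the URL path.
        return parsed.path.lstrip("/")
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch URLs carry the id in the "v" query parameter;
        # other parameters such as "t" are ignored.
        ids = parse_qs(parsed.query).get("v")
        if ids:
            return ids[0]
    raise ValueError(f"Could not find a video id in {url_or_id!r}")
```

Calling extract_video_id("https://www.youtube.com/watch?v=4HLSBvlv0Ug&t=85s") returns "4HLSBvlv0Ug", which is the value to pass on as the video id.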

eyedu commented 1 year ago

Hi Jonas, thanks for getting back to me. I re-ran the code in Python, and this is what I got.

from youtube_transcript_api import YouTubeTranscriptApi

YouTubeTranscriptApi.get_transcript("4HLSBvlv0Ug")


NoTranscriptFound                         Traceback (most recent call last)
<ipython-input-3-fcae0a96a7d8> in <module>
      1 from youtube_transcript_api import YouTubeTranscriptApi
      2 
----> 3 YouTubeTranscriptApi.get_transcript("4HLSBvlv0Ug")

~/opt/anaconda3/lib/python3.8/site-packages/youtube_transcript_api/_api.py in get_transcript(cls, video_id, languages, proxies, cookies)
    130         """
    131         assert isinstance(video_id, str), "`video_id` must be a string"
--> 132         return cls.list_transcripts(video_id, proxies, cookies).find_transcript(languages).fetch()
    133 
    134     @classmethod

~/opt/anaconda3/lib/python3.8/site-packages/youtube_transcript_api/_transcripts.py in find_transcript(self, language_codes)
    177         :raises: NoTranscriptFound
    178         """
--> 179         return self._find_transcript(language_codes, [self._manually_created_transcripts, self._generated_transcripts])
    180 
    181     def find_generated_transcript(self, language_codes):

~/opt/anaconda3/lib/python3.8/site-packages/youtube_transcript_api/_transcripts.py in _find_transcript(self, language_codes, transcript_dicts)
    213                     return transcript_dict[language_code]
    214 
--> 215         raise NoTranscriptFound(
    216             self.video_id,
    217             language_codes,

NoTranscriptFound: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=4HLSBvlv0Ug! This is most likely caused by:

No transcripts were found for any of the requested language codes: ('en',)

For this video (4HLSBvlv0Ug) transcripts are available in the following languages:

(MANUALLY CREATED)
 - zh-TW ("Chinese (Taiwan)")[TRANSLATABLE]

(GENERATED)
None

(TRANSLATION LANGUAGES)
 - af ("Afrikaans")
 - ak ("Akan")
 - sq ("Albanian")
 - am ("Amharic")
 - ar ("Arabic")
 - hy ("Armenian")
 - as ("Assamese")
 - ay ("Aymara")
 - az ("Azerbaijani")
 - bn ("Bangla")
 - eu ("Basque")
 - be ("Belarusian")
 - bho ("Bhojpuri")
 - bs ("Bosnian")
 - bg ("Bulgarian")
 - my ("Burmese")
 - ca ("Catalan")
 - ceb ("Cebuano")
 - zh-Hans ("Chinese (Simplified)")
 - zh-Hant ("Chinese (Traditional)")
 - co ("Corsican")
 - hr ("Croatian")
 - cs ("Czech")
 - da ("Danish")
 - dv ("Divehi")
 - nl ("Dutch")
 - en ("English")
 - eo ("Esperanto")
 - et ("Estonian")
 - ee ("Ewe")
 - fil ("Filipino")
 - fi ("Finnish")
 - fr ("French")
 - gl ("Galician")
 - lg ("Ganda")
 - ka ("Georgian")
 - de ("German")
 - el ("Greek")
 - gn ("Guarani")
 - gu ("Gujarati")
 - ht ("Haitian Creole")
 - ha ("Hausa")
 - haw ("Hawaiian")
 - iw ("Hebrew")
 - hi ("Hindi")
 - hmn ("Hmong")
 - hu ("Hungarian")
 - is ("Icelandic")
 - ig ("Igbo")
 - id ("Indonesian")
 - ga ("Irish")
 - it ("Italian")
 - ja ("Japanese")
 - jv ("Javanese")
 - kn ("Kannada")
 - kk ("Kazakh")
 - km ("Khmer")
 - rw ("Kinyarwanda")
 - ko ("Korean")
 - kri ("Krio")
 - ku ("Kurdish")
 - ky ("Kyrgyz")
 - lo ("Lao")
 - la ("Latin")
 - lv ("Latvian")
 - ln ("Lingala")
 - lt ("Lithuanian")
 - lb ("Luxembourgish")
 - mk ("Macedonian")
 - mg ("Malagasy")
 - ms ("Malay")
 - ml ("Malayalam")
 - mt ("Maltese")
 - mi ("Māori")
 - mr ("Marathi")
 - mn ("Mongolian")
 - ne ("Nepali")
 - nso ("Northern Sotho")
 - no ("Norwegian")
 - ny ("Nyanja")
 - or ("Odia")
 - om ("Oromo")
 - ps ("Pashto")
 - fa ("Persian")
 - pl ("Polish")
 - pt ("Portuguese")
 - pa ("Punjabi")
 - qu ("Quechua")
 - ro ("Romanian")
 - ru ("Russian")
 - sm ("Samoan")
 - sa ("Sanskrit")
 - gd ("Scottish Gaelic")
 - sr ("Serbian")
 - sn ("Shona")
 - sd ("Sindhi")
 - si ("Sinhala")
 - sk ("Slovak")
 - sl ("Slovenian")
 - so ("Somali")
 - st ("Southern Sotho")
 - es ("Spanish")
 - su ("Sundanese")
 - sw ("Swahili")
 - sv ("Swedish")
 - tg ("Tajik")
 - ta ("Tamil")
 - tt ("Tatar")
 - te ("Telugu")
 - th ("Thai")
 - ti ("Tigrinya")
 - ts ("Tsonga")
 - tr ("Turkish")
 - tk ("Turkmen")
 - uk ("Ukrainian")
 - ur ("Urdu")
 - ug ("Uyghur")
 - uz ("Uzbek")
 - vi ("Vietnamese")
 - cy ("Welsh")
 - fy ("Western Frisian")
 - xh ("Xhosa")
 - yi ("Yiddish")
 - yo ("Yoruba")
 - zu ("Zulu")

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

jdepoix commented 1 year ago

Hi @eyedu, I think the error message is fairly descriptive here: because you did not specify which language you want, the English transcript is requested by default, but this video has no English transcript. Just add the language you want:

YouTubeTranscriptApi.get_transcript("4HLSBvlv0Ug", languages=['zh-TW'])
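
As a rough sketch of the same idea, the snippet below passes several language codes in priority order and, separately, requests an English translation of the Chinese transcript. It assumes the class-method interface visible in the traceback above (get_transcript, list_transcripts, find_transcript, fetch) plus the library's translate method on a transcript; newer releases of the package may expose a different interface. The function names and the fallback codes after "zh-TW" are illustrative assumptions.

```python
# Language codes to try, in priority order. "zh-TW" is the manually created
# transcript listed in the error output; the other two codes are assumed
# fallbacks that may apply to other videos.
PREFERRED_LANGUAGES = ("zh-TW", "zh-Hant", "zh-Hans")


def fetch_transcript(video_id, languages=PREFERRED_LANGUAGES):
    """Fetch the first transcript that matches one of the language codes."""
    # Imported inside the function so this module can be loaded even where
    # the package is not installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    return YouTubeTranscriptApi.get_transcript(video_id, languages=languages)


def fetch_english_translation(video_id, source_languages=PREFERRED_LANGUAGES):
    """Find a transcript in a source language and fetch it translated to English."""
    from youtube_transcript_api import YouTubeTranscriptApi

    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
    # Same lookup that get_transcript performs internally (see the traceback).
    transcript = transcript_list.find_transcript(source_languages)
    # Only works for transcripts marked [TRANSLATABLE] in the listing.
    return transcript.translate("en").fetch()
```

With this in place, fetch_transcript("4HLSBvlv0Ug") would request the zh-TW transcript first instead of the English default that caused the error.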