jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!
MIT License

Results of list_transcripts are returned in Chinese and not in English #209

Closed MauroCSHPYP closed 1 year ago

MauroCSHPYP commented 1 year ago

To Reproduce

Steps to reproduce the behavior:

What code / cli command are you executing?

Here's the full code:

!pip install youtube-transcript-api
!pip install pytube

from pytube import YouTube
from youtube_transcript_api import YouTubeTranscriptApi

video_id = 'hD1YtmKXNb4'
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
print(transcript_list)

Results:

For this video (hD1YtmKXNb4) transcripts are available in the following languages:

(MANUALLY CREATED)
 - zh-Hant ("中文(繁體)")[TRANSLATABLE]
 - ja ("日文")[TRANSLATABLE]
 - hi ("印地文")[TRANSLATABLE]
 - es ("西班牙文")[TRANSLATABLE]
 - es-419 ("西班牙文(拉丁美洲)")[TRANSLATABLE]
 - fr ("法文")[TRANSLATABLE]
 - ar ("阿拉伯文")[TRANSLATABLE]
 - ru ("俄文")[TRANSLATABLE]
 - en-US ("英文(美國)")[TRANSLATABLE]
 - pt ("葡萄牙文")[TRANSLATABLE]
 - de ("德文")[TRANSLATABLE]
 - ko ("韓文")[TRANSLATABLE]

(GENERATED)
 - en ("英文 (自動產生)")[TRANSLATABLE]

(TRANSLATION LANGUAGES)
 - tr ("土耳其文")
 - tk ("土庫曼文")
 - lg ("干達文")
 - zh-Hant ("中文(繁體)")
 - zh-Hans ("中文(簡體)")
 - da ("丹麥文")
 - eu ("巴斯克文")
 - ja ("日文")
 - mi ("毛利文")
 - jv ("爪哇文")
 - eo ("世界文")
 - gl ("加利西亞文")
 - ca ("加泰蘭文")
 - nso ("北索托文")
 - gu ("古吉拉特文")
 - sw ("史瓦希里文")
 - ne ("尼泊爾文")
 - ny ("尼揚賈文")
 - gn ("瓜拉尼文")
 - be ("白俄羅斯文")
 - lt ("立陶宛文")
 - ig ("伊布文")
 - is ("冰島文")
 - hu ("匈牙利文")
 - id ("印尼文")
 - hi ("印地文")
 - ky ("吉爾吉斯文")
 - ay ("艾馬拉文")
 - fy ("西弗里西亞文")
 - es ("西班牙文")
 - hr ("克羅埃西亞文")
 - kn ("坎那達文")
 - iw ("希伯來文")
 - el ("希臘文")
 - hy ("亞美尼亞文")
 - az ("亞塞拜然文")
 - ta ("坦米爾文")
 - hmn ("孟文")
 - bn ("孟加拉文")
 - la ("拉丁文")
 - lv ("拉脫維亞文")
 - ln ("林加拉文")
 - fr ("法文")
 - bs ("波士尼亞文")
 - fa ("波斯文")
 - pl ("波蘭文")
 - fi ("芬蘭文")
 - ak ("阿坎文")
 - am ("阿姆哈拉文")
 - ar ("阿拉伯文")
 - sq ("阿爾巴尼亞文")
 - as ("阿薩姆文")
 - ru ("俄文")
 - bg ("保加利亞文")
 - sd ("信德文")
 - af ("南非荷蘭文")
 - kk ("哈薩克文")
 - cy ("威爾斯文")
 - co ("科西嘉文")
 - xh ("科薩文")
 - yo ("約魯巴文")
 - en ("英文")
 - dv ("迪維西文")
 - ee ("埃維文")
 - haw ("夏威夷文")
 - ku ("庫德文")
 - no ("挪威文")
 - pa ("旁遮普文")
 - th ("泰文")
 - te ("泰盧固文")
 - ht ("海地文")
 - uk ("烏克蘭文")
 - uz ("烏茲別克文")
 - ur ("烏都文")
 - ts ("特松加文")
 - zu ("祖魯文")
 - so ("索馬利文")
 - ms ("馬來文")
 - ml ("馬來亞拉姆文")
 - mk ("馬其頓文")
 - mr ("馬拉地文")
 - mg ("馬達加斯加文")
 - mt ("馬爾他文")
 - km ("高棉文")
 - ceb ("宿霧文")
 - cs ("捷克文")
 - sa ("梵文")
 - sn ("紹納文")
 - nl ("荷蘭文")
 - bho ("博傑普爾文")
 - ka ("喬治亞文")
 - su ("巽他文")
 - ti ("提格利尼亞文")
 - sk ("斯洛伐克文")
 - sl ("斯洛維尼亞文")
 - ps ("普什圖文")
 - fil ("菲律賓文")
 - vi ("越南文")
 - tg ("塔吉克文")
 - kri ("塞拉利昂克裏奧爾文")
 - st ("塞索托文")
 - sr ("塞爾維亞文")
 - om ("奧羅莫文")
 - yi ("意第緒文")
 - et ("愛沙尼亞文")
 - ga ("愛爾蘭文")
 - sv ("瑞典文")
 - it ("義大利文")
 - pt ("葡萄牙文")
 - si ("僧伽羅文")
 - ug ("維吾爾文")
 - mn ("蒙古文")
 - qu ("蓋楚瓦文")
 - ha ("豪撒文")
 - lo ("寮文")
 - de ("德文")
 - or ("歐迪亞文")
 - my ("緬甸文")
 - rw ("盧安達文")
 - lb ("盧森堡文")
 - ko ("韓文")
 - sm ("薩摩亞文")
 - ro ("羅馬尼亞文")
 - gd ("蘇格蘭蓋爾文")
 - tt ("韃靼文")

Which Python version are you using?

Python 3.10.11

Which version of youtube-transcript-api are you using?

youtube-transcript-api-0.6.0

Expected behavior

I expected to receive the results (i.e., the text) in English or in the configured language, instead of those Chinese characters. I checked the documentation, which explains how to set a proxy, but it is not clear to me where the language can be set, either on the list_transcripts function or on the youtube_transcript_api instance.

Example:

Instead of:

It should be:

Actual behaviour

Currently, the language of the list_transcripts output is Chinese (I believe). With this same code, I previously got the results in English.

jdepoix commented 1 year ago

Hi @MauroCSHPYP, what is the location the module is being called from? Is the code executed on a Chinese server or VPN? Either way, this is just the display name, so it shouldn't have any impact on the actual transcript which will be fetched.
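To illustrate why a localized display name should not affect which transcript is fetched, here is a minimal, self-contained sketch. It is not the library's code: `TranscriptInfo` and `find_by_code` are hypothetical stand-ins, and the only assumption is that selection is keyed on the language code (e.g. `en-US`) rather than on the display name shown in the listing.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class TranscriptInfo:
    """Hypothetical stand-in for one entry of the transcript listing."""
    language_code: str  # stable identifier, e.g. "en-US"
    language: str       # localized display name, may vary per request


def find_by_code(
    transcripts: Iterable[TranscriptInfo],
    preferred_codes: Sequence[str],
) -> Optional[TranscriptInfo]:
    """Return the first entry matching the preferred codes, in priority order."""
    entries = list(transcripts)
    for code in preferred_codes:
        for entry in entries:
            if entry.language_code == code:
                return entry
    return None


# Two entries copied from the output in this issue (Chinese display names).
listing = [
    TranscriptInfo("en-US", "英文(美國)"),
    TranscriptInfo("ja", "日文"),
]

# The lookup succeeds even though the display name is not in English.
match = find_by_code(listing, ["en", "en-US"])
print(match.language_code if match else "no match")
```

Because the lookup never reads the `language` field, the same entry is selected regardless of the language in which the display names are returned.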

MauroCSHPYP commented 1 year ago

Thanks for answering.

The code is executed on Google Colab, so I don't have the details about the VPN (if any) or additional information about it.

I suspect that the underlying code uses requests without setting a default language in the headers, so the language of the results is sometimes something other than English.

jdepoix commented 1 year ago

Hi @MauroCSHPYP,

that is a very good point! I thought this was location/cookie-based, but adding Accept-Language: en-US to the header fixes it. I will make a PR to make it default to English.

Thank you for the suggestion!
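The header-based fix described above can be sketched as follows. This is only an illustration using the standard library, not the code that went into the library: `build_watch_request` and `WATCH_URL` are hypothetical names, and it is an assumption that the relevant request targets the video's watch page.

```python
import urllib.request

# Assumed target of the request; the library may fetch a different URL.
WATCH_URL = "https://www.youtube.com/watch?v={video_id}"


def build_watch_request(video_id: str, language: str = "en-US") -> urllib.request.Request:
    """Build (but do not send) a request that asks for English responses."""
    return urllib.request.Request(
        WATCH_URL.format(video_id=video_id),
        # Without this header the server is free to pick a language itself.
        headers={"Accept-Language": language},
    )


request = build_watch_request("hD1YtmKXNb4")

# urllib stores header names capitalized as "Accept-language";
# HTTP header names are case-insensitive, so the server treats it the same.
print(request.get_header("Accept-language"))
```

The request is only constructed here, so the sketch runs without network access; sending it would require `urllib.request.urlopen(request)`.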

jdepoix commented 1 year ago

Fix has been released in v0.6.1