jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!
MIT License

youtubetranscript.com cc selection option #179

Open pasdesinfos opened 1 year ago

pasdesinfos commented 1 year ago

Is your feature request related to a problem? Please describe.

:( Unknown error: Could not retrieve a transcript for the video http://www.youtube.com/watch?v=oBfDbucxPU4! This is most likely caused by:

No transcripts were found for any of the requested language codes: ('en',)

For this video (oBfDbucxPU4) transcripts are available in the following languages:

(MANUALLY CREATED)
None

(GENERATED)
- es ("Spanish (auto-generated)")[TRANSLATABLE]

(TRANSLATION LANGUAGES)
- af ("Afrikaans") - ak ("Akan") - sq ("Albanian") - am ("Amharic") - ar ("Arabic") - hy ("Armenian") - as ("Assamese") - ay ("Aymara") - az ("Azerbaijani") - bn ("Bangla") - eu ("Basque") - be ("Belarusian") - bho ("Bhojpuri") - bs ("Bosnian") - bg ("Bulgarian") - my ("Burmese") - ca ("Catalan") - ceb ("Cebuano") - zh-Hans ("Chinese (Simplified)") - zh-Hant ("Chinese (Traditional)") - co ("Corsican") - hr ("Croatian") - cs ("Czech") - da ("Danish") - dv ("Divehi") - nl ("Dutch") - en ("English") - eo ("Esperanto") - et ("Estonian") - ee ("Ewe") - fil ("Filipino") - fi ("Finnish") - fr ("French") - gl ("Galician") - lg ("Ganda") - ka ("Georgian") - de ("German") - el ("Greek") - gn ("Guarani") - gu ("Gujarati") - ht ("Haitian Creole") - ha ("Hausa") - haw ("Hawaiian") - iw ("Hebrew") - hi ("Hindi") - hmn ("Hmong") - hu ("Hungarian") - is ("Icelandic") - ig ("Igbo") - id ("Indonesian") - ga ("Irish") - it ("Italian") - ja ("Japanese") - jv ("Javanese") - kn ("Kannada") - kk ("Kazakh") - km ("Khmer") - rw ("Kinyarwanda") - ko ("Korean") - kri ("Krio") - ku ("Kurdish") - ky ("Kyrgyz") - lo ("Lao") - la ("Latin") - lv ("Latvian") - ln ("Lingala") - lt ("Lithuanian") - lb ("Luxembourgish") - mk ("Macedonian") - mg ("Malagasy") - ms ("Malay") - ml ("Malayalam") - mt ("Maltese") - mi ("Māori") - mr ("Marathi") - mn ("Mongolian") - ne ("Nepali") - nso ("Northern Sotho") - no ("Norwegian") - ny ("Nyanja") - or ("Odia") - om ("Oromo") - ps ("Pashto") - fa ("Persian") - pl ("Polish") - pt ("Portuguese") - pa ("Punjabi") - qu ("Quechua") - ro ("Romanian") - ru ("Russian") - sm ("Samoan") - sa ("Sanskrit") - gd ("Scottish Gaelic") - sr ("Serbian") - sn ("Shona") - sd ("Sindhi") - si ("Sinhala") - sk ("Slovak") - sl ("Slovenian") - so ("Somali") - st ("Southern Sotho") - es ("Spanish") - su ("Sundanese") - sw ("Swahili") - sv ("Swedish") - tg ("Tajik") - ta ("Tamil") - tt ("Tatar") - te ("Telugu") - th ("Thai") - ti ("Tigrinya") - ts ("Tsonga") - tr ("Turkish") - tk ("Turkmen") - uk ("Ukrainian") - und ("Unknown Language") - ur ("Urdu") - ug ("Uyghur") - uz ("Uzbek") - vi ("Vietnamese") - cy ("Welsh") - fy ("Western Frisian") - xh ("Xhosa") - yi ("Yiddish") - yo ("Yoruba") - zu ("Zulu")

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

Describe the solution you'd like

When auto-generated subtitles are available, they should be translated to en and transcribed by default.

Describe alternatives you've considered

A cc selection option.

Additional context

n/a

erseco commented 1 year ago

Same error here. Maybe adding an option to select language solves the problem :)

ghost commented 1 year ago

yeah same here, option to select would be good.

jdepoix commented 1 year ago

Hi @pasdesinfos, I definitely see the use case for a feature where transcripts are auto-translated if they are not available in the requested language. However, this should not be the default. As this module is commonly used to train/validate Machine Learning models, translating the transcripts will introduce another variable into the data quality, which the user should always be aware of (by opting into it).

I actually thought about introducing this as an optional feature before, but there is an implementation detail that stopped me from doing so: if we want to automatically translate to the user-requested language, which transcript do we choose to translate from (if there are multiple)? Depending on the transcript we are translating from, the quality of the output will vary. A few things to consider:

So which heuristic for choosing the transcript to translate from is most likely to yield the highest-quality transcript? Any thoughts on this?

ghost commented 1 year ago

@jdepoix First of all, I don't know what it means to translate transcripts, but the ASRs created in Turkish were understandable, if not completely accurate.

erseco commented 1 year ago

Hi, IMHO the problem arises when the main language of the video is not English. @toprak, @pasdesinfos and I are talking about adding an option (or doing it automatically) to get the video's original auto-generated subtitles, not about translating them. If you try any Spanish video, such as this one: https://youtubetranscript.com/?v=Dby0_0vdr30, you will see the error. In the CLI tool you have to set the language to Spanish to get the correct transcript.

Hope this explains the use case, best regards!
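To illustrate the behaviour described above (use the video's own auto-generated transcript when no transcript exists in the requested language), here is a minimal sketch. It is not the library's implementation: `TranscriptInfo` and `pick_original_transcript` are hypothetical names, and the sketch only assumes that each listed transcript exposes a `language_code` and an `is_generated` flag, as the error output in this thread suggests.

```python
from dataclasses import dataclass


@dataclass
class TranscriptInfo:
    """Hypothetical stand-in for one entry of a video's transcript listing."""
    language_code: str
    is_generated: bool


def pick_original_transcript(transcripts, preferred=("en",)):
    """Return a transcript in a preferred language if one exists,
    otherwise the first auto-generated transcript, otherwise None."""
    transcripts = list(transcripts)
    # First honour the explicitly requested languages, in order.
    for code in preferred:
        for transcript in transcripts:
            if transcript.language_code == code:
                return transcript
    # Fall back to the auto-generated transcript in the video's own language.
    for transcript in transcripts:
        if transcript.is_generated:
            return transcript
    return None


# A Spanish video with only an auto-generated Spanish transcript.
available = [TranscriptInfo("es", is_generated=True)]
chosen = pick_original_transcript(available)
print(chosen.language_code)  # es
```

With this fallback, a request for `en` on a Spanish-only video would return the Spanish transcript instead of raising an error. Whether that should happen silently or only on request is a design decision the sketch does not settle.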

jdepoix commented 1 year ago

Hi @toprak and @erseco, I think what you are asking for is something different and it already is documented as a feature request in #133. To my understanding, @pasdesinfos is asking about a feature where the transcripts are automatically translated to the requested language if no transcripts are available in that language. Could you maybe clarify @pasdesinfos to make sure we are on the same page here?

pasdesinfos commented 1 year ago

Hi @jdepoix @toprak @erseco,

I trust everything is well.

That's right, @jdepoix. For instance, for the video https://youtu.be/BOKqyl0VT7A, the output at https://youtubetranscript.com/?v=BOKqyl0VT7A indicates "No transcripts were found for any of the requested language codes: ('en',)", yet it also shows that "transcripts are available in the following languages: (MANUALLY CREATED) None (GENERATED) - fr ("French (auto-generated)")[TRANSLATABLE] ".

Could the heuristic be to obtain, by default, the auto-translated English version when a GENERATED transcript exists and is TRANSLATABLE? The output ":( Unknown error" would then appear only when no transcripts exist at all.

Kind regards to everyone!
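The heuristic proposed above can be sketched as follows, with translation kept opt-in as @jdepoix requested. This is only an illustration under stated assumptions: `FakeTranscript` and `resolve_transcript` are hypothetical names, the objects merely mimic a transcript with `language_code`, `is_generated`, `is_translatable` and a `translate()` method, and preferring manually created sources over generated ones is one possible answer to the open question of which transcript to translate from, not something settled in this thread.

```python
from dataclasses import dataclass


@dataclass
class FakeTranscript:
    """Hypothetical stand-in that mimics the attributes used below."""
    language_code: str
    is_generated: bool
    is_translatable: bool

    def translate(self, target):
        # Model a translation as a new transcript in the target language.
        return FakeTranscript(target, self.is_generated, False)


def resolve_transcript(transcripts, wanted="en", allow_translation=False):
    """Return a transcript in `wanted`, translating only if opted in."""
    transcripts = list(transcripts)
    # 1. Exact language match; manually created sorts before generated.
    exact = [t for t in transcripts if t.language_code == wanted]
    if exact:
        return sorted(exact, key=lambda t: t.is_generated)[0]
    # 2. Without the opt-in, report that nothing is available.
    if not allow_translation:
        return None
    # 3. Translate from a translatable source, manual sources first
    #    (an assumption of this sketch).
    sources = [t for t in transcripts if t.is_translatable]
    if not sources:
        return None
    source = sorted(sources, key=lambda t: t.is_generated)[0]
    return source.translate(wanted)


# The French example: only an auto-generated, translatable transcript.
available = [FakeTranscript("fr", is_generated=True, is_translatable=True)]
print(resolve_transcript(available))  # None
print(resolve_transcript(available, allow_translation=True).language_code)  # en
```

Because the default is `allow_translation=False`, callers who use transcripts as training or validation data are never handed a translated transcript without asking for it.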

p-toni commented 1 year ago

Hi all. Same here, only if the YT source isn't in EN. As mentioned, just a selector can handle it.

pasdesinfos commented 1 year ago

Hi @jdepoix, @toprak, @erseco, @toniseldr,

I wanted to take a moment to express my heartfelt gratitude to each of you for your invaluable contributions, unwavering dedication, commitment, and hard work. Your efforts have truly made a significant impact in making lives more wonderful. 🙏🎉

I mean, let's be honest here, without your brilliance, I'd probably be lost in a sea of confusion and chaos. 🌊😅

With self-deprecating humor and sincere appreciation, @pasdesinfos 😄🙌

jdepoix commented 1 year ago

Hi @pasdesinfos,

thank you very much for the kind words! 😊

However, this hasn't been implemented so I think it is okay for the ticket to stay open. Although I am not actively working on this, it might be something that someone wants to contribute to!

MarouaneZhani commented 1 month ago

Hi, I'm getting the same error with the following video: https://www.youtube.com/watch?v=EtpRcefOD6M, even if I specify the correct language 'de' in the languages parameter:

from llama_index.readers.youtube_transcript import YoutubeTranscriptReader

loader = YoutubeTranscriptReader()
documents = loader.load_data(
    ytlinks=['https://www.youtube.com/watch?v=EtpRcefOD6M'],
    languages=["de", "en"],
)

Do you have any idea how this can be solved?

jdepoix commented 1 month ago

Hi @MarouaneZhani, what is the exact error message you are getting?

MarouaneZhani commented 1 month ago

Hi @jdepoix
Sorry, I already got it running by using "de-DE" in languages. The error I was getting was: Could not retrieve a transcript for the video https://www.youtube.com/watch?v=EtpRcefOD6M This is most likely caused by: No transcripts were found for any of the requested language codes.

I saw somewhere in the error that the available language code was something like "de-DE", and it worked once I tried it!

Thanks Marouane
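The "de" versus "de-DE" mismatch above comes down to exact string matching on language codes. As a workaround sketch (not part of the library or of llama_index), a small helper can expand each requested code to the regional variants a video actually offers before the codes are passed on. `expand_language_codes` is a hypothetical name, and the list of available codes is assumed to have been read from the error message or a transcript listing.

```python
def expand_language_codes(requested, available):
    """For each requested code, return exact matches first, followed by
    available regional variants (e.g. 'de-DE' for 'de'), in request order."""
    result = []
    for code in requested:
        base = code.lower()
        exact = [a for a in available if a.lower() == base]
        variants = [a for a in available if a.lower().startswith(base + "-")]
        for candidate in exact + variants:
            if candidate not in result:
                result.append(candidate)
    return result


# The video only offers 'de-DE', while the caller asked for 'de' and 'en'.
print(expand_language_codes(["de", "en"], ["de-DE", "fr"]))  # ['de-DE']
```

The returned list could then be used as the `languages` argument shown earlier in this thread, so that a request for "de" also finds a transcript published as "de-DE".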