jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!
MIT License

Getting transcript in the original language #252

Closed mgoldenbe closed 4 months ago

mgoldenbe commented 5 months ago

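What is the most straightforward way to get the transcript in the language spoken in the video? Is it to first get the list of auto-generated transcripts? If so:

Can I be certain that the list of auto-generated transcripts will contain exactly one element?

What can I do when the author removed the auto-generated transcript, such as for this video?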

GorujoCY commented 5 months ago

This kind of post belongs in Discussions or on Stack Overflow rather than in an issue... this is a question, not an issue.

mgoldenbe commented 5 months ago

@GorujoCY I only see the "New Issue" button. How do I open a discussion?

jdepoix commented 4 months ago

Hi @mgoldenbe, there are three questions here, I think:

What is the most straightforward way to get transcript in the language spoken in the video?

There is none, unfortunately. Currently English is always used as the default language. There is an issue open for changing the default behaviour to return the default transcript of the video (#133), but it hasn't been implemented yet. However, even this wouldn't guarantee that the transcript you get is in the language spoken in the video.
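If you already know (or have guessed) the language you want, a minimal sketch of asking for it explicitly could look like the following. It assumes a release of the library that still exposes the static `YouTubeTranscriptApi.get_transcript(video_id, languages=...)` call; newer releases may have a different interface. The helper names `language_preference` and `fetch_transcript` are made up for this illustration.

```python
from typing import List, Optional


def language_preference(preferred: Optional[str], fallback: str = "en") -> List[str]:
    """Build an ordered list of language codes: preferred first, then the fallback."""
    codes = [code for code in (preferred, fallback) if code]
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(codes))


def fetch_transcript(video_id: str, preferred: Optional[str] = None):
    """Fetch a transcript, trying the preferred language before English.

    Assumption: the installed version provides the static get_transcript method.
    """
    # Imported lazily so language_preference works without the library installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    return YouTubeTranscriptApi.get_transcript(
        video_id, languages=language_preference(preferred)
    )
```

Calling `fetch_transcript(video_id, "de")` would then try German first and fall back to English, which mirrors the English default described above.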

Can I be certain that the list of auto-generated transcripts will contain exactly one element?

To be honest, I don't know. This module just pulls information from YouTube, and it's hard to give guarantees about anything YouTube is doing. My guess would be that there are some cases where there could be multiple, but in most there's only one. In the cases where there's only one, you could use it as a hint towards which language is spoken in the video. But my experience in working on this module has been that there's basically an exception for everything, so there will most certainly be some weird cases where this logic doesn't work out. Feel free to share your findings if you play around with this!
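The "exactly one auto-generated transcript" heuristic could be sketched roughly like this. It is only a hint, not a guarantee, for the reasons given above. The sketch assumes the transcript objects expose `is_generated` and `language_code` attributes and that the installed release still has the static `YouTubeTranscriptApi.list_transcripts` method; the function names are hypothetical.

```python
from typing import Iterable, Optional


def infer_spoken_language(transcripts: Iterable) -> Optional[str]:
    """Return the language code of the only auto-generated transcript.

    Returns None when there are zero or several auto-generated transcripts,
    because the heuristic gives no usable hint in those cases.
    """
    generated = [t.language_code for t in transcripts if t.is_generated]
    if len(generated) == 1:
        return generated[0]
    return None


def spoken_language_for_video(video_id: str) -> Optional[str]:
    """Apply the heuristic to a real video (needs network access).

    Assumption: the installed version provides the static list_transcripts method.
    """
    # Imported lazily so infer_spoken_language can be used without the library.
    from youtube_transcript_api import YouTubeTranscriptApi

    return infer_spoken_language(YouTubeTranscriptApi.list_transcripts(video_id))
```

Returning `None` for the ambiguous cases keeps the caller from treating a guess as a fact; what to do then (for example, falling back to English) is left to the caller.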

What can I do when the author removed the auto-generated transcript, such as for this video?

There's nothing you can do here, I think.

I will close this now, as there's not really a fixable issue here, but feel free to update here if you play around with inferring the language from the auto-generated transcripts, or create a discussion as others have suggested (go to https://github.com/jdepoix/youtube-transcript-api/discussions and press the "New Discussion" button in the top right).