jdepoix / youtube-transcript-api

This is a Python API that lets you get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!
MIT License

Enhance Timestamp Duration for More Content in Transcript Segments #251

Closed chiragksharma closed 5 months ago

chiragksharma commented 5 months ago

Discussed in https://github.com/jdepoix/youtube-transcript-api/discussions/250

Originally posted by **chiragksharma** January 18, 2024

### Issue Description

Currently, the transcript segments generated from YouTube videos cover time spans that are too brief, often containing only a single line or a very short paragraph of text. This leads to an excessive number of timestamps with minimal content under each, which is inconvenient for users who want more substantial text per segment.

### Example of Current Output

```
[0.16] just one year after launch chat GPT has
[2.56] well over 100 million daily active users
[5.279] and next week open AI is opening the
[7.04] floodgates allowing developers to profit
[9.12] on their platform by selling custom GPT
[11.44] agents all you have to do is convince 1%
[13.48] of the user base to pay you $1 per month
```

### Desired Output

```
[00:00](https://www.youtube.com/watch?v=undefined&t=0s) I'm Arvind Srinivas. I'm the co-founder and CEO of perplexity AI. Perplexity is a conversational and search engine that aims to deliver answers to you, to whatever questions you may ask. We are trying to revolutionize how people consume information online. Instead of getting ten blue links, they can just ask questions in natural language and just get it answered instantly.

[00:19](https://www.youtube.com/watch?v=undefined&t=19s) And we launch the product on December 7th, 2022. We have like about 10 million monthly active users at this point. It's basically grown thousand X over a period of one year. So I grew up in India, studied in one of the IIT's there, and I was really into algorithms programing ever since the beginning.

[00:47](https://www.youtube.com/watch?v=undefined&t=47s) A friend of mine told me about a machine learning contest, which I didn't even know what machine learning was, what? All they told me was, hey, there's this data set and you can figure out a way to predict the output given the input. And it was fun. And I won the contest and I didn't spend a lot of time on it, and it came more naturally.

[01:02](https://www.youtube.com/watch?v=undefined&t=62s) So I decided to go deeper into it. And I went and did my PhD in Berkeley on AI and deep learning. I worked at OpenAI in 2018 summer as a research intern. I thought I was good, okay, I did really well in India. I came to Berkeley. I'm like, definitely one of the top AI PhD students. And then I went to OpenAI and I felt like really bad because people were so much better than me.

[01:22](https://www.youtube.com/watch?v=undefined&t=82s) It was a big reality check that, okay, I could improve a lot more in programing. I could improve a lot more in first principles. Thinking my clarity of thoughts. After an internship at OpenAI in 2018, that was when GPT 1 was published. We realized that there is this new form of learning using all the internet data and learning from it, and I figured that was going to be more important.

[01:41](https://www.youtube.com/watch?v=undefined&t=101s) So I told my advisor that this is the right thing to do. We should go work on this. And he was actually like pretty open minded and said, okay, you know what? Like I'm not a specialist here, but let's try. I mean, if this is the next thing, the best way to learn a new topic is to force yourself to teach it to others.
```

Can we get this link as well, so that clicking it takes the user to that timestamp? I am building a Flask app using this API and am a beginner at writing code. Please help me solve this problem; I have been stuck on it for a long time.

### My current code

```
from youtube_transcript_api import YouTubeTranscriptApi


def get_transcript(video_id):
    # Request the transcript, trying English first and then Hindi
    transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=['en', 'hi'])
    transcript = ''
    for segment in transcript_list:
        start_time = segment['start']
        # One entry per segment: "[start] text", followed by a blank line
        transcript += f"[{start_time}] {segment['text']}\n\n"
    return transcript
```

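The clickable timestamps in the desired output could be produced along the following lines. This is only a minimal sketch, not part of youtube-transcript-api: `format_timestamp`, `timestamp_link`, and `format_segments` are hypothetical helper names, each segment is assumed to be a dict with `start` and `text` keys as in the code above, and the link relies on the `t=<seconds>s` parameter seen in the desired output.

```python
def format_timestamp(seconds: float) -> str:
    """Render a start offset in seconds as MM:SS (or H:MM:SS past one hour)."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def timestamp_link(video_id: str, seconds: float) -> str:
    """Build a Markdown link that opens the video at the given offset."""
    url = f"https://www.youtube.com/watch?v={video_id}&t={int(seconds)}s"
    return f"[{format_timestamp(seconds)}]({url})"


def format_segments(video_id: str, segments) -> str:
    """Prefix every segment's text with a clickable timestamp."""
    return "\n\n".join(
        f"{timestamp_link(video_id, seg['start'])} {seg['text']}"
        for seg in segments
    )
```

Passing the real `video_id` into the link avoids the `v=undefined` seen in the desired output. The result is Markdown, so a Flask app would still need to render it to HTML before the links become clickable.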
jdepoix commented 5 months ago

Hi @chiragksharma, I am sorry, but this is neither an issue with this module nor a feature request. It is more of a general programming question, so I will close this issue. It is fine to discuss this in the Discussions section, or you could ask on StackOverflow, as this really is not related to this module. However, you should be able to find all the information you need to implement this on StackOverflow without creating a new question.

chiragksharma commented 5 months ago

Hey @jdepoix, thank you for your response and guidance on my query. I understand that my question falls outside the typical issues or feature requests for this module. However, I'd like to propose a documentation enhancement based on my recent experience with the module.

Suggestion:

I propose adding to the documentation an example of a custom formatter that lets users adjust how much time each transcript segment covers. This could serve as a useful reference for future users who want to customize their transcript format beyond the standard options.
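A rough sketch of what such an example might look like follows. It is a standalone class with hypothetical names (`ChunkedTextFormatter`, `min_seconds`, `group`), not code taken from the library. It assumes each segment is a dict with `start` and `text` keys and, optionally, a `duration` key. The `format_transcript` method name is chosen so the class could later be adapted to the library's formatter interface, but that wiring is an assumption and is not shown here.

```python
class ChunkedTextFormatter:
    """Merge short transcript segments into chunks spanning at least `min_seconds`."""

    def __init__(self, min_seconds: float = 20.0):
        self.min_seconds = min_seconds

    def group(self, transcript):
        """Return a list of {'start', 'text'} dicts, one per merged chunk."""
        chunks = []
        current = None
        for segment in transcript:
            if current is None:
                # A new chunk starts where its first segment starts
                current = {"start": segment["start"], "texts": []}
            current["texts"].append(segment["text"].strip())
            # Close the chunk once it covers the requested time span
            end = segment["start"] + segment.get("duration", 0.0)
            if end - current["start"] >= self.min_seconds:
                chunks.append(current)
                current = None
        if current is not None:
            # Keep the trailing chunk even if it is shorter than min_seconds
            chunks.append(current)
        return [
            {"start": c["start"], "text": " ".join(c["texts"])} for c in chunks
        ]

    def format_transcript(self, transcript, **kwargs):
        """Render merged chunks as "[start] text" entries separated by blank lines."""
        return "\n\n".join(
            f"[{c['start']:.2f}] {c['text']}" for c in self.group(transcript)
        )
```

For example, `ChunkedTextFormatter(min_seconds=30).format_transcript(transcript_list)` would produce roughly one entry per 30 seconds of video instead of one entry per caption line.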

Contribution:

I have already implemented such a function in my project and would be happy to contribute this to the documentation or the main formatters.py file. I believe this could save time for others facing similar requirements.

Could you please let me know if adding this as a pull request would be appropriate, and if so, would you prefer it in the documentation or the codebase?

Looking forward to your thoughts on this.