langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

WebScraper doesn't seem able to extract the transcript from YouTube video links; could you add a mini tool to help summarize YouTube videos easily? #2470

Closed Greatz08 closed 8 months ago

Greatz08 commented 9 months ago

Self Checks

1. Is this request related to a challenge you're experiencing?

:-) Maybe add a tool, or make WebScraper more advanced, to help with YouTube video summarization :-)

2. Describe the feature you'd like to see

:-))

3. How will this feature improve your workflow or experience?

:-0

4. Additional context or comments

No response

5. Can you help us with this feature?

dosubot[bot] commented 9 months ago

🤖

It seems like you've got this under control. If you want help or have specific questions, let me know what I can do for you!

Yeuoly commented 9 months ago

Yes, you're right. WebScraper scrapes the original HTML of a website and summarizes it if the visible content is too long. Summarizing a YouTube video requires a more complex, specialized tool. Before building one, I'd love to know a little more about your scenario.
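To illustrate why a plain HTML scraper misses the transcript, here is a generic visible-text extractor. It is a sketch, not Dify's WebScraper code. It keeps the page's text nodes, skips `<script>`/`<style>` contents, and flags text longer than an assumed `max_chars` threshold as needing summarization. On a YouTube watch page the transcript is typically loaded only after the transcript panel is opened, so it is not among the text nodes a scraper like this sees.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside script/style/noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(markup: str, max_chars: int = 4000) -> tuple[str, bool]:
    """Return (visible text, whether it exceeds the hypothetical summarization threshold)."""
    parser = VisibleTextParser()
    parser.feed(markup)
    parser.close()
    text = " ".join(parser.chunks)
    return text, len(text) > max_chars
```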

Greatz08 commented 9 months ago

@Yeuoly Yeah, I also figured the scraper can't extract the transcript by default, since it isn't present in the page HTML: you have to press the transcript button to see a video's transcript. I just wanted to test it anyway, hehe. My use case is simple: get a summary of any YouTube video I want.

My idea was to create a keybinding in my Hyprland tiling window manager that runs a Python script. The script would use the Dify API to pass a YouTube URL to Dify, and Dify would fetch the video information and summarize it automatically using a tool (which it doesn't have right now). The summary would be stored in a variable, and from there I could either put it into a decent HTML page that opens automatically in the browser, or show it in the terminal (I'm a Linux user, so I love the terminal :-), and with Hyprland it's fast to switch to a terminal and read). That's how I was thinking of automating this for my needs, because sometimes I'm short on time and just want the key points.

Many browser extensions already do this: they add a side panel to YouTube, use JavaScript to feed the transcript to ChatGPT with your ChatGPT cookies, and display ChatGPT's summary in that panel. Glarity is one such extension, and I think it's open source too, but it's ChatGPT-only. Some months ago I suggested that Dify build a browser extension integrating with the Dify API to achieve the same thing, so users wouldn't be restricted to ChatGPT and people would have more reason to use Dify, but unfortunately I didn't hear of any plans for it. I hope you understand what I want and how I would use it.

BTW, Flowise has implemented an option that lets users run custom JavaScript, which is a very powerful and useful feature. If the same thing were implemented in Dify, I believe it would be possible to extract YouTube transcripts and summarize them, along with many other use cases.
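A minimal sketch of the keybinding script idea, assuming a Dify chat app reached through the `POST /v1/chat-messages` service API endpoint in blocking mode, with the API key in a `DIFY_API_KEY` environment variable. The endpoint and field names (`inputs`, `query`, `response_mode`, `user`, `answer`) follow Dify's service API as we understand it; check the API page of your own app. The `--browser` flag, the `user` value and the script name are hypothetical. The script only sends the URL: the app itself still needs a way to get the transcript, which is exactly the missing tool this issue asks for.

```python
import html
import json
import os
import sys
import tempfile
import urllib.request
import webbrowser

DEFAULT_BASE_URL = "https://api.dify.ai/v1"  # or your self-hosted instance's /v1 URL


def build_request(video_url: str, api_key: str, base_url: str = DEFAULT_BASE_URL) -> urllib.request.Request:
    """Build a blocking POST to a Dify chat app's chat-messages endpoint (assumed API shape)."""
    payload = {
        "inputs": {},
        "query": f"Summarize the key points of this YouTube video: {video_url}",
        "response_mode": "blocking",  # wait for the full answer instead of streaming
        "user": "hyprland-keybinding",  # hypothetical end-user identifier
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat-messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def extract_answer(body: bytes) -> str:
    """Pull the model's reply out of a blocking chat-messages response."""
    return json.loads(body)["answer"]


def to_html_page(summary: str) -> str:
    """Wrap the summary in a minimal HTML page, escaping any markup it contains."""
    return f"<!doctype html><meta charset='utf-8'><title>Video summary</title><pre>{html.escape(summary)}</pre>"


def main(argv: list[str]) -> None:
    if len(argv) < 2:
        print("usage: summarize_video.py <youtube-url> [--browser]")
        return
    request = build_request(argv[1], os.environ["DIFY_API_KEY"], os.environ.get("DIFY_BASE_URL", DEFAULT_BASE_URL))
    with urllib.request.urlopen(request, timeout=120) as response:
        summary = extract_answer(response.read())
    if "--browser" in argv:
        # Write the page to a temporary file and open it in the default browser.
        with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as page:
            page.write(to_html_page(summary))
        webbrowser.open(f"file://{page.name}")
    else:
        print(summary)  # plain terminal output


if __name__ == "__main__":
    main(sys.argv)
```

Usage, e.g. from a keybinding: `python summarize_video.py "<youtube-url>" --browser`, or drop `--browser` to print the summary in the terminal.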

Yeuoly commented 9 months ago

It's a great idea, I think. However, summarizing a YouTube video requires several AI capabilities:

  1. Video understanding: for now, only Gemini Ultra supports this, but it costs a lot of tokens, and we don't have an API key yet. Anyway, we will soon add model abilities such as vision understanding as tools.
  2. OCR: extracting text from a single image is simple, but doing it for a whole video causes high CPU/GPU usage and takes a long time.
  3. Audio transcription: we can transform the audio into text so that the LLM can understand what the video is trying to convey.
  4. There is a LangChain tool called langchain.document_loaders.YoutubeLoader, but it is based on youtube_transcript_api, which only has transcripts for popular videos like MrBeast's. Still, this tool is simple to implement; you can try building it yourself by following our tutorial in about 30 minutes.

As you can see, a single tool can't handle all of these tasks; it takes several tools working together, and most of them require GPUs or APIs. Maybe GPT-4 could do it by combining different tools.

BTW, our workflow feature also has a sandbox environment where you can execute Python/Node.js code, which could also meet your needs!
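A sandbox code step could, for example, pull the video ID out of a YouTube URL before a transcript is fetched. This is a minimal sketch, assuming the Code node convention of a `main` function whose returned dict keys become the node's output variables (as in Dify's default code template, as far as we know). The `url` input and `video_id` output names are hypothetical and must match the variables configured on the node. It uses only the standard library and makes no network calls.

```python
from urllib.parse import parse_qs, urlparse


def main(url: str) -> dict:
    """Return the YouTube video ID from common URL shapes, or "" if none is found."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    video_id = ""
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>?t=...
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # Regular links: https://www.youtube.com/watch?v=<id>&...
            video_id = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/shorts/", "/embed/", "/live/")):
            # Path-style links: /shorts/<id>, /embed/<id>, /live/<id>
            video_id = parsed.path.split("/")[2]
    return {"video_id": video_id}
```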

Greatz08 commented 9 months ago

@Yeuoly I don't think we need to go that far and combine all of those capabilities to summarize a video. As I explained above, we simply need the transcript: the full text of what the YouTuber says in the video. YouTube generates it automatically for every video, and anyone can view it with a single click on the transcript button. What the speaker says is the only thing we need; from that alone we can summarize what the video is about and which topics it covers, so no heavy processing is required. I hope that makes sense.

There are websites and services where you can pass a YouTube video link directly and they automatically generate the transcript; we can fetch that transcript, feed it to Dify as context, and have it generate a summary easily. Some projects take this easy path, and Dify could too. For that, though, a tool would have to be built that recognizes YouTube links, fetches the transcript (either from an external transcript-service API or with a JavaScript or Python library that makes transcript fetching easy), feeds it into the chat, and asks for a summary. We'd need to think about which approach to take.

Yeuoly commented 9 months ago

Yep, I agree; a simple Python or JavaScript plugin could handle it. BTW, do you have any suggestions for which library to use?

Greatz08 commented 9 months ago

@Yeuoly https://pypi.org/project/youtube-transcript-api/ works perfectly fine; I tested it with a YouTube video ID, and you can try it yourself too. We can extract the ID from the YouTube URL and pass it to this library. It returns a list of dictionaries; we take the `text` value from each element and merge them to get the complete transcript for the video, then feed that to Dify.

The problem, I think, is that context size is limited, so generating a summary won't be easy. We'd have to build a knowledge base from the transcript in a smart way, so that when we ask a question, Dify refers to that knowledge base automatically and then generates the summary. All of this could be troublesome, though maybe I'm wrong; please share your opinion. This is a problem for those running Gemini with a limit of roughly 2,000+ tokens, or ChatGPT 3.5/4 with a maximum context of roughly 4,000 tokens. Those running higher-end models with 16k+ or 30k+ context windows can, I believe, pass the transcript directly to Dify and get a summary. So this feature still needs to be implemented; we just have to think more about the context issue for models with lower token limits.

(screenshot attached: 20240223_20h48m20s_grim)
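A sketch of the merge step described above, plus one way around small context windows. It assumes the segment format youtube-transcript-api returns: a list of dicts with `text`, `start` and `duration`. The text values are merged, then split into chunks under an assumed token budget. The ~4 characters per token ratio is a rough heuristic, not a real tokenizer. Each chunk could then be summarized separately and the partial summaries summarized once more. This common "map-reduce" pattern is a swapped-in alternative to the knowledge-base approach, not something Dify provides. The library's API differs between versions (older: `YouTubeTranscriptApi.get_transcript(video_id)`; 1.x: `YouTubeTranscriptApi().fetch(video_id)`), so `fetch_segments` checks which one exists.

```python
from typing import Iterable


def join_transcript(segments: Iterable[dict]) -> str:
    """Merge the "text" field of each transcript segment into one string."""
    return " ".join(s["text"].replace("\n", " ").strip() for s in segments if s["text"].strip())


def chunk_text(text: str, max_tokens: int = 3000, chars_per_token: int = 4) -> list[str]:
    """Split text on word boundaries into chunks of roughly max_tokens tokens each.

    The chars_per_token estimate is an assumption; real tokenizers vary by model.
    """
    limit = max_tokens * chars_per_token
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for word in text.split():
        # Start a new chunk if adding this word would exceed the character budget.
        if current and size + len(word) + 1 > limit:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks


def fetch_segments(video_id: str) -> list[dict]:
    """Fetch transcript segments (requires: pip install youtube-transcript-api, plus network access)."""
    from youtube_transcript_api import YouTubeTranscriptApi

    if hasattr(YouTubeTranscriptApi, "get_transcript"):  # pre-1.x releases
        return YouTubeTranscriptApi.get_transcript(video_id)
    return YouTubeTranscriptApi().fetch(video_id).to_raw_data()  # 1.x releases
```

Usage: `chunk_text(join_transcript(fetch_segments(video_id)))` yields pieces that each fit a small context window. A model with a large enough window could skip chunking and take the joined text directly.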

github-actions[bot] commented 8 months ago

Closing because this issue is no longer active. If you have any questions, you can reopen it.