langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

use youtube chapter as hints and metadata in the youtube loader #7366

Closed thiswillbeyourgithub closed 1 week ago

thiswillbeyourgithub commented 1 year ago

Feature request

When using the YouTube loader, I think it would be useful to take the chapters into account, if present.

  1. The chapter timecodes could be used to decide where to chunk. Every chunk inside a chapter's timeframe could also carry the same "youtube_chapter_title" metadata.
  2. The name of the chapter could be added directly inside the transcript, for example as a markdown header. This could help an LLM maintain context over time.
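
To make the two points above concrete, here is a minimal sketch in plain Python. It is not the loader's actual code. It assumes transcript segments arrive as dicts with `text` and `start` (seconds) keys, and that chapters are already available as a sorted list of `(start_seconds, title)` tuples. The function name `chunk_by_chapter`, the chapter list format, and the plain-dict output are all hypothetical, chosen only for illustration.

```python
from bisect import bisect_right


def chunk_by_chapter(segments, chapters):
    """Group transcript segments into one chunk per chapter.

    segments: list of {"text": str, "start": float} dicts (assumed shape).
    chapters: list of (start_seconds, title) tuples sorted by start time
              (hypothetical input; how chapters are obtained is out of scope).
    """
    starts = [start for start, _ in chapters]
    chunks = []
    for seg in segments:
        # Index of the last chapter that starts at or before this segment;
        # -1 means the segment comes before the first chapter.
        idx = bisect_right(starts, seg["start"]) - 1
        if chunks and chunks[-1]["chapter_index"] == idx:
            chunks[-1]["texts"].append(seg["text"])
        else:
            chunks.append(
                {"chapter_index": idx, "start": seg["start"], "texts": [seg["text"]]}
            )

    docs = []
    for chunk in chunks:
        idx = chunk["chapter_index"]
        title = chapters[idx][1] if idx >= 0 else None
        # Point 2: put the chapter title in the text as a markdown header.
        header = f"## {title}\n\n" if title else ""
        docs.append(
            {
                "page_content": header + " ".join(chunk["texts"]),
                # Point 1: every chunk carries its chapter title as metadata.
                "metadata": {"youtube_chapter_title": title, "start": chunk["start"]},
            }
        )
    return docs
```

A chunk that is still too long could then be split further by the usual text splitter, copying the same metadata onto each piece.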

Motivation

There is useful information in YouTube chapter titles and timecodes that could be of use to LLMs.

Summaries of transcripts would probably be of higher quality if headers were present rather than a huge wall of text.

Adding metadata is always a win.

Your contribution

Unfortunately I'm not able to help for the time being, but I wanted to get the idea out there.

AmanSal1 commented 1 year ago

@thiswillbeyourgithub Can I give it a shot?

thiswillbeyourgithub commented 1 year ago

I'm hardly in a position to lead anything but sure, absolutely. Thanks a lot. I can happily give an opinion and light review of the code though. Thanks again!

AmanSal1 commented 1 year ago

@thiswillbeyourgithub Oh okay! I have only recently started contributing to open source, and I really want to contribute to LangChain. Do you happen to know how and where issues are assigned? I'm not very familiar with the repository yet.

thiswillbeyourgithub commented 1 year ago

I think you just have to familiarize yourself with the contributing guidelines and make a PR :)

AmanSal1 commented 1 year ago

@thiswillbeyourgithub So if we find an issue that nobody is working on, we can submit a PR directly without being assigned, right?

thiswillbeyourgithub commented 1 year ago

I think so yeah.

dosubot[bot] commented 11 months ago

Hi, @thiswillbeyourgithub! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you requested a feature to use YouTube chapters as hints and metadata in the YouTube loader. This would involve using chapter timecodes and titles to improve the quality of summarized transcripts by adding headers and maintaining context over time. You mentioned that you are unable to contribute to the implementation at the moment but wanted to share the idea.

I noticed that AmanSal1 has expressed interest in working on this feature and asked for guidance on how to contribute. You responded by suggesting that they familiarize themselves with the contributing guidelines and make a pull request. AmanSal1 also asked if they can submit a PR for an unassigned issue, and you confirmed that it is possible.

If this issue is still relevant to the latest version of the LangChain repository, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

Best regards, Dosu

thiswillbeyourgithub commented 11 months ago

Yes this is still relevant

dosubot[bot] commented 11 months ago

@baskaryan Could you please help @thiswillbeyourgithub with this issue? They have indicated that it is still relevant. Thank you!

dosubot[bot] commented 8 months ago

Hi, @thiswillbeyourgithub,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue requests the use of YouTube chapter information in the YouTube loader to improve the quality of summarized transcripts. You had mentioned that you are unable to contribute at the moment but wanted to share the idea. A user named AmanSal1 has expressed interest in working on this feature and asked for guidance on how to contribute.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

thiswillbeyourgithub commented 8 months ago

I still think incorporating chapters as metadata is a valuable feature. Or even, if someone manages it, including chapter transitions directly in the text using timestamps.

iamuv2000 commented 5 months ago

@thiswillbeyourgithub I'd love to give this a shot. I modified this to extract the description, and I think a bit of regex should let me extract the timestamps from the description, if available.
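
A rough sketch of what that regex step might look like, assuming the description lists chapters one per line in the common `0:00 Title` or `1:02:03 Title` form. The pattern and the helper name `parse_chapters` are illustrative assumptions, not code from any PR, and the sketch does not check whether YouTube itself would treat the lines as valid chapters.

```python
import re

# One chapter per line: optional hours, then minutes:seconds, then a title.
# An optional "-" between the timestamp and the title is tolerated.
CHAPTER_LINE = re.compile(
    r"^[ \t]*(?:(\d{1,2}):)?(\d{1,2}):(\d{2})[ \t]+(?:-[ \t]*)?(.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_chapters(description):
    """Return a list of (start_seconds, title) tuples found in a description."""
    chapters = []
    for hours, minutes, seconds, title in CHAPTER_LINE.findall(description):
        # The hours group is an empty string when the timestamp is only mm:ss.
        start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        chapters.append((start, title))
    return chapters
```

Lines that do not begin with a timestamp are ignored, so an ordinary description without chapters simply yields an empty list.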

jonespm commented 3 months ago

It looks like someone already opened a PR for this feature in youtube-transcript-api (which I believe this uses). I'm not sure how active the maintainer of that project is. https://github.com/jdepoix/youtube-transcript-api/pull/254