patterns-ai-core / langchainrb

Build LLM-powered applications in Ruby
https://rubydoc.info/gems/langchainrb
MIT License

[Feature Request] Summarization toolkit and examples #59

Closed · ProGM closed this 1 week ago

ProGM commented 1 year ago

One feature I would love to have in Langchain.rb that may be super-useful is summarization:

https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html

I don't think it's super hard to implement (at least a base version of it).

andreibondarev commented 1 year ago

@ProGM Hmm... we can definitely add a def summarize(text:) method to every LLM class.

To list them out:

  1. Cohere already has a Summarize endpoint (usage sketched just after this list): https://github.com/andreibondarev/cohere-ruby#summarize
  2. OpenAI has a few prompt-driven examples (search for "summary/summari..."): https://platform.openai.com/examples. I personally kind of like the "TL;DR" method. Do you have a preference?
  3. Google PaLM summarization would be prompt-driven as well.
  4. Hugging Face has a ton of different models focusing on summarization. Have you tried any of these by chance? https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
  5. Same with Replicate -- most likely prompt-driven summarization would be needed.
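
For reference, a minimal sketch of hitting Cohere's summarize endpoint through the cohere-ruby gem (client setup per that gem's README; exact parameter names may differ):

require "cohere"

# Minimal sketch based on the cohere-ruby README; parameters may differ.
client = Cohere::Client.new(api_key: ENV["COHERE_API_KEY"])

long_text = File.read("article.txt")
puts client.summarize(text: long_text)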

When I say prompt-driven, I mean that we'd build something like the following prompt:

Write a concise summary of the following:

#{text_to_be_summarized}

CONCISE SUMMARY:

... and pass this to the LLM. Btw -- this prompt was taken from here.
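
A minimal sketch of what that shared method could look like (here, complete is just a stand-in for whatever completion call each concrete LLM class exposes):

# Sketch only: complete is a placeholder for the concrete LLM's completion call.
def summarize(text:)
  prompt = <<~PROMPT
    Write a concise summary of the following:

    #{text}

    CONCISE SUMMARY:
  PROMPT

  complete(prompt: prompt)
end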

@ProGM What are your thoughts?

ProGM commented 1 year ago

@andreibondarev Not sure if this should be just a method of the LLM classes.

When I say toolkit, I mean the full set of things: a) a method on the LLM classes, b) a set of strategies (stuff, map_reduce, refine), and c) a way to use it in combination with other stuff.

The cool feature you get in Python LangChain is that you can configure the ready-to-use summarize chain, declare it as a tool, and use it in a chain-of-thought agent.

Something like (pseudo-ruby-code):

summarization_tool = Langchain::Tool.new(
  name: 'summarizer tool',
  function: Langchain::Summarizer.new(strategy: :map_reduce),
  description: 'This tool can be used to summarize a long text'
)

agent = Agent::ChainOfThoughtAgent.new(
  llm: :openai,
  llm_api_key: ENV["OPENAI_API_KEY"],
  tools: ['search', 'calculator', summarization_tool]
)
andreibondarev commented 1 year ago

@ProGM Just a quick iteration on top of your pseudo-ish code:

cohere = LLM::Cohere.new(...) # Let's say you want to use Cohere's summarize endpoint

summarization_tool = Langchain::Tool.new(
  name: "summarization_tool",
  function: ->(text) { cohere.summarize(text: text) },
  description: "This tool can be used to summarize a long text."
)

agent = Agent::ChainOfThoughtAgent.new(
  llm: :openai,
  llm_api_key: ENV["OPENAI_API_KEY"],
  tools: ['search', 'calculator', summarization_tool]
)

What're your thoughts?

andreibondarev commented 1 year ago

@ProGM This PR would address the first part of this.

andreibondarev commented 1 year ago

Source: https://docs.langchain.com/docs/use-cases/summarization

A common use case is wanting to summarize long documents. This naturally runs into the context window limitations. Unlike in question-answering, you can't just do some semantic search hacks to only select the chunks of text most relevant to the question (because, in this case, there is no particular question - you want to summarize everything). So what do you do then?

The most common way around this is to split the documents into chunks and then do summarization in a recursive manner. By this we mean you first summarize each chunk by itself, then you group the summaries into chunks and summarize each chunk of summaries, and continue doing that until only one is left.

In order to tackle the issue of summarizing documents that exceed the context window -- I think what we could do is to enhance the summarize() methods to check the length of the text being passed in and if it's too long then recursively split -> summarize -> combine -> summarize.
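
Roughly, the recursive approach could look like this (the chunking and token-counting helpers are made up for illustration):

MAX_TOKENS = 3_000 # placeholder context budget

# Sketch only: split_into_chunks and token_count are illustrative helpers,
# not existing library methods.
def recursive_summarize(llm, text)
  return llm.summarize(text: text) if token_count(text) <= MAX_TOKENS

  # Summarize each chunk on its own...
  summaries = split_into_chunks(text, MAX_TOKENS).map { |chunk| llm.summarize(text: chunk) }

  # ...then combine the partial summaries and recurse until one summary fits.
  recursive_summarize(llm, summaries.join("\n"))
end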

ProGM commented 1 year ago

> @ProGM Just a quick iteration on top of your pseudo-ish code: […] What're your thoughts?

@andreibondarev That's exactly what I meant. It would be great! 🎉

> @ProGM This PR would address the first part of this.

Cool!

> In order to tackle the issue of summarizing documents that exceed the context window -- I think what we could do is to enhance the summarize() methods to check the length of the text being passed in and if it's too long then recursively split -> summarize -> combine -> summarize.

Yup, I think this concept is implemented as the refine strategy in LangChain: https://github.com/hwchase17/langchain/blob/9c0cb90997db9eb2e2a736df458d39fd7bec8ffb/langchain/chains/summarize/refine_prompts.py
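
Roughly, refine works like this (prompt wording loosely adapted from the linked file; the llm calls are placeholders):

# Sketch only: summarize the first chunk, then fold each remaining chunk
# into the running summary one at a time.
def refine_summarize(llm, chunks)
  summary = llm.summarize(text: chunks.first)

  chunks.drop(1).each do |chunk|
    prompt = <<~PROMPT
      We have an existing summary up to a certain point:
      #{summary}

      Refine the existing summary (only if needed) with this new context:
      #{chunk}
    PROMPT

    summary = llm.complete(prompt: prompt)
  end

  summary
end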

And we may need a tokenizer library to count tokens, like this or this.

andreibondarev commented 1 year ago

@ProGM I think an incremental next step would be adding tiktoken_ruby to wrap the OpenAI API calls, to ensure that token limits are not exceeded when the completion endpoint is hit.
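
Something like this (model name and limit are just examples):

require "tiktoken_ruby"

prompt_text = "Write a concise summary of the following: ..."

encoder = Tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = encoder.encode(prompt_text)

# Bail out (or switch to chunking) before hitting the completion endpoint.
raise "Prompt exceeds the context window" if tokens.length > 4_096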

andreibondarev commented 1 year ago

@ProGM Would something like this work as a good starting point? https://github.com/andreibondarev/langchainrb/pull/71

I think the next step in that summarization workflow would be to recursively check the token length as the passed-in text is being summarized. BUT I think it has to wait until the chunking work is done!

ProGM commented 1 year ago

@andreibondarev Thanks for keeping me up to date! It's a good start for sure.

I think token limits are not exclusive to OpenAI. PaLM should have an 8,000-token limit, and Anthropic has 100k (which is a lot, but still a limit).

andreibondarev commented 1 year ago

Yeah, I just meant that the Tiktoken library only covers OpenAI models, no others.

ProGM commented 1 year ago

Oh, I didn't know about it! D: