rmusser01 / tldw

Too Long, Didn't Watch(TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM
Apache License 2.0

Improvement: Improve (recursive) chunked summarization technique #31

Open rmusser01 opened 1 month ago

rmusser01 commented 1 month ago

Currently, the approach to chunked summarization is to naively split the input data into chunks based on token count, or on the time spans noted in the JSON output from faster_whisper.
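For reference, a minimal sketch of what that naive token-count splitting looks like (the function and tokenizer choice here are illustrative assumptions, not the actual tldw code):

```python
# Minimal sketch of naive token-count chunking with overlap (illustrative only;
# not the actual tldw implementation). Assumes the tiktoken package.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    """Split `text` into chunks of at most `max_tokens` tokens, with a small overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap  # step back so neighboring chunks share some context
    return chunks
```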

I would like to improve on this approach, though I'm not yet sure how.

This issue is to track this discussion and act as a dumping ground for potential ideas.

rmusser01 commented 1 month ago

Links:

- How to evaluate abstractive summarization (OpenAI Cookbook): https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization
- Chunking strategies (Pinecone): https://www.pinecone.io/learn/chunking-strategies/
- Auto-Chapters (AssemblyAI): https://www.assemblyai.com/docs/audio-intelligence/auto-chapters
- The 5 Levels Of Text Splitting For Retrieval: https://www.youtube.com/watch?v=8OJC21T2SL4
- Large document summarization (GCP): https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents.ipynb
- Tree summarize (LlamaIndex): https://docs.llamaindex.ai/en/stable/examples/response_synthesizers/tree_summarize/
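For the recursive/tree-style summarization the last two links describe, the general shape is roughly the following sketch (`llm_summarize` is a placeholder for whatever model call ends up being used, not an existing function in this repo):

```python
# Sketch of recursive (map-reduce / tree) summarization.
def llm_summarize(text: str) -> str:
    raise NotImplementedError("call your summarization model here")

def recursive_summarize(chunks: list[str], group_size: int = 4) -> str:
    """Summarize each chunk, then recursively summarize groups of summaries
    until a single summary remains."""
    summaries = [llm_summarize(c) for c in chunks]            # map step
    while len(summaries) > 1:                                 # reduce step(s)
        grouped = ["\n\n".join(summaries[i:i + group_size])
                   for i in range(0, len(summaries), group_size)]
        summaries = [llm_summarize(g) for g in grouped]
    return summaries[0]
```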

rmusser01 commented 1 month ago

Moving comment to this issue for tracking:

A comment chain from reddit that seems relevant:

cyan2k
My experience over the past 2-ish years:

RAG IS FUCKING DIFFICULT.

We're talking about a client shelling out >$200K for a solution that barely scrapes a 60% hit rate—the 
kind that would spark outrage and spawn strongly worded Reddit threads if you knew just how much it sucks.

So, set your expectations accordingly. You're not getting your perfect RAG in an afternoon.

"What do you mean difficult? It's three lines of code with LangChain, lol, noob," Sure, getting started is 
simple but making it actually good? lol right back at you.

You see plenty of people disappointed with RAG, claiming it doesn't meet business needs and isn't as 
great as hyped. But in my opinion, it’s more a case of "people don't have a fucking clue." They put together 
the most overengineered data transformation pipelines possible, using the latest "meta" they picked up from
 Medium or TikTok, all held together by the shittiest code lifted straight from LangChain and LlamaIndex. 
Then they head over to [r/MachineLearning](https://www.reddit.com/r/MachineLearning/) to complain about how RAG sucks, totally overlooking that the latest paper on some obscure RAG optimization means shit in the real world with real data.

Or they opt for the lazy approach... Just chunk it up, rely on large context windows, dump everything into a
 single vector store, and trust in the magic of the LLM to somehow make the result good. But then reality 
hits when it hallucinates the shit out of the 12,000 tokens you fed it. "Needle in a haystack," my ass. This approach also has zero real-world applicability.

So yeah, it's tough, but I promise, you can make any use case work. But getting there is real work, and you 
need to decide if the effort is worth it.

My Recommendations:

1. Set Up an Evaluation Pipeline Early: You need to know how good your system is. Don't start any real work without the ability to quantify performance. You must be able to say, "If I switch this prompt for that one, performance improves by 7% but hallucination increases by 10%." (See the sketch after the resource links below.)

2. Educate Yourself: Dive into resources like ragas, dspy, and probably also libraries providing structured-output guardrails like outlines.

3. Choose Your Tracing/Logging Framework: Make friends with a framework like W&B, Langsmith, Langfuse, or my personal favorite, Phoenix.

- ragas docs: https://docs.ragas.io/en/stable/
- DSPy GitHub repository: https://github.com/stanfordnlp/dspy
- Outlines GitHub repository: https://github.com/outlines-dev/outlines
- wandb site: https://wandb.ai/site
- Langsmith: https://www.langchain.com/langsmith
- Phoenix: https://phoenix.arize.com/

Now you have the tools to really measure and optimize your system!
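To make point 1 above concrete, here is a bare-bones sketch of that kind of evaluation loop (the test-case format and the `run_rag` / `judge_hallucination` helpers are hypothetical placeholders; libraries like ragas give you proper metrics out of the box):

```python
# Bare-bones evaluation harness sketch. `run_rag` and `judge_hallucination`
# are hypothetical placeholders for your pipeline and your grading method.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_answer: str

def run_rag(question: str, prompt_variant: str) -> str:
    raise NotImplementedError("call your RAG pipeline here")

def judge_hallucination(answer: str, expected: str) -> bool:
    raise NotImplementedError("LLM-as-judge or fact matching here")

def evaluate(cases: list[EvalCase], prompt_variant: str) -> dict[str, float]:
    hits = hallucinations = 0
    for case in cases:
        answer = run_rag(case.question, prompt_variant)
        if case.expected_answer.lower() in answer.lower():
            hits += 1
        if judge_hallucination(answer, case.expected_answer):
            hallucinations += 1
    n = len(cases)
    return {"hit_rate": hits / n, "hallucination_rate": hallucinations / n}

# Run the same fixed test set against two prompt variants and compare:
# evaluate(cases, "prompt_a") vs. evaluate(cases, "prompt_b")
```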

This is mostly trial and error, as dumb as it sounds. The tech is still too new to have a solid
 "best practice" guide. Data varies so much from Source A to Source B that the strategy has to be 
tailored for each client, for each use case. However, some ideas from a couple of successful projects:

If you need over 8k tokens, your chunking strategy, retrieval process, ranking, or whatever, SUCKS. That's why it blows my mind every time I hear people complain that Llama3 only has an 8k token context. What do you even need more tokens for? What kind of magical text do you have that is so informationally dense over 5000 words that you can't split it?

8k is optimal because with most models (even those crazy high-context-window ones like Yi), the first 8k tokens are amazingly accurate, and after that they ALL drop off. Perhaps not in your "passphrase" search test, but in real life with real end-users, they will.

Less Is More: Don't get caught up in merging 20 different RAG pipelines from LangChain and 30 from LlamaIndex into one Frankenstein monster of a pipeline just because you see in the docs that it's a direct implementation of some new paper. Stick to what works and use
your own brain. Financial data, for example, often just boils down to a few key numbers, but 
understanding them requires a web of related data across documents. Classic RAG will let you 
down here. You need a system that maps entities via relationships like a Knowledge Graph.
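As a toy illustration of that knowledge-graph point (the entities, relations, and use of networkx here are invented for the example, not taken from any real pipeline):

```python
# Toy illustration: map entities via explicit relationships instead of relying
# on similarity search alone. All entities/relations below are made up.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("AcmeCorp", "Q3-2023 10-Q", relation="filed")
kg.add_edge("Q3-2023 10-Q", "revenue figure", relation="reports")
kg.add_edge("AcmeCorp", "SubsidiaryCo", relation="owns")
kg.add_edge("SubsidiaryCo", "Q3-2023 annex", relation="filed")

def facts_about(entity: str) -> list[tuple[str, str, str]]:
    """Collect (subject, relation, object) triples from outgoing edges."""
    return [(entity, data["relation"], neighbor)
            for _, neighbor, data in kg.out_edges(entity, data=True)]

print(facts_about("AcmeCorp"))
# [('AcmeCorp', 'filed', 'Q3-2023 10-Q'), ('AcmeCorp', 'owns', 'SubsidiaryCo')]
```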

The paper you read yesterday about RAPTOR, HyDE, or some new "hot shit" RAG technique... yeah, try it out, but keep in mind they are usually not tested with real use cases. The mere fact that you as the study creator know you only test on benchmarks makes the whole study biased imho, and 90% of those papers don't publish a single line of code anyway, which is fucking stupid and should be prohibited, because who the fuck knows if the numbers they put into their study are even true. Don't be disillusioned if it doesn't work out. Also, never go to [r/machinelearning](https://www.reddit.com/r/machinelearning/) for your LLM questions. Saddest place on reddit lol.

Have Stamina! Good Luck!

TL;DR: Set up an evaluation pipeline, improve your process step by step. Keep it simple and 
chill with context size and optimization craziness.

----->
arthurwolf

> Just chunk it up, rely on large context windows, dump everything into a single vector store, and trust in the magic of the LLM to somehow make the result good. But then reality hits when it hallucinates the shit out of the 12,000 tokens you fed it

The solution we implemented is similar to this but with an extra step.

We gather data *very* liberally (using both a keyword and a vector-based search) and grab anything that might be related. Massive amounts of tokens.

Then we go over each result and, for each one, ask it « is there anything in there that matters to this question? <question>. If so, tell us what it is ».

Then with only the info that passed through that filter, we do the actual final prompt as 
you'd normally do (at that point we are back down to pretty low numbers of tokens).

Got us from around 60% to a bit over 85%, and growing (which is fine for our use case).

It's pretty fast (the filter step is highly parallelizable), and it works for *most* requests (but fails miserably for a few, something for which we're implementing contingencies).

However, it is expensive. We're talking multiple cents per customer question. That might not be OK for others. We are exploring (much) cheaper models for the filter and seeing good results so far.
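A rough sketch of that retrieve-liberally-then-filter flow (function names and prompt wording are placeholders, not anyone's actual implementation; the per-result filter step is what parallelizes well):

```python
# Sketch of "retrieve liberally, filter each hit with a cheap LLM, then answer".
# All search/LLM functions are placeholder stubs.
from concurrent.futures import ThreadPoolExecutor

def keyword_search(question: str) -> list[str]: raise NotImplementedError
def vector_search(question: str) -> list[str]: raise NotImplementedError
def cheap_llm(prompt: str) -> str: raise NotImplementedError
def main_llm(prompt: str) -> str: raise NotImplementedError

def filter_passage(question: str, passage: str) -> str | None:
    """Ask a cheap model whether the passage contains anything relevant."""
    reply = cheap_llm(
        "Is there anything in the following text that matters to this question?\n"
        f"Question: {question}\nText: {passage}\n"
        "If so, state what it is; otherwise answer exactly NOTHING."
    )
    return None if reply.strip().upper().startswith("NOTHING") else reply

def answer(question: str) -> str:
    # 1) Gather liberally from both keyword and vector search.
    candidates = keyword_search(question) + vector_search(question)
    # 2) Filter every candidate in parallel with the cheap model.
    with ThreadPoolExecutor(max_workers=16) as pool:
        kept = [r for r in pool.map(lambda p: filter_passage(question, p), candidates) if r]
    # 3) Final prompt over the much smaller filtered context.
    context = "\n\n".join(kept)
    return main_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```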

----->----->
nightman

I recommend trying reranking (like Cohere reranking and filtering based on relevance_score) instead of the current filtering. It might not work for you, but it's a middle ground between naive vector store retrieval and checking each document with an LLM to see if it fits.
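For reference, a hedged sketch of that reranking idea against the Cohere rerank endpoint (model name, threshold, and exact client usage are assumptions; check the current Cohere SDK docs):

```python
# Sketch: rerank retrieved docs and keep only those above a relevance threshold,
# as a middle ground between plain vector retrieval and per-doc LLM filtering.
# Model name and threshold are assumptions; verify against the Cohere SDK docs.
import cohere

co = cohere.Client("YOUR_API_KEY")

def rerank_and_filter(question: str, docs: list[str],
                      threshold: float = 0.5, top_n: int = 20) -> list[str]:
    response = co.rerank(
        model="rerank-english-v3.0",
        query=question,
        documents=docs,
        top_n=top_n,
    )
    # Keep only documents the reranker scores above the relevance threshold.
    return [docs[r.index] for r in response.results if r.relevance_score >= threshold]
```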