Both LangChain and LlamaIndex integrations are available.
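As a hedged sketch of the basic workflow (argument names follow the RAGatouille README at the time of these notes and may have changed since), indexing, searching, and exposing the index as a LangChain retriever looks roughly like this:

```python
from ragatouille import RAGPretrainedModel

# Load a pretrained ColBERT v2 checkpoint wrapped by RAGatouille.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build an index; split_documents lets RAGatouille chunk long documents itself.
RAG.index(
    collection=["Hayao Miyazaki co-founded Studio Ghibli in 1985 ..."],
    index_name="demo_index",
    max_document_length=180,
    split_documents=True,
)

# Plain search ...
results = RAG.search(query="Which studio did Miyazaki found?", k=3)

# ... or hand the index to LangChain as a retriever.
retriever = RAG.as_langchain_retriever(k=3)
```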
Short guide on ColBERT v2: https://x.com/anmolsj/status/1744499524113158207?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
Ideas
Document chunking based on the model context size or a specified chunk length. Uses LlamaIndex's SentenceSplitter.
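A minimal chunking sketch, assuming a recent llama-index layout (the SentenceSplitter import path differs across versions), with chunk sizes chosen to stay within the retriever's passage length:

```python
from llama_index.core.node_parser import SentenceSplitter  # import path varies by llama-index version

# chunk_size is measured in tokens; ColBERT-style retrievers are usually
# indexed with short passages, so something in the 180-256 range is typical.
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=32)

document = "..."  # placeholder: one long document string
chunks = splitter.split_text(document)  # -> list[str], one entry per chunk
```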
TrainingDataProcessor pipeline: converts any kind of query pairs or triplets into a ColBERT-friendly format. SimpleDataMiner mines hard negatives for every query, which greatly simplifies the data-preparation step.
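A hedged sketch of that preparation step via RAGatouille's RAGTrainer, which wraps the TrainingDataProcessor (argument names follow the README; the toy pairs and corpus below are purely illustrative):

```python
from ragatouille import RAGTrainer

trainer = RAGTrainer(
    model_name="MyFineTunedColBERT",                # name for the fine-tuned model
    pretrained_model_name="colbert-ir/colbertv2.0",
)

# (query, relevant_passage) pairs; the processor converts them into
# ColBERT-friendly triplets and mines hard negatives from the full corpus.
pairs = [("what is late interaction?", "Late interaction scores a query against ...")]
corpus = ["Late interaction scores a query against ...", "BM25 is a sparse retriever ..."]

trainer.prepare_training_data(
    raw_data=pairs,
    all_documents=corpus,
    data_out_path="./colbert_training_data/",
)
trainer.train(batch_size=32)
```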
Hard negatives: why are hard negatives needed? They are passages that look plausible for a query but are not actually relevant; training against them forces the model to learn fine-grained distinctions instead of relying on surface similarity.
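RAGatouille's miner handles this internally; the sketch below is a library-agnostic illustration of the idea (using sentence-transformers, not RAGatouille's own code): retrieve the top-k passages for a query with a cheap retriever, drop the known positives, and keep the plausible-but-wrong remainder as hard negatives.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["passage one ...", "passage two ...", "passage three ..."]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

def mine_hard_negatives(query: str, positives: set[str], k: int = 10) -> list[str]:
    query_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    # Highly ranked passages that are NOT labelled relevant look plausible but
    # are wrong: exactly the examples the model must learn to push away.
    return [corpus[h["corpus_id"]] for h in hits if corpus[h["corpus_id"]] not in positives]
```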
For highly specific domains such as bio, finance, etc., fine-tuning the retriever may be needed. Use jxnlco's instructor library to get GPT-4 to generate synthetic queries, then let RAGatouille take care of the rest.
From https://github.com/bclavie/RAGatouille/blob/main/examples/03-finetuning_without_annotations_with_instructor_and_RAGatouille.ipynb (RAGatouille + Instructor (see: LLM validation): fine-tuning ColBERT(v2) with no annotated data)
Getting annotated data is expensive! Thankfully, the retrieval literature has recently shown that synthetic data can yield similar, if not better, performance when fine-tuning retrieval models. This means we can fine-tune to our target domain without needing pricey and time-consuming annotations. In this tutorial, we'll show how easily we can leverage Jason Liu's instructor library for structured extraction with OpenAI's function calling API to generate meaningful query-passage pairs.
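A hedged sketch of that synthetic-query step (class and function names here are illustrative, not taken from the notebook; newer instructor versions use `from_openai`, older ones `instructor.patch`):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())  # or instructor.patch(OpenAI()) on older versions

class SyntheticQueries(BaseModel):
    queries: list[str]

def queries_for_passage(passage: str, n: int = 3) -> list[str]:
    # Function calling plus a Pydantic response_model yields structured output.
    result = client.chat.completions.create(
        model="gpt-4",
        response_model=SyntheticQueries,
        messages=[{
            "role": "user",
            "content": f"Write {n} realistic search queries that this passage answers:\n\n{passage}",
        }],
    )
    return result.queries
```

Each generated (query, passage) pair can then be handed to RAGatouille's trainer as raw training data.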
The late-interaction model is a superpower for the RAG pipeline. Its bag-of-embeddings approach is similar to bag-of-words in that it works on small units of information and represents a document as the sum of these units, and similar to embeddings in that it works at the semantic level: the exact way a given piece of text is phrased is not important, since the model learns its meaning. Refer to https://ben.clavie.eu/ragatouille/#tldr for more depth.
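For intuition, a simplified MaxSim scoring sketch (NumPy, random embeddings; real ColBERT uses normalised, learned token embeddings, so this only shows the shape of the computation):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """query_emb: (n_query_tokens, dim); doc_emb: (n_doc_tokens, dim)."""
    sim = query_emb @ doc_emb.T          # token-to-token similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))    # 8 query token embeddings
d = rng.normal(size=(180, 128))  # 180 document token embeddings
print(maxsim_score(q, d))
```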
Related work: JaColBERT (a ColBERT-based Japanese document retriever trained on Japanese data, with strong performance). Announcement: https://twitter.com/bclavie/status/1739788857397088660 Report: https://t.co/DvWZA0cdxT Model: https://t.co/F5gvjYUonw
See the exploration here https://github.com/manisnesan/fastchai/tree/master/ragatouille
retrieval model
Doc exploration
Late-interaction retrievers perform strongly on zero-shot tasks (compared apples to apples); they're very easy to adapt to new domains thanks to their bag-of-embeddings approach.
constraints
end outcomes
Next Steps
| retrieval | pros | cons |
|---|---|---|
| BM25 / keyword-based sparse retrieval | fast; consistent performance; intuitive and well understood; no training required | requires exact matches; uses no semantic information, so it hits a hard performance ceiling |
| cross-encoder | very strong performance; leverages semantic information to a large extent, especially negation understanding* | major scalability issues: scoring requires a full query-document comparison for every candidate, so it is commonly used only in a reranking setting |
| dense retrieval / embeddings | fast; decent performance overall; pre-trained models available; leverages semantic information | semantic but lacks contrastive information, i.e. no negation understanding; finicky fine-tuning; top performance requires billion-parameter models (e.g. e5-mistral) and billions of pre-training samples; poor generalisation |
*negation understanding: "I love apples" vs. "I hate apples"
Source: https://ben.clavie.eu/ragatouille/#longer-might-read
https://github.com/bclavie/RAGatouille
Announcement tweet by bclavie
https://news.ycombinator.com/item?id=38869223
See ColBERT issue https://github.com/manisnesan/AISC-WG-Search-Recsys/issues/23