Both LangChain and LlamaIndex integrations are available.
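As a hedged sketch of the basic workflow (argument names follow the RAGatouille README at the time of these notes and may have changed since), indexing, searching, and exposing the index as a LangChain retriever looks roughly like this:

```python
from ragatouille import RAGPretrainedModel

# Load a pretrained ColBERT v2 checkpoint wrapped by RAGatouille.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build an index; split_documents lets RAGatouille chunk long documents itself.
RAG.index(
    collection=["Hayao Miyazaki co-founded Studio Ghibli in 1985 ..."],
    index_name="demo_index",
    max_document_length=180,
    split_documents=True,
)

# Plain search ...
results = RAG.search(query="Which studio did Miyazaki found?", k=3)

# ... or hand the index to LangChain as a retriever.
retriever = RAG.as_langchain_retriever(k=3)
```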
Short guide on ColBERT v2: https://x.com/anmolsj/status/1744499524113158207?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
Ideas
Document chunking based on the model context size or a specified chunk length. Uses LlamaIndex's SentenceSplitter.
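A minimal chunking sketch, assuming a recent llama-index layout (the SentenceSplitter import path differs across versions), with chunk sizes chosen to stay within the retriever's passage length:

```python
from llama_index.core.node_parser import SentenceSplitter  # import path varies by llama-index version

# chunk_size is measured in tokens; ColBERT-style retrievers are usually
# indexed with short passages, so something in the 180-256 range is typical.
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=32)

document = "..."  # placeholder: one long document string
chunks = splitter.split_text(document)  # -> list[str], one entry per chunk
```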
TrainingDataProcessor pipeline: converts any kind of query pairs or triplets into a ColBERT-friendly format. SimpleDataMiner mines hard negatives for every query, which greatly simplifies the data-preparation step.
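A hedged sketch of that preparation step via RAGatouille's RAGTrainer, which wraps the TrainingDataProcessor (argument names follow the README; the toy pairs and corpus below are purely illustrative):

```python
from ragatouille import RAGTrainer

trainer = RAGTrainer(
    model_name="MyFineTunedColBERT",                # name for the fine-tuned model
    pretrained_model_name="colbert-ir/colbertv2.0",
)

# (query, relevant_passage) pairs; the processor converts them into
# ColBERT-friendly triplets and mines hard negatives from the full corpus.
pairs = [("what is late interaction?", "Late interaction scores a query against ...")]
corpus = ["Late interaction scores a query against ...", "BM25 is a sparse retriever ..."]

trainer.prepare_training_data(
    raw_data=pairs,
    all_documents=corpus,
    data_out_path="./colbert_training_data/",
)
trainer.train(batch_size=32)
```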
Hard negatives: why are hard negatives needed? They are passages that look plausible for a query but are not actually relevant; training against them forces the model to learn fine-grained distinctions instead of relying on surface similarity.
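RAGatouille's miner handles this internally; the sketch below is a library-agnostic illustration of the idea (using sentence-transformers, not RAGatouille's own code): retrieve the top-k passages for a query with a cheap retriever, drop the known positives, and keep the plausible-but-wrong remainder as hard negatives.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["passage one ...", "passage two ...", "passage three ..."]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

def mine_hard_negatives(query: str, positives: set[str], k: int = 10) -> list[str]:
    query_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    # Highly ranked passages that are NOT labelled relevant look plausible but
    # are wrong: exactly the examples the model must learn to push away.
    return [corpus[h["corpus_id"]] for h in hits if corpus[h["corpus_id"]] not in positives]
```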
For highly specific domains such as bio, finance, etc., fine-tuning the retriever may be needed. Use jxnlco's instructor library to get GPT-4 to generate synthetic queries, then let RAGatouille take care of the rest.
From https://github.com/bclavie/RAGatouille/blob/main/examples/03-finetuning_without_annotations_with_instructor_and_RAGatouille.ipynb (RAGatouille + Instructor (see: LLM validation): fine-tuning ColBERT(v2) with no annotated data)
Getting annotated data is expensive! Thankfully, the retrieval literature has recently shown that synthetic data can yield similar, if not better, performance when fine-tuning retrieval models. This means we can fine-tune to our target domain without needing pricey and time-consuming annotations. In this tutorial, we'll show how easily we can leverage Jason Liu's instructor library for structured extraction with OpenAI's function calling API to generate meaningful query-passage pairs.
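A hedged sketch of that synthetic-query step (class and function names here are illustrative, not taken from the notebook; newer instructor versions use `from_openai`, older ones `instructor.patch`):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())  # or instructor.patch(OpenAI()) on older versions

class SyntheticQueries(BaseModel):
    queries: list[str]

def queries_for_passage(passage: str, n: int = 3) -> list[str]:
    # Function calling plus a Pydantic response_model yields structured output.
    result = client.chat.completions.create(
        model="gpt-4",
        response_model=SyntheticQueries,
        messages=[{
            "role": "user",
            "content": f"Write {n} realistic search queries that this passage answers:\n\n{passage}",
        }],
    )
    return result.queries
```

Each generated (query, passage) pair can then be handed to RAGatouille's trainer as raw training data.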
The late-interaction model is a superpower for the RAG pipeline. Its bag-of-embeddings approach is similar to bag-of-words in that it works on small units of information and represents a document as the sum of these units, and similar to embeddings in that it works at the semantic level: the exact way a given piece of text is phrased is not important, since the model learns its meaning. Refer to https://ben.clavie.eu/ragatouille/#tldr for more depth.
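For intuition, a simplified MaxSim scoring sketch (NumPy, random embeddings; real ColBERT uses normalised, learned token embeddings, so this only shows the shape of the computation):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """query_emb: (n_query_tokens, dim); doc_emb: (n_doc_tokens, dim)."""
    sim = query_emb @ doc_emb.T          # token-to-token similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))    # 8 query token embeddings
d = rng.normal(size=(180, 128))  # 180 document token embeddings
print(maxsim_score(q, d))
```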
Related work: JaColBERT (a ColBERT-based Japanese document retriever trained on Japanese data, with strong performance). Announcement: https://twitter.com/bclavie/status/1739788857397088660 Report: https://t.co/DvWZA0cdxT Model: https://t.co/F5gvjYUonw
See the exploration here https://github.com/manisnesan/fastchai/tree/master/ragatouille
retrieval model
Doc exploration
Late-interaction retrievers perform strongly on zero-shot tasks (compared apples to apples); they're very easy to adapt to new domains thanks to their bag-of-embeddings approach.
constraints
end outcomes
Next Steps
| retrieval | pros | cons |
|---|---|---|
| BM25 / keyword-based sparse retrieval | fast; consistent performance; intuitive and well understood; no training required | requires exact matches; uses no semantic information, so it hits a hard performance ceiling |
| cross-encoder | very strong performance; leverages semantic information to a large extent, especially negation understanding* | major scalability issues: scoring requires a full query-document comparison for every candidate, so it is commonly used only in a reranking setting |
| dense retrieval / embeddings | fast; decent performance overall; pre-trained models available; leverages semantic information | semantic but lacks contrastive information, i.e. no negation understanding; finicky fine-tuning; top performance requires billion-parameter models (e.g. e5-mistral) and billions of pre-training samples; poor generalisation |
*negation understanding: "I love apples" vs. "I hate apples"
Source: https://ben.clavie.eu/ragatouille/#longer-might-read
https://github.com/bclavie/RAGatouille
Announcement tweet by bclavie
https://news.ycombinator.com/item?id=38869223
See ColBERT issue https://github.com/manisnesan/AISC-WG-Search-Recsys/issues/23