manisnesan / til

collection of today i learned scripts

Mastering RAG series #101

Open manisnesan opened 2 weeks ago

manisnesan commented 2 weeks ago

https://www.rungalileo.io/blog/tags/rag

Example Q&A system that generates questions using a chunk-based approach

https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-datapoints
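A minimal sketch of the chunk-based step, assuming word-window chunking with a small overlap (the function name and parameters are illustrative; the actual chunking used in the blog post is not shown in this thread):

```python
# Hypothetical chunker: split each product description into overlapping
# word windows so questions can be generated per chunk.

def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split `text` into word-based chunks of `chunk_size` words,
    with `overlap` words shared between consecutive chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be fed to the question-generation prompt independently.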

manisnesan commented 2 weeks ago

We employ a few-shot approach to create synthetic questions, directing the model to generate five distinct and challenging questions from the product description. The model is instructed to include the exact product name from the description in each question.
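The few-shot prompt might look like the sketch below. The example Q&A pair and all names here are assumptions for illustration; the actual prompt from the blog post is not reproduced in this thread:

```python
# Hypothetical few-shot prompt builder: ask for five distinct questions,
# each containing the exact product name.

FEW_SHOT_EXAMPLES = [
    {
        # Illustrative example only, not from the dataset.
        "description": "Fresho Banana - Robusta, 500 g. Naturally ripened bananas.",
        "questions": [
            "Is Fresho Banana - Robusta naturally ripened?",
        ],
    },
]

def build_prompt(product_name: str, description: str, n_questions: int = 5) -> str:
    """Assemble a few-shot prompt for synthetic question generation."""
    lines = [
        f"Generate {n_questions} distinct, challenging questions from the "
        "product description below.",
        f"Every question must contain the exact product name: {product_name}.",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"Description: {ex['description']}")
        for q in ex["questions"]:
            lines.append(f"Q: {q}")
        lines.append("")
    lines.append(f"Description: {description}")
    lines.append("Q:")
    return "\n".join(lines)
```

The resulting string would be sent to the LLM (e.g. GPT-3.5-turbo) as the user message.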

manisnesan commented 2 weeks ago

Select the embedding model

Initially, we will conduct experiments to determine the optimal encoder. Keeping the sentence tokenizer, LLM (GPT-3.5-turbo), and k (20) constant, we assess four different encoders:

  1. all-mpnet-base-v2 (dim 768)
  2. all-MiniLM-L6-v2 (dim 384)
  3. text-embedding-3-small (dim 1536)
  4. text-embedding-3-large (dim 3072, i.e. 1536 × 2)

Our guiding metric is context adherence, which measures hallucinations. The metrics for these four experiments are presented in the last four rows of the table above. Among them, text-embedding-3-small achieves the highest context adherence score, making it the winner for further optimization.
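The sweep described above can be sketched as a loop that holds the sentence tokenizer, LLM, and k fixed and varies only the encoder. `evaluate_rag` below is a stand-in for the real evaluation pipeline (e.g. one returning a context adherence score per run), not an actual API:

```python
# Hedged sketch of the encoder selection experiment: same tokenizer,
# same LLM (gpt-3.5-turbo), same k=20; only the encoder changes.

ENCODERS = {
    "all-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def select_encoder(evaluate_rag, k: int = 20) -> str:
    """Run one evaluation per encoder and return the name of the
    encoder with the highest context adherence score."""
    scores = {
        name: evaluate_rag(encoder=name, dim=dim, k=k)
        for name, dim in ENCODERS.items()
    }
    return max(scores, key=scores.get)
```

With the scores reported in the blog post, this selection would return text-embedding-3-small, which is then carried forward for further optimization.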