Fine tuning the embedding models
Creating a Pipeline for Generating Synthetic Data for Fine-Tuning Custom Embedding Models. 👀
Step 1: Create a Knowledge Base. Start by preparing your domain-specific knowledge base, such as PDFs or other documents, and convert their contents into plain text.
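A minimal sketch of the extraction step, assuming pypdf (any PDF-to-text library works; the file name is a placeholder):

```python
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    """Extract plain text from every page of a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

corpus = pdf_to_text("domain_docs.pdf")  # placeholder file name
```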
Step 2: Chunk the Data. Divide the text into manageable chunks of approximately 256 tokens each, matching the chunk size you will use for retrieval in RAG later.
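A minimal chunker, assuming tiktoken for tokenization (swap in your embedding model's own tokenizer to count tokens the same way the model does):

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 256) -> list[str]:
    """Split text into chunks of roughly `chunk_size` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

chunks = chunk_text(corpus)
```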
Step 3: Generate Questions Using an LLM. Use a large language model (LLM) to generate K questions for each chunk of text; each question should be answerable from the content of that chunk alone. Example prompt: "Generate five questions that can be answered using the following text: [insert chunk here]."
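A sketch of the generation call using the OpenAI client; the model name and K=5 are assumptions, and parsing one question per line is a convention the prompt has to enforce:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Generate five questions that can be answered using the following text. "
    "Return one question per line.\n\nText:\n{chunk}"
)

def questions_for_chunk(chunk: str, model: str = "gpt-4o-mini") -> list[str]:
    """Ask the LLM for questions answerable from `chunk`."""
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

pairs = [(q, chunk) for chunk in chunks for q in questions_for_chunk(chunk)]
```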
Step 4: Optionally Generate Hard Negative Examples. Create hard negatives by generating questions that look similar to the correct ones but whose answers are incorrect or misleading. Alternatively, use random other samples from the batch as negatives during training (in-batch negatives).
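Mining hard negatives needs its own LLM pass; the simpler variant just attaches a random non-matching chunk to each pair. A toy sketch of the latter (assumes more than one distinct chunk):

```python
import random

def add_random_negatives(pairs):
    """Attach a random non-matching chunk to each (question, context) pair.

    With Sentence Transformers' MultipleNegativesRankingLoss this step is
    optional: every other positive in the batch already acts as a negative.
    """
    triplets = []
    for question, context in pairs:
        negative = random.choice([c for _, c in pairs if c != context])
        triplets.append((question, context, negative))
    return triplets

triplets = add_random_negatives(pairs)
```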
Step 5: Deduplicate and Filter Pairs. Remove duplicate question-context pairs to ensure uniqueness, then use the LLM as a judge to filter out lower-quality pairs against custom rubrics for quality assessment.
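A sketch of the cleanup pass, reusing the OpenAI client from step 3: exact-match dedup on the normalized question, then an LLM judge scoring each pair against a rubric (the rubric wording, 1-5 scale, and threshold are assumptions):

```python
def dedupe(pairs):
    """Keep only the first occurrence of each normalized question."""
    seen, unique = set(), []
    for question, context in pairs:
        key = question.lower().strip()
        if key not in seen:
            seen.add(key)
            unique.append((question, context))
    return unique

JUDGE_PROMPT = (
    "Rate from 1 to 5 how well the question can be answered using only the "
    "context. Reply with a single digit.\n\nQuestion: {q}\n\nContext: {c}"
)

def keep_pair(question: str, context: str, threshold: int = 4) -> bool:
    """LLM-as-judge filter; keep pairs scoring at or above the threshold."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(q=question, c=context)}],
    )
    try:
        return int(response.choices[0].message.content.strip()[:1]) >= threshold
    except ValueError:
        return False  # unparsable judge output: drop the pair

clean_pairs = [p for p in dedupe(pairs) if keep_pair(*p)]
```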
Step 6: Fine-Tune Embedding Models. Use the prepared data to fine-tune your embedding model with Sentence Transformers 3.0.
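A minimal fine-tuning sketch with the Sentence Transformers 3.0 trainer API; the base model, output paths, and hyperparameters are placeholders:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# (question, context) pairs from step 5, framed as (anchor, positive) training data
train_dataset = Dataset.from_dict({
    "anchor": [q for q, _ in clean_pairs],
    "positive": [c for _, c in clean_pairs],
})

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # placeholder base model
loss = MultipleNegativesRankingLoss(model)  # treats in-batch samples as negatives

args = SentenceTransformerTrainingArguments(
    output_dir="models/bge-base-finetuned",  # placeholder output path
    num_train_epochs=1,
    per_device_train_batch_size=32,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save_pretrained("models/bge-base-finetuned/final")
```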
Further reading

https://x.com/_philschmid/status/1798388387822317933?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
https://simonwillison.net/2023/Oct/23/embeddings/