manisnesan / til

Collection of Today I Learned scripts

Embeddings: What they are and why they matter #59

Open manisnesan opened 1 year ago

manisnesan commented 1 year ago

https://simonwillison.net/2023/Oct/23/embeddings/
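
For a quick, concrete sense of what the post describes, here is a minimal sketch (my own, not from the article) that embeds a few sentences with sentence-transformers and compares them by cosine similarity. The model name is just an example choice:

```python
from sentence_transformers import SentenceTransformer

# Example model choice; any embedding model works here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "Embeddings map text to fixed-length vectors.",
    "A vector representation of a sentence.",
    "The weather in Toronto is cold today.",
]
embeddings = model.encode(sentences)  # shape (3, 384) for this model

# Semantically similar sentences get higher cosine similarity scores.
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```

The first two sentences should score much closer to each other than either does to the third; that distance structure is what makes embeddings useful for search and RAG.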

manisnesan commented 1 year ago

Further reading

manisnesan commented 4 months ago

Fine-tuning embedding models

Creating a Pipeline for Generating Synthetic Data for Fine-Tuning Custom Embedding Models. 👀

Step 1 Create a Knowledge Base: Start by preparing your domain-specific knowledge base, such as PDFs or other documents containing the relevant information. Convert the content of these documents into plain text.
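
As a sketch of the plain-text conversion, assuming pypdf as the extraction library and a hypothetical docs/ folder of source PDFs:

```python
from pathlib import Path

from pypdf import PdfReader  # one common choice; any PDF-to-text tool works

def pdf_to_text(path: str) -> str:
    # Concatenate the extracted text of every page into one plain-text string.
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Hypothetical layout: all source PDFs sit under docs/.
corpus = {p.name: pdf_to_text(str(p)) for p in Path("docs").glob("*.pdf")}
```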

Step 2 Chunk the Data: Divide your text data into manageable chunks of approximately 256 tokens each (the same chunk size you will use in RAG later).
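
A minimal token-based chunker, assuming a Hugging Face tokenizer; ideally use the tokenizer of the embedding model you plan to fine-tune so the 256-token budget is measured the same way at training and RAG time:

```python
from transformers import AutoTokenizer

# Example tokenizer choice; swap in the one matching your embedding model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    # Encode once, then split the token ids into fixed-size windows.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(ids[i : i + max_tokens])
        for i in range(0, len(ids), max_tokens)
    ]
```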

Step 3 Generate Questions Using an LLM: Use a Large Language Model (LLM) to generate K questions for each chunk of text. The questions should be answerable from the content of the chunk alone. Example prompt: "Generate five questions that can be answered using the following text: [insert chunk here]."
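
The tweet does not specify a client, so this sketch assumes the OpenAI Python SDK and a hypothetical model name; any chat-capable LLM would do:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Generate five questions that can be answered using the following text:\n\n{chunk}"
)

def questions_for_chunk(chunk: str, model: str = "gpt-4o-mini") -> list[str]:
    # Ask the LLM for questions grounded in this chunk, one per line.
    response = client.chat.completions.create(
        model=model,  # hypothetical model choice
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
    )
    text = response.choices[0].message.content
    # Naive parse: treat each non-empty line of the reply as one question.
    return [line.strip() for line in text.splitlines() if line.strip()]
```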

Step 4 Optionally Generate Hard Negative Examples: Create hard negative examples by generating questions that are similar to the correct questions but have answers that are incorrect or misleading. Alternatively, use random other samples from the batch as negative examples during training (in-batch negatives).
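
For the in-batch variant, no explicit mining is needed at training time (the loss in Step 6 handles it). If you do want explicit triplets, here is a sketch using random negatives; truly "hard" negatives would instead be mined with a retriever or generated by the LLM:

```python
import random

def build_triplets(pairs: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Turn (question, positive_chunk) pairs into (question, positive, negative) triplets."""
    chunks = [chunk for _, chunk in pairs]
    triplets = []
    for question, positive in pairs:
        # Random negative: any other chunk from the pool. Assumes the pool
        # contains more than one distinct chunk.
        negative = random.choice([c for c in chunks if c != positive])
        triplets.append((question, positive, negative))
    return triplets
```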

Step 5 Deduplicate and Filter Pairs: Remove duplicate question-context pairs to ensure uniqueness, and use the LLM as a judge to filter out lower-quality pairs against custom quality rubrics.
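
A sketch of the dedup half, using exact-match on normalized question text; the LLM-judge half would reuse a chat call like the one in Step 3, scoring each pair against your rubric and dropping low scorers:

```python
def dedup_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Exact-match dedup on normalized question text; fuzzier matching
    # (e.g. embedding similarity) is a possible refinement.
    seen: set[str] = set()
    unique = []
    for question, chunk in pairs:
        key = question.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append((question, chunk))
    return unique
```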

Step 6 Fine-Tune Embedding Models: Use the prepared data to fine-tune your embedding models with Sentence Transformers 3.0.
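
A minimal fine-tuning sketch with the Sentence Transformers 3.0 trainer API; the base model name, example data, and output path are placeholders. With MultipleNegativesRankingLoss, the other positives in each batch serve as the in-batch negatives from Step 4:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Hypothetical base model; pick one suited to your domain and size budget.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# (anchor, positive) pairs produced by Steps 1-5.
train_dataset = Dataset.from_dict(
    {
        "anchor": ["What does RAG stand for?"],
        "positive": ["Retrieval-augmented generation (RAG) combines ..."],
    }
)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()

model.save_pretrained("models/embedder-finetuned")  # hypothetical output path
```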

https://x.com/_philschmid/status/1798388387822317933?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ