[x] Create a paid account with OpenAI and get your API key.
Add the key to your environment variables.
[x] 1- Create an initial script for answering questions with OpenAI GPT-3.5: a basic function that takes an input, wraps a prompt around it, and answers the question.
Code from scratch
[x] 2- Add a very basic RAG with NumPy: cosine similarity over embeddings from an Ada embedding function, with a small example dataset in JSON.
[x] 3- Then compare with LlamaIndex to introduce it and explain why we use it from then on.
Update the function to do RAG with LlamaIndex. Same basic example, but with LlamaIndex.
[x] 4- Replace the JSON data store with Chroma or another vector database.
Script to embed the data and create a vector store from CSVs.
Script showing how we got and chunked the data, from an example using UseScraper (https://usescraper.com/),
with a basic chunking script (by number of characters).
[x] 5- Improve prompting for the question and add sources (references)
[x] 6- Advanced: write a script to create questions for the dataset with GPT-4
Evaluation script using Ragas or another framework, with LlamaIndex or another library
Run evaluation
[x] 7- Improve chunking (sections, titles in sections…) - re-evaluate
[x] 8- Script for fine-tuning the embeddings based on the GPT-4-generated questions above.
[x] 9- Replace Ada with Cohere, a Hugging Face model, or another better embedding model - re-evaluate
[x] 10- Add reranking, with Cohere (?) or an open-source model - re-evaluate
[x] 11- Add hybrid search - re-evaluate
[x] 12- Improve the query (reformulation, more details…) - re-evaluate
[x] 13- Add a router for data source optimization - re-evaluate
[ ] 19- Write the lessons from the code we built (notebooks initially; then we need to teach learners to replicate our full repo without giving it to them. They need to learn and work!). Refer to the full syllabus (https://www.notion.so/seldonia/Full-syllabus-564070f715b2455d9a6b945b0b470c6b). Don't forget to re-use parts of the ebook we just did to save time.
[ ] 20- Add a multilingual section with an image showing that embeddings are the same across languages, so it is valuable to have content in different languages, etc.
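
Rough sketch for step 2 (basic RAG with NumPy and cosine similarity) and the character-based chunking in step 4. This is not the course code, just a minimal illustration under assumptions: `toy_embed` is a hypothetical stand-in for the real embedding API call (a hashed bag of words, so it runs offline), and the helper names, default sizes, and sample sentences are made up for the example.

```python
import zlib

import numpy as np


def chunk_by_chars(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: hashed bag of words.

    In the real script this would be replaced by a call to the embedding API.
    """
    vec = np.zeros(dim)
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


def cosine_similarity(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of doc_matrix."""
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Guard against division by zero for empty (all-zero) vectors.
    return (doc_matrix @ query_vec) / np.maximum(norms, 1e-12)


def retrieve(question: str, chunks: list[str], embed=toy_embed, top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k chunks most similar to the question, with their scores."""
    doc_matrix = np.stack([embed(chunk) for chunk in chunks])
    scores = cosine_similarity(embed(question), doc_matrix)
    best = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], float(scores[i])) for i in best]


if __name__ == "__main__":
    # Illustrative sentences only, standing in for the small JSON dataset.
    chunks = [
        "cats purr and sleep",
        "python lists are mutable",
        "bread needs yeast",
    ]
    for chunk, score in retrieve("are python lists mutable", chunks, top_k=1):
        print(f"{score:.2f}  {chunk}")
```

The retrieved chunks would then be pasted into the prompt from step 1 as context. Swapping `toy_embed` for a real embedding function should leave `retrieve` unchanged, since it only needs a function that maps text to a vector.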