Build an eval dataset in the form of input-output pairs. Write the outputs manually first, then generate more synthetically from the actual documents, as sketched below.
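A minimal sketch of both stages, assuming a hypothetical `llm_complete` helper that stands in for any LLM provider's completion call:

```python
# Sketch: two-stage eval dataset of input-output pairs.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("swap in your LLM provider's completion call")

# Stage 1: hand-written pairs covering known use cases.
eval_dataset = [
    {"input": "What is the refund window?", "output": "30 days from delivery."},
]

# Stage 2: synthetic pairs sourced from the actual documents.
def synthesize_pairs(documents: list[str], n_per_doc: int = 2) -> list[dict]:
    pairs = []
    for doc in documents:
        for _ in range(n_per_doc):
            question = llm_complete(f"Write one question answerable from:\n{doc}")
            answer = llm_complete(f"Answer '{question}' using only:\n{doc}")
            pairs.append({"input": question, "output": answer})
    return pairs
```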
Build RAG to generate context, i.e. retrieve relevant information via search and then add it to the Context section of the prompt.
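A bare-bones retrieval sketch, assuming a hypothetical `embed` function and cosine similarity over in-memory chunks (a real system would use a vector store):

```python
# Sketch: retrieve the most relevant chunks and place them in the prompt.
import math

def embed(text: str) -> list[float]:
    raise NotImplementedError("swap in your embedding model")

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def build_prompt(query: str, chunks: list[str], top_k: int = 3) -> str:
    q_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```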
Evaluate models with a metric (a quality criterion) suited to the use case.
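One common shape for such a metric is LLM-as-judge; a sketch reusing the hypothetical `llm_complete` helper from above:

```python
# Sketch: LLM-as-judge metric scoring a model answer against the expected output.
def judge(question: str, expected: str, actual: str) -> int:
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Actual answer: {actual}\n"
        "Rate the actual answer from 1 (wrong) to 5 (equivalent to expected). "
        "Reply with the number only."
    )
    return int(llm_complete(prompt).strip())
```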
Run benchmarks on different combinations of the system (RAG + prompt engineering + model) and rank them by comparing their outputs against the eval dataset.
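A sketch of that harness; each entry in `configs` maps a name to a hypothetical `run_system` callable representing one RAG + prompt + model combination, scored with the `judge` metric above:

```python
# Sketch: rank system configurations by average judge score on the eval dataset.
from statistics import mean

def benchmark(configs: dict, eval_dataset: list[dict]) -> list[tuple[str, float]]:
    results = []
    for name, run_system in configs.items():  # run_system: query -> answer
        scores = [
            judge(ex["input"], ex["output"], run_system(ex["input"]))
            for ex in eval_dataset
        ]
        results.append((name, mean(scores)))
    # Highest average score first.
    return sorted(results, key=lambda r: r[1], reverse=True)
```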
Deploy the best combination to production with guardrails
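Guardrails can start as simple pre/post checks wrapped around the deployed system; the blocked-topic list here is purely illustrative:

```python
# Sketch: wrap the deployed system with input- and output-side guardrails.
BLOCKED_TOPICS = ("password", "ssn")  # illustrative placeholder list

def guarded_answer(query: str, run_system) -> str:
    if any(topic in query.lower() for topic in BLOCKED_TOPICS):
        return "Sorry, I can't help with that."  # input-side guardrail
    answer = run_system(query)
    if not answer.strip():
        return "Sorry, I couldn't find a reliable answer."  # output-side guardrail
    return answer
```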
Analyze patterns from production to find failure scenarios, in order to improve a) the eval dataset, b) the quality of context, or c) the quality of the model (e.g. fine-tuning).
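A rough triage sketch routing a logged failure to one of those three targets; the routing rules are assumptions, not a standard recipe:

```python
# Sketch: route a logged production failure to an improvement target.
def triage(record: dict, eval_inputs: set[str]) -> str:
    # record: {"query": str, "retrieved_context": str, "answer": str}
    if record["query"] not in eval_inputs:
        return "a) grow the eval dataset with this kind of input"
    if not record["retrieved_context"].strip():
        return "b) improve context quality (retrieval missed)"
    return "c) improve model quality (e.g. fine-tuning)"
```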
Repeat the Deploy & Analyze cycle.
Note: The quality of the context (RAG) is as important as the quality of the model.
via LLMOps - swarooch notes