This PR adds a script to support improving the performance (accuracy, cost, and latency) of a vanna app.
The Problem:
The various components and prompts all contribute to performance, but it is not clear how much each of them affects it.
Making improvements currently means changing something and then manually assessing the new outputs, which does not scale.
Context:
vn.ask() carries out RAG in multiple steps, each of which can be optimised:
- Retrieves examples of 3 different data types (SQL, DDL etc.)
  - parameters: embedding model chosen, retrieval system, retrieval parameters
- Connects to an LLM
  - parameters: model chosen, fine-tuned vs not
- Prompts the LLM about each of these in different ways
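To make the tuning surface concrete, here is a minimal runnable sketch of that retrieve → prompt → generate flow. All function names, parameters, and the canned context are illustrative stand-ins, not vanna's actual API:

```python
# Minimal sketch of the multi-step RAG flow described above.
# All names and data here are illustrative, not vanna's API.

def retrieve_context(question, n_results=10):
    """Stand-in for the retrieval step: fetch similar SQL examples,
    DDL statements, etc. In a real app this would embed `question`
    and query a vector store; here we return canned context."""
    return {
        "sql_examples": ["SELECT count(*) FROM users"],
        "ddl": ["CREATE TABLE users (id INT, name TEXT)"],
    }

def build_prompt(question, context):
    """Stand-in for the prompting step: combine retrieved pieces
    into one LLM prompt (each piece can be prompted differently)."""
    parts = ["You generate SQL for the schema below."]
    parts += context["ddl"] + context["sql_examples"]
    parts.append(f"Question: {question}")
    return "\n".join(parts)

def call_llm(prompt):
    """Stand-in for the LLM call (model choice, fine-tune vs not)."""
    return "SELECT count(*) FROM users"

def ask(question):
    context = retrieve_context(question)
    prompt = build_prompt(question, context)
    return call_llm(prompt)

sql = ask("How many users are there?")
```

Each of the three stubs corresponds to one of the parameter groups listed above, which is what makes per-step evaluation worthwhile.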
Further improvements to vanna in the future could open up even more possibilities, such as:
- Self-corrective systems that diagnose the SQL error and retry the database call
- Chain-of-thought reasoning for complex questions
- Multi-hop programs for complex SQL generation, i.e. breaking a question into multiple SQL sub-queries to validate a hypothesised correct SQL
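The self-corrective idea can be sketched as a small retry loop: run the generated SQL, and on failure feed the error message back into generation. Everything here (the fake executor, the canned generator) is hypothetical scaffolding, not an existing vanna feature:

```python
# Sketch of a self-corrective loop: diagnose the SQL error, retry.
# The executor and generator below are hypothetical stand-ins.

def run_sql(sql):
    """Stand-in for the database call; rejects SQL with no FROM clause."""
    if "FROM" not in sql.upper():
        raise ValueError("syntax error: missing FROM clause")
    return [(42,)]

def generate_sql(question, previous_error=None):
    """Stand-in for the LLM call; a real system would include the
    error message in the retry prompt."""
    if previous_error is None:
        return "SELECT count(*)"          # first attempt: broken SQL
    return "SELECT count(*) FROM users"   # corrected after seeing the error

def ask_with_retry(question, max_attempts=2):
    error = None
    for _ in range(max_attempts):
        sql = generate_sql(question, previous_error=error)
        try:
            return run_sql(sql)
        except ValueError as exc:  # diagnose the failure, then retry
            error = str(exc)
    raise RuntimeError(f"gave up after {max_attempts} attempts: {error}")

rows = ask_with_retry("How many users are there?")
```

An evaluation harness would then measure how often the loop recovers, and at what added cost and latency.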
The Solution:
A script implements trulens-eval, allowing configuration of what is to be evaluated and how, and presenting the results in a dashboard (see the doc for visuals).
Evaluating the system with TruLens requires no changes to vanna itself (just adding a log to the vanna model). The alternative of building evaluation into the app's own code could require major refactoring to decouple the vanna components.
Other evaluation frameworks exist, though there are not many as of yet.
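The decoupled-evaluation idea above can be illustrated with a plain recording wrapper: the app is wrapped rather than modified, and each call's inputs, outputs, and latency are logged. This mimics the shape of what TruLens does; it is deliberately not the trulens-eval API, and `fake_ask` is a hypothetical stand-in for a vanna app:

```python
# Sketch of evaluation without changing the app: wrap it and record
# each call. Mimics the TruLens approach; not the trulens-eval API.
import time

class EvalRecorder:
    """Wrap any `ask(question) -> answer` callable and log each call."""

    def __init__(self, ask_fn):
        self.ask_fn = ask_fn
        self.records = []

    def ask(self, question):
        start = time.perf_counter()
        answer = self.ask_fn(question)
        self.records.append({
            "question": question,
            "answer": answer,
            "latency_s": time.perf_counter() - start,
        })
        return answer

def fake_ask(question):
    """Hypothetical stand-in for the vanna app under evaluation."""
    return "SELECT count(*) FROM users"

recorder = EvalRecorder(fake_ask)
recorder.ask("How many users are there?")
```

The recorded data is what a dashboard (or feedback functions scoring accuracy) would then aggregate over many example prompts.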
Tests performed
Manual testing only, using a few example prompts (shown in the code). No unit tests.