Open-source tools for prompt testing and experimentation, with support for both LLMs (e.g. OpenAI, LLaMA) and vector databases (e.g. Chroma, Weaviate, LanceDB).
This has BC-breaking changes: evaluation functions are now expected to take in a row of a pandas DataFrame (which contains inputs, responses, and previously computed metrics), plus optional additional keyword arguments.
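As a sketch of the new convention (the column names and the `contains_keyword` function here are hypothetical, not part of the library), an evaluation function receives a single DataFrame row plus optional keyword arguments and returns a score:

```python
import pandas as pd

# Hypothetical evaluation function under the new convention: it receives one
# DataFrame row (holding inputs, responses, and previously computed metrics)
# plus an optional keyword argument, and returns a numeric score.
def contains_keyword(row: pd.Series, keyword: str = "hello") -> float:
    return 1.0 if keyword in row["response"].lower() else 0.0

# Applying it across an experiment's results DataFrame; extra keyword
# arguments are forwarded to the evaluation function by DataFrame.apply.
df = pd.DataFrame({"prompt": ["Say hi"], "response": ["Hello there!"]})
df["keyword_match"] = df.apply(contains_keyword, axis=1, keyword="hello")
```

Passing the whole row (rather than individual fields) lets an evaluation function read earlier metrics computed for the same row.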
TODO to prevent breakage:
- Verify:
  - GPT4vsLlama2.ipynb
  - LlamaHeadToHead.ipynb
  - GPT4Regression
- Update Experiment.rank()

Follow-up:
- Ensure pivot functionality can still work where needed
- Refactor Experiment
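For the ranking and pivot items above, a minimal sketch of what row-wise metrics enable (this is illustrative plain pandas, not the library's actual `Experiment.rank()` or pivot implementation; the column names are assumptions):

```python
import pandas as pd

# Example results table with a per-row metric, as produced by evaluation
# functions operating on DataFrame rows.
df = pd.DataFrame({
    "model": ["gpt-4", "gpt-4", "llama-2", "llama-2"],
    "prompt": ["p1", "p2", "p1", "p2"],
    "score": [0.9, 0.8, 0.7, 0.95],
})

# Ranking: aggregate the per-row metric by model, then sort descending.
ranking = df.groupby("model")["score"].mean().sort_values(ascending=False)

# Pivot view of the same results: prompts as rows, models as columns.
pivoted = df.pivot(index="prompt", columns="model", values="score")
```

Because metrics live in ordinary DataFrame columns, both ranking and pivoting reduce to standard pandas operations on the results table.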