Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
An executable Python notebook (e.g. Colab) would be a good demonstration of the Python API and its internals. Levanter and DSP both have good notebooks.
Things to document:
[ ] How to use AutoClient and Service, and how to provide the credentials and cache
[ ] Step-by-step internals of what Runner does during a single evaluation run.
[ ] How to add your own model
[ ] How to add your own scenario (might be tricky)
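For the first checklist item, the core pattern to document is that a client is constructed from a credentials mapping plus a cache location, and that cached responses short-circuit repeated API calls. The sketch below is a self-contained toy illustrating that shape only; `SimpleClient` and `load_credentials` are hypothetical stand-ins, not the real `AutoClient`/`Service` API, so the actual class names and signatures should be taken from the `helm` package itself.

```python
# Toy sketch of the credentials-and-cache pattern (NOT the real HELM API).
import json
import sqlite3
from pathlib import Path


def load_credentials(path: Path) -> dict:
    """Read a {provider_name: api_key} mapping from a JSON file."""
    return json.loads(path.read_text())


class SimpleClient:
    """Illustrative client: one API key per provider, responses cached on disk."""

    def __init__(self, credentials: dict, cache_path: str):
        self.credentials = credentials
        self.cache = sqlite3.connect(cache_path)
        self.cache.execute(
            "CREATE TABLE IF NOT EXISTS responses (prompt TEXT PRIMARY KEY, text TEXT)"
        )

    def request(self, provider: str, prompt: str) -> str:
        row = self.cache.execute(
            "SELECT text FROM responses WHERE prompt = ?", (prompt,)
        ).fetchone()
        if row:
            return row[0]  # cache hit: no API call is made
        key = self.credentials[provider]  # would authenticate the real call
        text = f"<completion for {prompt!r}>"  # stand-in for the provider's response
        self.cache.execute("INSERT INTO responses VALUES (?, ?)", (prompt, text))
        self.cache.commit()
        return text


# Usage: an in-memory sqlite cache keeps the demo dependency-free.
client = SimpleClient({"openai": "sk-demo"}, ":memory:")
print(client.request("openai", "Hello"))
```

The design point worth documenting is that the second identical request never touches the provider: it is served from the cache, which is what makes evaluation runs cheap to resume.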
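For the Runner-internals item, a single evaluation run reduces to a pipeline: a scenario yields instances, an adapter turns each instance into a prompt, a client queries the model, and a metric scores the outputs. The toy below walks those four steps end to end; every name in it is illustrative, not the real `Runner` internals.

```python
# Toy sketch of one evaluation run: scenario -> adapter -> client -> metric.
# All names are illustrative stand-ins for the real Runner pipeline.
from dataclasses import dataclass


@dataclass
class Instance:
    input: str
    reference: str  # gold answer


def build_instances() -> list:
    # Step 1: the scenario yields evaluation instances.
    return [Instance("2 + 2 =", "4"), Instance("3 + 5 =", "8")]


def adapt(instance: Instance) -> str:
    # Step 2: the adapter turns an instance into a model prompt.
    return f"Answer with a number only. {instance.input}"


def query_model(prompt: str) -> str:
    # Step 3: the client sends the prompt to a model.
    # Stubbed here: parse the arithmetic out of the prompt and compute it.
    expression = prompt.split(".")[-1].rstrip("= ")
    return str(eval(expression))


def exact_match(instances: list) -> float:
    # Step 4: a metric aggregates per-instance scores into a run-level number.
    correct = sum(query_model(adapt(i)) == i.reference for i in instances)
    return correct / len(instances)


print(exact_match(build_instances()))  # → 1.0
```

Documenting the Runner against this skeleton makes it easy to say which real class owns each step and where caching and request batching slot in.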
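For the add-your-own-scenario item (the potentially tricky one), the docs mainly need to show the shape a scenario takes: a named class whose instance-builder returns inputs paired with gold references. The sketch below is a hypothetical minimal example; the base interface, field names, and registration step of HELM's actual `Scenario` class are assumptions and should be checked against the codebase.

```python
# Illustrative shape of a custom scenario (hypothetical, not HELM's interface).
from dataclasses import dataclass


@dataclass
class Instance:
    input: str
    references: list  # gold answers for this input


class ReversalScenario:
    """Toy task: given a word, the model should output it reversed."""

    name = "reversal"

    def get_instances(self) -> list:
        # A real scenario would download and parse a dataset here.
        words = ["helm", "model", "eval"]
        return [Instance(input=w, references=[w[::-1]]) for w in words]


scenario = ReversalScenario()
for instance in scenario.get_instances():
    print(instance.input, "->", instance.references[0])
```

The tricky part to cover in the docs is everything this toy omits: dataset download and caching, train/test splits, and registering the scenario so a run spec can refer to it by name.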