The current version of the Prometheus entrypoint mimics kaistai's eval and saves eval outputs in a JSON file together with the input data for easier comparison, so that e.g. all questions, model responses, and GPT-4 + Prometheus scores are ready to be compared side by side.
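A minimal sketch of what such a combined output file could look like; the field names (`question`, `model_response`, `gpt4_score`, `prometheus_score`) and the `save_eval_outputs` helper are illustrative assumptions, not the actual entrypoint code:

```python
import json
from pathlib import Path

def save_eval_outputs(examples: list[dict], path: Path) -> None:
    """Write one JSON file that pairs each input example with its eval outputs.

    `examples` is assumed to already hold both the input fields and the judge
    scores; the field names below are placeholders for illustration only.
    """
    records = [
        {
            "question": ex["question"],                  # original input prompt
            "model_response": ex["model_response"],      # model output being judged
            "gpt4_score": ex["gpt4_score"],              # reference judge score
            "prometheus_score": ex["prometheus_score"],  # Prometheus judge score
        }
        for ex in examples
    ]
    path.write_text(json.dumps(records, indent=2))
```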
This can be improved, e.g.:
- better defining how we store generated results (in the local, Ray, and wandb-backed scenarios)
- improving the current format of the wandb artifacts, e.g. whether to keep the current file-based format and/or add cumulative metrics per test (see the sketch after this list)
- moving away from storing the full eval dataset towards storing individual models' evaluations, since lineage always allows us to recover the input dataset
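A hedged sketch of what logging both the file-based artifact and cumulative metrics to wandb might look like; the artifact name, metric names, and the `log_eval_to_wandb` helper are assumptions, not the current implementation:

```python
import json
import statistics
import wandb

def log_eval_to_wandb(results_path: str, project: str = "prometheus-evals") -> None:
    """Log the eval output file as a wandb artifact plus cumulative run metrics."""
    with open(results_path) as f:
        records = json.load(f)

    run = wandb.init(project=project, job_type="eval")

    # Keep the file-based format: attach the raw JSON as an artifact.
    artifact = wandb.Artifact("prometheus-eval-outputs", type="evaluation")
    artifact.add_file(results_path)
    run.log_artifact(artifact)

    # Add cumulative metrics so a test can be compared at a glance.
    prometheus_scores = [r["prometheus_score"] for r in records]
    gpt4_scores = [r["gpt4_score"] for r in records]
    run.summary["prometheus_score_mean"] = statistics.mean(prometheus_scores)
    run.summary["gpt4_score_mean"] = statistics.mean(gpt4_scores)

    run.finish()
```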
Closed because it is not relevant anymore (we might pick it up again if we decide to add LLM-as-judge back into our evals, but that will likely be part of a larger effort).