The current version of the Prometheus entrypoint mimics kaistai's eval and saves eval outputs in a JSON file together with the input data for easier comparison, so that e.g. all questions, model responses, and GPT-4 + Prometheus scores are ready to be compared side by side.
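A minimal sketch of what such a combined output file could look like; the field names (`question`, `model_response`, `gpt4_score`, `prometheus_score`) and the `save_eval_outputs` helper are illustrative assumptions, not the actual entrypoint code:

```python
import json
from pathlib import Path

def save_eval_outputs(examples: list[dict], path: Path) -> None:
    """Write one JSON file that pairs each input example with its eval outputs.

    `examples` is assumed to already hold both the input fields and the judge
    scores; the field names below are placeholders for illustration only.
    """
    records = [
        {
            "question": ex["question"],                  # original input prompt
            "model_response": ex["model_response"],      # model output being judged
            "gpt4_score": ex["gpt4_score"],              # reference judge score
            "prometheus_score": ex["prometheus_score"],  # Prometheus judge score
        }
        for ex in examples
    ]
    path.write_text(json.dumps(records, indent=2))
```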
This can be improved, e.g.:
- better defining how we store generated results (in the local, Ray, and wandb-backed scenarios)
- improving the current format of the wandb artifacts, e.g. whether to keep the current file-based format and/or add cumulative metrics per test (see the sketch after this list)
- moving away from storing the full eval dataset towards storing individual models' evaluations, since lineage always allows us to recover the input dataset
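A hedged sketch of what logging both the file-based artifact and cumulative metrics to wandb might look like; the artifact name, metric names, and the `log_eval_to_wandb` helper are assumptions, not the current implementation:

```python
import json
import statistics
import wandb

def log_eval_to_wandb(results_path: str, project: str = "prometheus-evals") -> None:
    """Log the eval output file as a wandb artifact plus cumulative run metrics."""
    with open(results_path) as f:
        records = json.load(f)

    run = wandb.init(project=project, job_type="eval")

    # Keep the file-based format: attach the raw JSON as an artifact.
    artifact = wandb.Artifact("prometheus-eval-outputs", type="evaluation")
    artifact.add_file(results_path)
    run.log_artifact(artifact)

    # Add cumulative metrics so a test can be compared at a glance.
    prometheus_scores = [r["prometheus_score"] for r in records]
    gpt4_scores = [r["gpt4_score"] for r in records]
    run.summary["prometheus_score_mean"] = statistics.mean(prometheus_scores)
    run.summary["gpt4_score_mean"] = statistics.mean(gpt4_scores)

    run.finish()
```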
Closed because it is not relevant anymore (we might pick it up again if we decide to add LLM-as-judge back into our evals, but that will likely be part of a larger effort).