This seems like something that we may want to ask the W&B folks directly. They may have a suggestion for how we could best use Table artifacts within reports.
And if that functionality just doesn't exist, then yes, we could log the results in a different manner.
Thanks, I just sent a message to the W&B folks! I'm personally not keen on locking ourselves in with them too much, but since we are already logging everything there, I think it makes sense to at least get some basic plots for free!
Closing issue as wandb became optional
- With eval results logged as table artifacts, comparing metrics across more than two runs does not seem intuitive (or possible?). Example: try to compare Mistral and distilgpt2 on the `hellaswag` task (you open one json, then click on the other one and choose "compare"), but you cannot do that across multiple runs (see Figure 1).
- One can also save table results within the run itself, by adding something like the sketch below to `log_evaluation_artifact` in `jobs/lm_harness/entrypoint.py`. This allows joining information from multiple runs, but it is still not clear how to use aggregated info (e.g. mean and std) in the same plot (see Figure 2).
- I think artifacts have some additional value (e.g. versioning) with respect to run data. Still, they are probably not the best option if we want quick comparisons within wandb panels and reports. We should agree on how we want to log this information and whether we want to rely on wandb for aggregated plots or access artifact data programmatically.
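Roughly something along these lines (a minimal sketch, assuming the results are already available as a pandas DataFrame; apart from `log_evaluation_artifact`, the artifact name/type, table key and metric key are just illustrative placeholders):

```python
import pandas as pd
import wandb


def log_evaluation_artifact(run, results: pd.DataFrame) -> None:
    """Log eval results both as a versioned artifact and as plain run data."""
    table = wandb.Table(dataframe=results)

    # (Assumed) existing behavior: log the table inside a versioned artifact
    artifact = wandb.Artifact(name=f"eval-results-{run.id}", type="evaluation")
    artifact.add(table, "eval_results")
    run.log_artifact(artifact)

    # Extra line: also attach the table to the run itself, so panels and
    # reports can join and aggregate it across multiple runs
    run.log({"eval/results": table})
```

The extra `run.log` call is what makes the table show up as run data that panels and reports can aggregate over, while the artifact keeps its versioning benefits.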
(my 2 cents: perhaps as a start we could have both artifacts and run info to simplify wandb plotting, then evaluate if we miss anything after we run a large-enough set of evaluations)
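And if we end up going the programmatic route instead, pulling the tables back out of the artifacts via the public API could look roughly like this (entity/project, artifact type and table key are placeholders):

```python
import pandas as pd
import wandb

api = wandb.Api()
frames = []
for run in api.runs("my-entity/my-project"):  # placeholder entity/project
    for artifact in run.logged_artifacts():
        if artifact.type != "evaluation":  # placeholder artifact type
            continue
        table = artifact.get("eval_results")  # placeholder table key
        df = pd.DataFrame(data=table.data, columns=table.columns)
        df["run"] = run.name
        frames.append(df)

results = pd.concat(frames, ignore_index=True)

# Aggregated info across runs, e.g. mean and std of the numeric metrics
numeric_cols = results.select_dtypes("number").columns
print(results.groupby("run")[numeric_cols].agg(["mean", "std"]))
```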