This seems like something that we may want to ask the W&B folks directly. They may have a suggestion for how we could best use Table artifacts within reports.
And if that functionality just doesn't exist, then yes, we could log the results in a different manner.
Thanks, I just sent a message to the W&B folks! I'm personally not keen on locking ourselves in with them too much, but since we are already logging everything there, I think it makes sense to at least get some basic plots for free!
Closing issue as wandb became optional
- With eval results logged as table artifacts, comparing metrics across more than two runs does not seem intuitive (or possible?). Example: try to compare Mistral and distilgpt2 on the `hellaswag` task (you open one json, then click on the other one and choose "compare"), but you cannot do that across multiple runs (see Figure 1).
- One can also save table results within the run itself, by adding something like the sketch below to `log_evaluation_artifact` in `jobs/lm_harness/entrypoint.py`. This allows joining information from multiple runs, but it is still not clear how to use aggregated info (e.g. mean and std) in the same plot (see Figure 2).
- I think artifacts have some additional value (e.g. versioning) with respect to run data. Still, they are probably not the best option if we want quick comparisons within wandb panels and reports. We should agree on how we want to log this information and whether we want to rely on wandb for aggregated plots or access artifact data programmatically.
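Roughly something along these lines (a minimal sketch, assuming the results are already available as a pandas DataFrame; apart from `log_evaluation_artifact`, the artifact name/type, table key and metric key are just illustrative placeholders):

```python
import pandas as pd
import wandb


def log_evaluation_artifact(run, results: pd.DataFrame) -> None:
    """Log eval results both as a versioned artifact and as plain run data."""
    table = wandb.Table(dataframe=results)

    # (Assumed) existing behavior: log the table inside a versioned artifact
    artifact = wandb.Artifact(name=f"eval-results-{run.id}", type="evaluation")
    artifact.add(table, "eval_results")
    run.log_artifact(artifact)

    # Extra line: also attach the table to the run itself, so panels and
    # reports can join and aggregate it across multiple runs
    run.log({"eval/results": table})
```

The extra `run.log` call is what makes the table show up as run data that panels and reports can aggregate over, while the artifact keeps its versioning benefits.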
(my 2 cents: perhaps as a start we could have both artifacts and run info to simplify wandb plotting, then evaluate if we miss anything after we run a large-enough set of evaluations)
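And if we end up going the programmatic route instead, pulling the tables back out of the artifacts via the public API could look roughly like this (entity/project, artifact type and table key are placeholders):

```python
import pandas as pd
import wandb

api = wandb.Api()
frames = []
for run in api.runs("my-entity/my-project"):  # placeholder entity/project
    for artifact in run.logged_artifacts():
        if artifact.type != "evaluation":  # placeholder artifact type
            continue
        table = artifact.get("eval_results")  # placeholder table key
        df = pd.DataFrame(data=table.data, columns=table.columns)
        df["run"] = run.name
        frames.append(df)

results = pd.concat(frames, ignore_index=True)

# Aggregated info across runs, e.g. mean and std of the numeric metrics
numeric_cols = results.select_dtypes("number").columns
print(results.groupby("run")[numeric_cols].agg(["mean", "std"]))
```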