stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

Make output_path available for perturbations and metrics. #493

Open dilarasoylu opened 2 years ago

dilarasoylu commented 2 years ago

Purpose

The information retrieval metrics require access to a qrels file that is downloaded as part of the calling scenario (discussed in #473). This path is currently hard-coded because there doesn't seem to be a way to obtain it in run_specs.py. We need to make this path available to the metrics that require it.

An alternative is for the information retrieval metric to download the exact same qrels file itself, but this introduces quite a few more dependencies.
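
For illustration only, a minimal sketch of the kind of hard-coding described above; the subdirectory and file name are hypothetical:

```python
# Hypothetical sketch: the metric rebuilds the qrels path itself with a
# hard-coded output directory, because run_specs.py exposes no way to pass
# the configured path down to it.
import os

# "benchmark_output" is the hard-coded default; the subpath and file name
# below are made up for illustration.
QRELS_FILE_PATH = os.path.join("benchmark_output", "scenarios", "information_retrieval", "qrels.tsv")
```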

rishibommasani commented 2 years ago

@dilarasoylu I think I follow this, but what exactly do you think needs to be improved?

It seems like the current version is fine, though perhaps there is something a bit more elegant. If anything, I wonder if the hard-coding of benchmark_output is an issue/too restrictive.

dilarasoylu commented 2 years ago

It seems that output_path is a command-line parameter. @teetone has been using benchmark_output as the default output path so far, but it is possible to supply a different one. This path is propagated into scenarios (through the self.output_path attribute), but not to the metrics and perturbations.

The goal is to propagate the output_path specified on the command line to the metrics and perturbations, similar to how we do it for the scenarios.
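
As a rough sketch (the class and attribute names here are hypothetical, not the actual HELM interfaces), the runner could hand the command-line output_path to perturbations the same way scenarios already receive it:

```python
import os
from dataclasses import dataclass


@dataclass
class ExamplePerturbation:
    # Hypothetical attribute, mirroring Scenario.output_path: the runner
    # would set this from the command-line output_path instead of relying
    # on a hard-coded "benchmark_output".
    output_path: str = "benchmark_output"

    def qrels_path(self) -> str:
        # Resolve auxiliary files relative to the configured output path.
        return os.path.join(self.output_path, "scenarios", "qrels.tsv")


# The runner would construct the perturbation with whatever path the user supplied:
perturbation = ExamplePerturbation(output_path="/data/helm_runs")
print(perturbation.qrels_path())
```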

rishibommasani commented 2 years ago

Ah got it, that makes sense. Will look into it.

percyliang commented 2 years ago

I don't think we need this anymore based on how the new information retrieval scenario is defined?

rishibommasani commented 2 years ago

Closing for now, @dilarasoylu let me know if this is still needed.

dilarasoylu commented 2 years ago

Thanks Rishi!

This is still needed for perturbations. It won't change any practical functionality.

We are currently hardcoding "benchmark_output".

rishibommasani commented 2 years ago

Ok, I am going to re-open this but mark it p3. Maybe @dtsip can handle it if it's related to perturbations (IIRC we simply need to thread this path through to get it to the perturbations)?

yifanmai commented 11 months ago

I added get_benchmark_output_path() in #2109, which returns the value set through the flag.

The remaining work would be to find all instances where "benchmark_output" is hardcoded in metrics and perturbations, and replace them with calls to this method.
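
For example (a sketch, assuming get_benchmark_output_path() is importable from the runner module; the subpath below is purely illustrative):

```python
import os

# Assumed import location; the actual module added in #2109 may differ.
from helm.benchmark.runner import get_benchmark_output_path

# Before: the output directory was hard-coded.
legacy_path = os.path.join("benchmark_output", "scenarios", "qrels.tsv")

# After: use the path set through the command-line flag.
qrels_path = os.path.join(get_benchmark_output_path(), "scenarios", "qrels.tsv")
```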

yifanmai commented 10 months ago

Related: #1918

AnshKhurana commented 10 months ago

Working on this!

yifanmai commented 9 months ago

@AnshKhurana Are you still working on this?