stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

Verify results make sense #637

Closed rishibommasani closed 2 years ago

rishibommasani commented 2 years ago

Notes:

  1. Wait until #635 is merged so we can use tables to facilitate this.
  2. Pay attention to whether the standard metric is actually the correct metric for each scenario (and update the schema to address any errors).
  3. Pay attention to whether the standard metrics are generally better than the robustness/fairness metrics (related to #626 and #597).
  4. Pay attention to ICE and the various harms metrics (especially the bias metrics), since I am not fully sold on all of these being correct.
percyliang commented 2 years ago

Also compare with reported numbers (both SOTA and LLM evals).

rishibommasani commented 2 years ago

@percyliang With the new results from v4, I have:

  1. Compared our results to all prior results for the same (scenario, model) pairs. Prior results are documented in https://docs.google.com/spreadsheets/d/1jpm0Cy0r5Yk_l9o9kwVpxbSfdKrx6ZcQ7iQfs7QYPIc/edit#gid=1233800927, with bold indicating we have the same number (i.e. the same (scenario, model, metric) triple). Blue means our results are notably better (2+ points better) and pink means notably worse (2+ points worse).

So far, the only issue I have found is with the commonsense QA scenarios, mainly HellaSwag. We also do worse on OpenBookQA, but this may be the adaptation issue we recently discussed. For BoolQ, we have reasonable results but happen to do noticeably worse for the GPT-3 models; I am not sure why, though I don't think it's much of an issue. For QuAC, we do worse than the reported numbers but still fairly well (as discussed below), and, if I remember correctly, our evaluation conditions are more legitimate and more challenging than what OpenAI used in their work, so this is not surprising. Finally, in a few cases we do a little better, but never by more than ~8 points and usually by 2-3 points (conditional on doing better, i.e. >2 points).

TL;DR: I am satisfied with our ability to replicate all past results for AI21 and OpenAI models, with the exception of HellaSwag.

  2. For all scenarios where we have a SOTA documented (which scenario authors are responsible for; if they do not end up doing so, I will eventually do it myself), I have compared our results in https://docs.google.com/spreadsheets/d/1jpm0Cy0r5Yk_l9o9kwVpxbSfdKrx6ZcQ7iQfs7QYPIc/edit#gid=142619249. To save time, pay attention to column B, where I detail the contrast. We set several new SOTAs here and make several intriguing findings (see NarrativeQA, XSUM, RAFT, IMDB, LSAT, CivilComments, and TruthfulQA in particular).

TL;DR: The evaluations generally fare pretty well against SOTA and prompting SOTAs, and we make some cool findings.

So, overall, in terms of verifying that our results are sane and pass the sniff test, I am fairly confident. Further, our results convey interesting findings and match up wherever we have comparisons. I will revisit this for the other models once the Together results are all in, but things are looking pretty good!

The only thing I really see as concerning is HellaSwag, which I let Michi know about on Slack (@michiyasunaga). I am still waiting to look through the language modeling results and the Together models.

I am going to close this issue for now, as I see no pending matter that we need an issue for, though feel free to reopen it if you think that is better.

percyliang commented 2 years ago

On the whole, sounds great! Definitely need to fix HellaSwag. I do think it's worth tracking down the differences on OpenBookQA, BoolQ, and any other places where we are doing an exact replication that's not matching up, in case they help us resolve bugs.

rishibommasani commented 2 years ago

Cool, sounds good - I will make P2 issues and ping the respective scenario creators.

I think the main point is that we may not be able to really determine whether the delta is caused by:

  1. Something we did differently that is fundamentally worse overall. We know we intentionally did some things differently, e.g. our policy for filling in-context examples with class balancing, our use of the same in-context examples across eval instances (in contrast to the GPT-3 approach), and sometimes different prompts since we standardize them; see the sketch after this list.
  2. Sources of variation that are fundamental to the paradigm and that we don't try to minimize, e.g. different in-context examples (hence also a different example order), different test examples (since we use only 1k), and different outcomes even given the same output distribution, due to sampling.
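For concreteness, here is a minimal sketch of what a class-balanced in-context example policy could look like. The function and field names are hypothetical and not HELM's actual implementation; the point is just that the same selected examples are fixed up front and reused across eval instances.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_class_balanced_examples(
    train_instances: List[Dict], num_examples: int, seed: int = 0
) -> List[Dict]:
    """Hypothetical sketch: pick in-context examples by cycling over label
    classes so each class is represented roughly equally."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Dict]] = defaultdict(list)
    for instance in train_instances:
        by_label[instance["label"]].append(instance)
    for pool in by_label.values():
        rng.shuffle(pool)

    selected: List[Dict] = []
    labels = sorted(by_label)
    i = 0
    while len(selected) < num_examples and any(by_label[l] for l in labels):
        label = labels[i % len(labels)]
        if by_label[label]:
            selected.append(by_label[label].pop())
        i += 1
    # The same selected examples would then be reused for every eval instance,
    # unlike the per-instance sampling used in the GPT-3 paper's evaluations.
    return selected
```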

So I think, since we have narrowed it down to only a few scenarios, it's totally worth exploring to explain why, but it is not dire straits if we cannot figure it out.

michiyasunaga commented 2 years ago

Regarding HellaSwag (and OBQA): To my knowledge, Yuhui (@yuhui-zh15)'s implementation reproduced the GPT-3 results in earlier versions of our codebase: https://github.com/stanford-crfm/benchmarking/pull/90 (reproduced both HellaSwag and OBQA), https://github.com/stanford-crfm/benchmarking/pull/604 (reproduced OBQA), https://github.com/stanford-crfm/benchmarking/pull/826 (reproduced OBQA). I agree that some of the changes Rishi mentioned might be the cause of the performance change.

yuhui-zh15 commented 2 years ago

Hi! The HellaSwag performance difference seems to be because we are using the wrong adaptation method.

In the GitHub issue (#90), I reported that the best adaptation is MCQA-separate:

  * GPT-3 175B (paper, using CLM): 78.9
  * GPT-3 175B CLM: 81.5 on 200 randomly sampled questions
  * GPT-3 175B CalibratedCLM: 46.0 on 200 randomly sampled questions
  * GPT-3 175B MCQA: 28.0 on 200 randomly sampled questions

But from the CRFM website, it seems we are using MCQA-separate-calibrated (https://crfm-models.stanford.edu/static/benchmarking.html?runs). The 46% result on the CRFM website matches my previous results, so it is just that we are using the wrong adaptation method :)
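To make the distinction concrete, here is a rough sketch of the two scoring rules, assuming a hypothetical `total_logprob(prompt, continuation)` helper rather than the actual benchmarking code:

```python
from typing import Callable

# Hypothetical helper: returns the total log-probability the model assigns to
# `continuation` when it follows `prompt` (assumed, not a real HELM function).
LogprobFn = Callable[[str, str], float]


def score_separate(total_logprob: LogprobFn, question: str, option: str) -> float:
    # MCQA-separate (non-calibrated CLM): score each answer option as a
    # continuation of the question; the highest-scoring option is predicted.
    return total_logprob(question, option)


def score_separate_calibrated(total_logprob: LogprobFn, question: str, option: str) -> float:
    # MCQA-separate-calibrated: additionally subtract the option's
    # log-probability under a neutral "Answer:" prompt to correct for the
    # model's prior over the answer text (the calibration trick from the
    # GPT-3 paper). Per the numbers above, this appears to hurt on HellaSwag
    # (~46 vs ~81.5).
    return total_logprob(question, option) - total_logprob("Answer:", option)
```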

michiyasunaga commented 2 years ago

Thanks @yuhui-zh15 for the insight!

rishibommasani commented 2 years ago

Oh fantastic, so @yuhui-zh15 we should:

  1. Use non-calibrated CLM (MCQA-separate) for HellaSwag
  2. Use joint for OpenBookQA (see the sketch below for the two adaptation styles)
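For reference, a hedged sketch of how the joint and separate adaptation styles might construct prompts; the function names here are illustrative, not the repo's actual API:

```python
import string
from typing import List


def build_joint_prompt(question: str, options: List[str]) -> str:
    # Joint adaptation: present all options in a single prompt and ask the
    # model to generate the letter of the correct answer.
    lines = [f"Question: {question}"]
    lines += [f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_separate_prompts(question: str, options: List[str]) -> List[str]:
    # Separate (CLM) adaptation: one prompt per option; each option is scored
    # as a continuation of the question, and the option with the highest
    # log-probability is taken as the prediction.
    return [f"{question} {opt}" for opt in options]
```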
rishibommasani commented 2 years ago

@percyliang Looks like we may have found a solution for the HellaSwag/OpenBookQA numbers (which also strengthens the ablation we will run on adaptation methods). The QuAC discrepancy is to be expected since we are doing a harder evaluation, and we have an open issue for BoolQ.

So that is good progress, and we will revisit this as more SOTAs get reported by folks.