Also compare with reported numbers (both SOTA and LLM evals).
@percyliang With the new results from v4, here is where things stand:
So far, the only real issue I have found is with the commonsense QA scenarios, mainly HellaSwag. We also do worse on OpenBookQA, but that may be the adaptation issue we recently discussed. For BoolQ, we have reasonable results but do noticeably worse for the GPT-3 models; I am not sure why, though I don't think it's much of an issue. For QuAC, our numbers are lower than the reported ones, but we still do fairly well in absolute terms (as discussed below), and, if I remember correctly, we use more legitimate and more challenging evaluation conditions than OpenAI did, so this is not surprising. Finally, in a few cases we do a little better than the reported numbers, but never by more than ~8 points and usually by 2-3 points (conditional on doing better, i.e. >2 points).
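As a concrete illustration of this kind of check, here is a minimal, hypothetical sketch that flags scenarios whose delta against the reported number exceeds a small tolerance; the helper logic and the non-HellaSwag numbers are placeholders for illustration, not actual v4 results.

```python
# Hypothetical sanity check: flag scenarios whose accuracy drifts from the
# reported number by more than a small tolerance. The HellaSwag numbers are
# the ones discussed later in this thread; the others are placeholders.
reported = {"hellaswag": 78.9, "openbookqa": 60.0, "boolq": 77.0}
ours = {"hellaswag": 46.0, "openbookqa": 57.0, "boolq": 75.0}

TOLERANCE = 2.0  # points; deltas within this are treated as replication noise

for scenario, ref in reported.items():
    delta = ours[scenario] - ref
    status = "investigate" if abs(delta) > TOLERANCE else "within tolerance"
    print(f"{scenario}: {delta:+.1f} points vs. reported ({status})")
```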
TL;DR: I am satisfied with our ability to replicate all past results for AI21 and OpenAI models, with the exception of HellaSwag.
TL;DR: Our evaluations generally compare well against SOTA and prompting-based SOTA numbers, and we make some cool findings.
So, overall, in terms of verifying that our results are sane and pass the sniff test, I am fairly confident. Further, our results convey interesting findings and match up wherever we have comparisons. I will revisit this for other models once the Together results are all in, but things are looking pretty good!
The only thing I really see as concerning is HellaSwag, which I let Michi know about on Slack (@michiyasunaga); I am still waiting to look through the language modeling results and the Together models.
I am going to close this issue for now as I see no pending matter that needs an issue, though feel free to reopen it if you think that is better.
On the whole, sounds great! Definitely need to fix HellaSwag. I do think it's worth tracking down the differences on OpenBookQA, BoolQ, and any other places where we are doing an exact replication that's not matching up, in case they help us resolve bugs.
Cool, sounds good - I will make P2 issues and ping the respective scenario creators.
I think the main point is that we may not be able to really clarify what the delta is caused by:
So I think, since we have narrowed it down to only a few scenarios, it is totally worth exploring to explain why, but it is not dire straits if we cannot figure it out.
Regarding HellaSwag (and OBQA): To my knowledge, Yuhui (@yuhui-zh15)'s implementation reproduced the GPT-3 results in earlier versions of our codebase: https://github.com/stanford-crfm/benchmarking/pull/90 (reproduced both HellaSwag and OBQA), https://github.com/stanford-crfm/benchmarking/pull/604 (reproduced OBQA), https://github.com/stanford-crfm/benchmarking/pull/826 (reproduced OBQA). I agree that some of the changes Rishi mentioned might be the cause of the performance change.
Hi! The HellaSwag performance difference seems to be because we are using the wrong adaptation method.
In GitHub issue #90, I reported that the best adaptation method is MCQA-separate:
- GPT-3 175B (paper, using CLM): 78.9
- GPT-3 175B CLM: 81.5 on 200 randomly sampled questions
- GPT-3 175B CalibratedCLM: 46.0 on 200 randomly sampled questions
- GPT-3 175B MCQA: 28.0 on 200 randomly sampled questions
But from the CRFM website, it seems we are using MCQA-separate-calibrated (https://crfm-models.stanford.edu/static/benchmarking.html?runs). So the 46% result on the CRFM website matches my previous results; it is just that we are using the wrong adaptation method :)
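For readers less familiar with these adaptation methods, here is a rough sketch of how the three scoring strategies differ on a single multiple-choice example. `logprob` is a hypothetical stand-in for a language-model scoring call, none of the function names below come from the codebase, and the neutral "Answer:" calibration prompt follows the GPT-3 paper's description rather than the exact prompt used in our runs.

```python
from typing import List

def logprob(prompt: str, continuation: str) -> float:
    """Hypothetical helper: total log-probability the LM assigns to
    `continuation` when conditioned on `prompt` (stands in for a model call)."""
    raise NotImplementedError

def pick_clm(context: str, choices: List[str]) -> int:
    """MCQA-separate (CLM): score each choice separately as a continuation of
    the context and pick the highest-scoring one."""
    scores = [logprob(context, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

def pick_calibrated_clm(context: str, choices: List[str]) -> int:
    """MCQA-separate-calibrated: subtract each choice's score under a neutral
    prompt, correcting for how likely the choice text is on its own. This is
    the calibrated variant that lands around 46% on HellaSwag above."""
    scores = [logprob(context, c) - logprob("Answer:", c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

def pick_joint_mcqa(context: str, choices: List[str]) -> int:
    """MCQA (joint): present all choices with letter labels and score which
    answer letter the model prefers."""
    letters = "ABCDE"
    prompt = context + "\n" + "\n".join(
        f"{letters[i]}. {c}" for i, c in enumerate(choices)) + "\nAnswer:"
    scores = [logprob(prompt, f" {letters[i]}") for i in range(len(choices))]
    return max(range(len(choices)), key=lambda i: scores[i])
```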
Thanks @yuhui-zh15 for the insight!
oh fantastic, so @yuhui-zh15 we should:
@percyliang Looks like we may have found a solution for the HellaSwag/OpenBookQA numbers (which also strengthens the ablation we will run on adaptation methods). The QuAC discrepancy is to be expected since we are doing a harder evaluation, and we have an open issue for BoolQ.
So that is good progress, and we will revisit this as more SOTAs get reported by folks.
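As a rough picture of what that adaptation-method ablation could look like, here is a hypothetical harness that scores the same examples under each method and compares accuracies; the names and types are made up for illustration (e.g., plugging in scoring functions like the pick_* sketches above) and are not part of the repo.

```python
from typing import Callable, Dict, List, Sequence, Tuple

PickFn = Callable[[str, List[str]], int]  # (context, choices) -> predicted index
Example = Tuple[str, List[str], int]      # (context, choices, gold_index)

def run_adaptation_ablation(methods: Dict[str, PickFn],
                            examples: Sequence[Example]) -> None:
    """Evaluate the same examples under each adaptation method and print accuracy."""
    for name, pick in methods.items():
        correct = sum(pick(ctx, choices) == gold for ctx, choices, gold in examples)
        print(f"{name}: {correct / len(examples):.1%} accuracy on {len(examples)} examples")
```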