Paired t-tests are the best approach, as the data are at the question level
There isn't any statistical inference in prompt engineering methods papers in the literature. Here's why (confirm this with Prof. Bamman):
We can't really do inference because we can only test each method once.
We have a tricky situation because we give the same questions to each method, which violates the independence assumptions behind standard parametric and non-parametric tests.
A paired t-test on questions is not OK because the per-question difference in getting a question right or wrong is -1, 0, or 1, which is not at all normal.
Similar independence problem.
Length, cost, vocabulary share of words, Flesch reading ease, ratios/differences of scores in prompts vs. responses, human assessments. All of these are question-level metrics. Theoretically, we would like to test their means across questions against no prompting.
One metric - change in accuracy divided by change in (average?) tokens for direct prompting vs. engineering - is computed at the method level, not question level, and it might make more sense to do a permutation test or bootstrapped CIs there.
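A minimal sketch of a bootstrapped CI for that method-level metric, assuming hypothetical per-question correctness and token-count arrays (all names and numbers below are placeholders); questions are resampled with replacement so the pairing with direct prompting is preserved.

```python
# Sketch: bootstrapped CI for the method-level efficiency metric
# (change in accuracy divided by change in average tokens, relative to
# direct prompting). All per-question data here is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_questions = 500

# Hypothetical per-question results: 0/1 correctness and token counts
# for direct prompting vs. an engineered prompting method.
correct_direct = rng.integers(0, 2, n_questions)
correct_method = rng.integers(0, 2, n_questions)
tokens_direct = rng.normal(150, 30, n_questions)
tokens_method = rng.normal(400, 80, n_questions)

def efficiency(idx):
    """Change in accuracy per change in average tokens, on a resample of questions."""
    d_acc = correct_method[idx].mean() - correct_direct[idx].mean()
    d_tok = tokens_method[idx].mean() - tokens_direct[idx].mean()
    return d_acc / d_tok

point_estimate = efficiency(np.arange(n_questions))

# Percentile bootstrap: resample question indices with replacement so the
# pairing between methods on the same question is preserved.
boot = np.array([
    efficiency(rng.integers(0, n_questions, n_questions))
    for _ in range(10_000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"efficiency = {point_estimate:.5f}, 95% bootstrap CI = ({lo:.5f}, {hi:.5f})")
```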
Maybe in future work, people should administer evaluations to LLMs multiple times using each method. In each eval, repeat the same questions. Then you can possibly do some inference.
A weird note is that, theoretically, one could use different questions for each method. This seems like it would limit comparability but might open up statistical inference?
Wald CIs for accuracy. Look at classification or other domain datasets.
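A minimal sketch of a Wald (normal-approximation) CI for an accuracy rate; the counts below are hypothetical placeholders.

```python
# Sketch: Wald (normal-approximation) CI for an accuracy proportion.
# n_questions and n_correct are hypothetical placeholders.
import math

n_questions = 1000
n_correct = 620

p_hat = n_correct / n_questions
se = math.sqrt(p_hat * (1 - p_hat) / n_questions)   # Wald standard error
z = 1.96                                            # ~95% normal critical value
print(f"accuracy = {p_hat:.3f}, 95% Wald CI = ({p_hat - z*se:.3f}, {p_hat + z*se:.3f})")
```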
Potentially use bootstrap CIs for other metrics.
It may be the case that n is large enough that standard errors are very tight...
We probably do have independence of the test questions drawn...
Normality comes from the CLT...
Is fixing the lack of inference in the existing literature a novel contribution?
For the Wald or proportions test, a CI-overlap conclusion approximates a difference-of-proportions conclusion if the samples have similar sample sizes and variances
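A minimal sketch contrasting the CI-overlap check with a direct Wald test for a difference of proportions, using hypothetical counts; note it treats the two samples as independent, which the identical-questions design only approximates.

```python
# Sketch: CI-overlap heuristic vs. a direct Wald test for a difference of
# proportions. Counts are hypothetical; independence is assumed.
import math

n = 1000                          # questions per method (same size for both)
correct_a, correct_b = 620, 660   # hypothetical accuracy counts

p_a, p_b = correct_a / n, correct_b / n
se_a = math.sqrt(p_a * (1 - p_a) / n)
se_b = math.sqrt(p_b * (1 - p_b) / n)
z = 1.96

ci_a = (p_a - z * se_a, p_a + z * se_a)
ci_b = (p_b - z * se_b, p_b + z * se_b)
overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Wald test for the difference of proportions (unpooled SE).
se_diff = math.sqrt(se_a**2 + se_b**2)
z_stat = (p_b - p_a) / se_diff
print(f"CI A = {ci_a}, CI B = {ci_b}, overlap = {overlap}")
print(f"difference = {p_b - p_a:.3f}, z = {z_stat:.2f}")
```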
Check whether paired proportions tests, paired t-tests, and paired permutation tests are appropriate
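A minimal sketch of two of these on hypothetical 0/1 per-question results: McNemar's test for paired proportions (via statsmodels, assuming it is available) and a sign-flip paired permutation test.

```python
# Sketch: paired tests on hypothetical 0/1 per-question results for
# direct prompting (a) vs. an engineered method (b).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar  # paired proportions test

rng = np.random.default_rng(0)
n = 500
a = rng.integers(0, 2, n)   # 1 = correct under direct prompting
b = rng.integers(0, 2, n)   # 1 = correct under the engineered method

# McNemar's test: only the discordant pairs (one method right, the other wrong)
# carry information about a difference in accuracy.
table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
         [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
print("McNemar p-value:", mcnemar(table, exact=False, correction=True).pvalue)

# Paired permutation test: randomly flip the sign of each per-question
# difference and compare the observed mean difference to the null distribution.
diffs = b.astype(float) - a.astype(float)
observed = diffs.mean()
flips = rng.choice([-1.0, 1.0], size=(10_000, n))
null = (flips * diffs).mean(axis=1)
p_perm = np.mean(np.abs(null) >= abs(observed))
print(f"mean difference = {observed:.3f}, permutation p-value = {p_perm:.3f}")
```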
Overall, it seems likely that formal statistical testing will be unnecessary: the sample size is large and the questions administered are identical across methods, leading to similar variances...
Tests will be very powerful...
Find the summarization paper with the paired t-test again
The paired t-test paper: https://aclanthology.org/2023.acl-srw.1.pdf
This can be done as a comparison of metrics between each method and direct prompting.
Use a t-test for a difference of sample means, or potentially a permutation test or a bootstrap; perhaps try all of them for robustness.
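A minimal sketch of that comparison on a hypothetical question-level metric (response length), running a paired t-test and a bootstrap CI on the per-question differences with scipy; a paired permutation test is sketched above.

```python
# Sketch: comparing a question-level metric (hypothetical response lengths)
# between an engineered method and direct prompting, with a paired t-test
# and a bootstrap CI on the per-question differences.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500
len_direct = rng.normal(120, 25, n)    # hypothetical response lengths
len_method = rng.normal(135, 30, n)

# Paired t-test on the same questions (difference of sample means).
t_stat, p_val = stats.ttest_rel(len_method, len_direct)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_val:.4f}")

# Bootstrap CI for the mean per-question difference, as a robustness check.
diffs = len_method - len_direct
res = stats.bootstrap((diffs,), np.mean, confidence_level=0.95, n_resamples=10_000)
print("95% bootstrap CI for mean difference:", res.confidence_interval)
```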
CIs might be the cleanest approach, as they would allow you to quickly calculate the side metrics based on the accuracy rates
Many papers in the literature do not include any statistical inference!