
LitCab

Code and benchmark for LitCab: Lightweight Calibration of Language Models on Outputs of Varied Lengths

CaT

The directories for the CaT datasets are listed below:

| Dataset    | Directory         |
|------------|-------------------|
| NQ         | NQ                |
| SciQ       | sciq              |
| TriviaQA   | triviaqa          |
| TruthfulQA | truthfulqa        |
| WikiQA     | wikiqa            |
| BioGen     | name_bio          |
| WikiGen    | factuality_prompt |
The training and evaluation files for each dataset, within the corresponding directory, are:

| Dataset    | Train                         | Test                |
|------------|-------------------------------|---------------------|
| NQ         | train.txt                     | test.txt            |
| SciQ       | train.txt                     | test.txt            |
| TriviaQA   | train.txt                     | test.txt            |
| TruthfulQA | train.txt                     | test.txt            |
| WikiQA     | train.txt                     | test.txt            |
| BioGen     | unlabeled_prompt_entities.txt | prompt_entities.txt |
| WikiGen    | train.jsonl                   | test.jsonl          |
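
For quick experimentation, the two tables above can be folded into a small lookup helper like the sketch below. This helper is hypothetical (it is not part of the repo), and the assumption that each `.txt` file holds one record per line should be checked against the actual data files:

```python
import json
from pathlib import Path

# Dataset -> (directory, train file, test file), transcribed from the tables above.
CAT_SPLITS = {
    "NQ":         ("NQ",                "train.txt",  "test.txt"),
    "SciQ":       ("sciq",              "train.txt",  "test.txt"),
    "TriviaQA":   ("triviaqa",          "train.txt",  "test.txt"),
    "TruthfulQA": ("truthfulqa",        "train.txt",  "test.txt"),
    "WikiQA":     ("wikiqa",            "train.txt",  "test.txt"),
    "BioGen":     ("name_bio",          "unlabeled_prompt_entities.txt", "prompt_entities.txt"),
    "WikiGen":    ("factuality_prompt", "train.jsonl", "test.jsonl"),
}

def load_split(dataset: str, split: str = "test", root: str = "."):
    """Return the records of one CaT split.

    .jsonl files are parsed line by line; .txt files are returned as
    raw lines, since their exact record layout varies by dataset.
    """
    directory, train_file, test_file = CAT_SPLITS[dataset]
    path = Path(root) / directory / (train_file if split == "train" else test_file)
    lines = path.read_text(encoding="utf-8").splitlines()
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in lines if line.strip()]
    return lines
```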

Evaluating LLMs on CaT

To evaluate a language model on all phrase- and sentence-level datasets, run the following command:

```bash
cd script
bash get_baselines.sh <model>
```

where `<model>` is the name of the model to evaluate. The script downloads the model, evaluates it on all of the datasets above, and saves the results in the `script/log` directory.
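
For example, assuming the script accepts Hugging Face model identifiers (check `get_baselines.sh` for the exact names it expects), an invocation might look like `bash get_baselines.sh meta-llama/Llama-2-7b-hf`.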

Please note that we call the OpenAI GPT-4 API through Azure for evaluation. Set the environment variable `AZURE_OPENAI_KEY` to your Azure OpenAI API key. You can also set the key manually on Line 13 of `src/get_gpt_correctness.py`.
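
As a minimal sketch of the key setup, assuming the current `openai` Python SDK (the actual client construction in `src/get_gpt_correctness.py` may differ, and the API version and endpoint below are placeholders):

```python
import os

from openai import AzureOpenAI  # assumes the openai v1 SDK; the repo may pin another version

# The evaluation code reads the key from AZURE_OPENAI_KEY,
# e.g. `export AZURE_OPENAI_KEY=...` before running the scripts.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",  # placeholder; match your Azure resource
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
)
```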

To evaluate a language model on BioGen, run the following command:

```bash
cd script
bash get_baseline_long.sh
```

where the names of the LLMs to evaluate are set on Line 3 of `script/get_baseline_long.sh`. The results will be saved in the `script/log` directory.
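
Line 3 of the script presumably defines the model list; a hypothetical edit could look like `MODELS=("meta-llama/Llama-2-7b-hf" "lmsys/vicuna-7b-v1.5")`, but check the script itself for the actual variable name and the model-name format it expects. The same applies to the WikiGen script below.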

To evaluate a language model on WikiGen, run the following command:

```bash
cd script
bash get_baseline_fp.sh
```

where the names of the LLMs to evaluate are set on Line 3 of `script/get_baseline_fp.sh`. The results will be saved in the `script/log` directory.