
LitCab

Code and benchmark for LitCab: Lightweight Calibration of Language Models on Outputs of Varied Lengths

CaT

The directories for the CaT datasets are listed below:

| Dataset    | Directory         |
|------------|-------------------|
| NQ         | NQ                |
| SciQ       | sciq              |
| TriviaQA   | triviaqa          |
| TruthfulQA | truthfulqa        |
| WikiQA     | wikiqa            |
| BioGen     | name_bio          |
| WikiGen    | factuality_prompt |
The training and evaluation files for each dataset, within the corresponding directory, are:

| Dataset    | Train                         | Test                |
|------------|-------------------------------|---------------------|
| NQ         | train.txt                     | test.txt            |
| SciQ       | train.txt                     | test.txt            |
| TriviaQA   | train.txt                     | test.txt            |
| TruthfulQA | train.txt                     | test.txt            |
| WikiQA     | train.txt                     | test.txt            |
| BioGen     | unlabeled_prompt_entities.txt | prompt_entities.txt |
| WikiGen    | train.jsonl                   | test.jsonl          |
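
For quick experimentation, the two tables above can be folded into a small lookup helper like the sketch below. This helper is hypothetical (it is not part of the repo), and the assumption that each `.txt` file holds one record per line should be checked against the actual data files:

```python
import json
from pathlib import Path

# Dataset -> (directory, train file, test file), transcribed from the tables above.
CAT_SPLITS = {
    "NQ":         ("NQ",                "train.txt",  "test.txt"),
    "SciQ":       ("sciq",              "train.txt",  "test.txt"),
    "TriviaQA":   ("triviaqa",          "train.txt",  "test.txt"),
    "TruthfulQA": ("truthfulqa",        "train.txt",  "test.txt"),
    "WikiQA":     ("wikiqa",            "train.txt",  "test.txt"),
    "BioGen":     ("name_bio",          "unlabeled_prompt_entities.txt", "prompt_entities.txt"),
    "WikiGen":    ("factuality_prompt", "train.jsonl", "test.jsonl"),
}

def load_split(dataset: str, split: str = "test", root: str = "."):
    """Return the records of one CaT split.

    .jsonl files are parsed line by line; .txt files are returned as
    raw lines, since their exact record layout varies by dataset.
    """
    directory, train_file, test_file = CAT_SPLITS[dataset]
    path = Path(root) / directory / (train_file if split == "train" else test_file)
    lines = path.read_text(encoding="utf-8").splitlines()
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in lines if line.strip()]
    return lines
```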

Evaluating LLMs on CaT

To evaluate a language model on all phrase- and sentence-level datasets, run the following command:

```bash
cd script
bash get_baselines.sh <model>
```

where `<model>` is the name of the model to evaluate. The script downloads the model, evaluates it on all of the datasets above, and saves the results in the `script/log` directory.
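
For example, assuming the script accepts Hugging Face model identifiers (check `get_baselines.sh` for the exact names it expects), an invocation might look like `bash get_baselines.sh meta-llama/Llama-2-7b-hf`.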

Please note that we call the OpenAI GPT-4 API through Azure for evaluation. Set the environment variable `AZURE_OPENAI_KEY` to your Azure OpenAI API key. You can also set the key manually on Line 13 of `src/get_gpt_correctness.py`.
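
As a minimal sketch of the key setup, assuming the current `openai` Python SDK (the actual client construction in `src/get_gpt_correctness.py` may differ, and the API version and endpoint below are placeholders):

```python
import os

from openai import AzureOpenAI  # assumes the openai v1 SDK; the repo may pin another version

# The evaluation code reads the key from AZURE_OPENAI_KEY,
# e.g. `export AZURE_OPENAI_KEY=...` before running the scripts.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",  # placeholder; match your Azure resource
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
)
```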

To evaluate a language model on BioGen, run the following command:

```bash
cd script
bash get_baseline_long.sh
```

where the names of the LLMs to evaluate are set on Line 3 of `script/get_baseline_long.sh`. The results will be saved in the `script/log` directory.
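
Line 3 of the script presumably defines the model list; a hypothetical edit could look like `MODELS=("meta-llama/Llama-2-7b-hf" "lmsys/vicuna-7b-v1.5")`, but check the script itself for the actual variable name and the model-name format it expects. The same applies to the WikiGen script below.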

To evaluate a language model on WikiGen, run the following command:

```bash
cd script
bash get_baseline_fp.sh
```

where the names of the LLMs to evaluate are set on Line 3 of `script/get_baseline_fp.sh`. The results will be saved in the `script/log` directory.