The official repository for the paper "LLMaAA: Making Large Language Models as Active Annotators".
openai 0.27.4
numpy 1.24.2
torch 1.10.0+cu111
tokenizers 0.10.3
transformers 4.12.4
sentence-transformers 2.2.2
kmeans-pytorch 0.3
func_timeout
ujson
tqdm
Disclaimer: Since I (the first author) lost access to the Azure OpenAI service after my internship at Langboat, I haven't tested the code recently. Unfortunately, I cannot guarantee that the code runs bug-free without modification, so please treat this repository as a reference implementation.
Set up the OpenAI config at ~/openai_config.json. We use the Azure GPT API in our experiments, so by default you need to provide the key and base for the OpenAI service.
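As an illustration of the config step above, here is a minimal sketch of reading such a file and checking it before any API call. The field names "key" and "base" and the helper name load_openai_config are assumptions made for this example; the actual schema expected by the repository may differ, so check the loading code under ~/src/ before relying on it.

```python
import json
from pathlib import Path


def load_openai_config(path):
    """Read a JSON config file and verify the assumed credential fields exist.

    NOTE: "key" and "base" are hypothetical field names chosen for this
    sketch; they are not confirmed by the repository.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [name for name in ("key", "base") if name not in config]
    if missing:
        raise KeyError(f"missing config fields: {missing}")
    return config
```

With openai 0.27.x, the loaded values would typically be assigned to openai.api_key and openai.api_base, with openai.api_type set to "azure" and openai.api_version set to the version your Azure deployment supports.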
Download data to ~/data/. See ~/data/README.md in the directory for details.
For active annotation (LLMaAA), see ~/src/demo_retrieval.py and ~/src/active_annotate.py. Because the demo indices are static, previous annotation results are stored in an auto-generated cache file.
For data generation (ZeroGen/FewGen), see ~/src/data_gen.py.
For testing prompting (Prompt) performance directly, see ~/src/llm_test.py. You may experience timeouts, rate limits, etc.
Datasets are described by meta.json in the data directory and configs/{dataset}.json for the annotator/generator. The configs folder can be found in ~/src/data_synth/ (for generator) and ~/src/llm_annotator (for annotator).
See also ~/src/demo_retrieval.py.
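Because API calls in the scripts above may hit timeouts or rate limits, one common mitigation is to wrap each call in a retry loop with exponential backoff. The following is a minimal sketch, not the repository's implementation; the helper name call_with_retry and all of its parameters are hypothetical.

```python
import random
import time


def call_with_retry(fn, max_attempts=5, base_delay=1.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call fn() and retry with exponential backoff plus jitter on failure.

    Hypothetical helper for illustration only. The last failure is re-raised
    once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay * 2^attempt, plus random jitter so that
            # concurrent clients do not retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

In practice, retry_on would be narrowed to the transient errors of the client in use (for openai 0.27.x, exceptions such as openai.error.RateLimitError and openai.error.Timeout), so that genuine bugs are not silently retried.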
@inproceedings{zhang-etal-2023-llmaaa,
title = "{LLM}a{AA}: Making Large Language Models as Active Annotators",
author = "Zhang, Ruoyu and Li, Yanzeng and Ma, Yongliang and Zhou, Ming and Zou, Lei",
editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.872",
doi = "10.18653/v1/2023.findings-emnlp.872",
pages = "13088--13103",
}