mayhugotong / GenTKG

This is the official implementation repository of the NAACL Findings paper GenTKG: Generative Forecasting on Temporal Knowledge Graph with Large Language Models (https://arxiv.org/abs/2310.07793).

GenTKG: Generative Forecasting on Temporal Knowledge Graph with Large Language Models

Introduction

This is the official implementation repository of the NAACL Findings paper GenTKG: Generative Forecasting on Temporal Knowledge Graph with Large Language Models.

This work fine-tunes the large language model llama2-7B with peft and uses it for temporal knowledge graph (TKG) forecasting. The training and evaluation data are obtained with TLR, the rule-based history retrieval step, and the model weights trained with few-shot instruction tuning (FIT) are stored on Google Drive.

Setup

Environment

Download the code and go to the folder:

git clone https://github.com/mayhugotong/GenTKG.git
cd GenTKG

Create an environment:

conda create -n gtkg python=3.8
conda activate gtkg
pip install -r requirements.txt 
pip install git+https://github.com/huggingface/peft.git

Download the data and models from Google Drive, then unzip them into the folders "data" and "model".

You can use gdown to do it:

pip install gdown
gdown --fuzzy "https://drive.google.com/file/d/1C63Ugg_Xc1MGgeToiYNM0X4i35CJUEWA/view?usp=sharing"
unzip data.zip -d .
gdown --fuzzy "https://drive.google.com/file/d/1mpzlfKLuh3cHvox8UpP1RkPUHKeN4_eL/view?usp=sharing"
unzip model.zip -d .
gdown --fuzzy "https://drive.google.com/file/d/145avybZXtlTrshVBJ22B6KSJPnK5nQVS/view?usp=sharing"
unzip model_backup.zip -d .
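
If you prefer to script the downloads, the share links above all embed a file id in a `/file/d/<id>/view` path. The following is a minimal sketch for extracting that id; the helper name `drive_file_id` is hypothetical and not part of this repository.

```python
import re


def drive_file_id(share_url: str) -> str:
    """Return the file id embedded in a Google Drive '/file/d/<id>/...' link."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"no Drive file id found in {share_url!r}")
    return match.group(1)


# The data archive link from the commands above.
url = "https://drive.google.com/file/d/1C63Ugg_Xc1MGgeToiYNM0X4i35CJUEWA/view?usp=sharing"
file_id = drive_file_id(url)
print(file_id)
```

How the id is then handed to a downloader depends on the tool and version you use.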

Lexical datasets

Before anything else, you may want to convert the datasets from ids into lexical form (words). For example, for the train file of icews14:

python ./data_utils/id_words.py --file_to_convert ./data/icews14/train.txt --path_output ./data/processed_new/icews14/train.txt --dataset icews14 --period 24

Rules learning parameters:

By default you will create any new datasets of your own in ./data/processed_new/ .
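
To illustrate the idea of the id-to-word conversion, here is a minimal sketch. It is not the logic of id_words.py. It assumes each fact is a whitespace-separated quadruple of ids (subject, relation, object, timestamp) and that entity2id.json and relation2id.json map names to integer ids. The function names and the toy entries are hypothetical, and the timestamp is passed through unchanged because the conversion controlled by --period is not described here.

```python
def invert(mapping):
    """Turn a name -> id dictionary into an id -> name dictionary."""
    return {idx: name for name, idx in mapping.items()}


def ids_to_words(line, id2entity, id2relation):
    """Replace the subject, relation and object ids of one fact by their names."""
    subj, rel, obj, ts = line.split()[:4]
    return "\t".join(
        [id2entity[int(subj)], id2relation[int(rel)], id2entity[int(obj)], ts]
    )


# Toy dictionaries standing in for entity2id.json and relation2id.json.
entity2id = {"Alice": 0, "Bob": 1}
relation2id = {"Consult": 0}

converted = ids_to_words("0\t0\t1\t24", invert(entity2id), invert(relation2id))
print(converted)
```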

Rules learning

The rules learning part is adapted from the TLogic rules learning code. It runs on lexical datasets (although it simply converts them back into ids). By default it only reads datasets in ./data/ and not in ./data/processed_new/ . You can produce rule banks other than the provided ones by running, e.g. for icews14:

cd data_utils/rules_learning
python3 learn.py -d icews14 -l 1 2 3 -n 200 -p 15 -s 12

Rules learning parameters:

You will get a rule bank file similar to "060723022344_r[1,2,3]_n200_exp_s12_rules.json" under the ./output/ folder.

History retrieving

Find the file name of the rule bank JSON (in ./output) and run the following from the GenTKG folder:

cd GenTKG
python3 ./data_utils/retrieve.py --name_of_rules_file name_rules.json --dataset icews14

An example for icews18 would be like:

python ./data_utils/retrieve.py --name_of_rules_file '060723022344_r[1]_n200_exp_s12_rules.json' --dataset icews18
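
Since rule bank file names start with a timestamp and contain brackets, looking them up by hand is error-prone. The following is a small sketch for picking the most recently written rule bank in a folder. The helper `latest_rule_bank` is hypothetical and not part of this repository; it only assumes that rule banks end in `_rules.json`, as in the examples above.

```python
from pathlib import Path


def latest_rule_bank(output_dir="./output"):
    """Return the file name of the most recently modified *_rules.json file."""
    candidates = sorted(
        Path(output_dir).glob("*_rules.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no *_rules.json file found in {output_dir}")
    return candidates[-1].name
```

The returned name can be passed to --name_of_rules_file. Remember to quote it in the shell, because the square brackets are glob characters.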

By default, the following files are created:

For training, you need to convert the history_facts files into a LoRA JSON file:

python3 ./data_utils/create_json_train.py --dir_of_trainset 'the_full_trainset_to_convert (see [A])' --dir_of_answers 'the_test_answers (see [B])' --dir_of_entities2id 'the_json_of_entities2id (see [C])' --path_save 'better_the_same_as_the_trainset_before_converting'

An example for icews18 would be like:

python ./data_utils/create_json_train.py --dir_of_trainset './data/processed_new/icews18/train/history_facts/history_facts_icews18.txt' --dir_of_answers './data/processed_new/icews18/train/test_answers/test_answers_icews18.txt' --dir_of_entities2id './data/icews18/entity2id.json' --path_save './data/processed_new/icews18/train'

Training

Basic training:

python3 main.py --OUTPUT_DIR "your_output_directory" --DATA_PATH "path_of_dataset_file"

Example for training:

python3 main.py --OUTPUT_DIR "./model/output_model_icews14_1024" --DATA_PATH "./data/processed/train/icews14/icews14_1024.json"

Training parameters (in config.py):

If you want to use a logging platform such as WandB, you may need these:

Test

Basic test:

python3 inference.py --LORA_CHECKPOINT_DIR "path of model checkpoint" --output_file "your output directory" --input_file "path of history_facts file" --test_ans_file "path of test_answers file"

Example for testing:

python3 inference.py --LORA_CHECKPOINT_DIR "./model/icews14" --output_file "./results/prediction_icews14.txt" --input_file "./data/processed/eval/history_facts/history_facts_icews14.txt" --test_ans_file "./data/processed/eval/test_answers/test_ans_icews14.csv"

Testing parameters (in eval_utils.py):

If you want to begin from a certain i-th question (like resuming):

File structure

Repository

The repository contains the code for TLR, for few-shot instruction tuning of llama2, and for inference. Learned rule banks are also provided here:

Root
|--data_utils/
    |--rules_learning/ (code from [TLogic](https://github.com/liu-yushan/TLogic))
    |--basic.py (utils for data reading/writing etc.)
    |--create_json_train.py (convert dataset into LoRA JSON format)
    |--id_words.py (convert between id and lexical entities, relations and timestamps)
    |--retrieve.py (data reading/writing and so on for retrieving)
    |--TLR.py (retrieve history according to rules)
|--llama2_ori_repo/ (in-context learning code for llama2; imported in evaler.py)
|--minimal20b/ (in-context learning code for gpt-neox; imported in evaler.py)
|--output/ (contains rule banks from TLogic rules learning)
|--results/ (stores inference results; empty)
|--config.py
|--eval_utils.py
|--evaler.py
|--inference.py (inference)
|--main.py (training)
|--neox.py (gpt-neox inference)
|--utils.py

Datasets

The structure should be similar to this:

Datasets
|--processed/
    |--train/ (trainsets for GenTKG; JSON files)
        |--icews14/
            |--icews14.json (full set)
            |--icews14_16.json (sampled set)
            ...
            |--icews14_1024.json (sampled set)
        |--icews18/
        ...
    |--eval/
        |--history_facts/
            |--history_facts_icews14.txt
            |--history_facts_icews18.txt
            |--history_facts_GDELT.txt
            |--history_facts_YAGO.txt
        |--test_answers/
            |--test_ans_icews14.csv
            |--test_ans_icews18.csv
            |--test_ans_GDELT.txt
            |--test_ans_YAGO.txt
|--original/ (original datasets mainly for rule-based models)
    |--icews14/
        |--all_facts.txt
        |--train.txt
        |--valid.txt
        |--test.txt
        |--stat.txt
        |--entity2id.json (JSON as dictionary format; for GenTKG) [C]
        |--relation2id.json (JSON as dictionary format; for GenTKG)
        |--ts2id.json (JSON as dictionary format; for GenTKG)
    |--icews18/
        |--all_facts.txt
        |--train.txt
        |--valid.txt
        |--test.txt
        |--stat.txt
        |--entity2id.json
        |--relation2id.json
        |--ts2id.json
    ...

Data format

Training json

{"context":question1, "target":answer1}{"context":question2, "target":answer2}...
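
As a reading aid, the sketch below parses a file laid out as shown above, i.e. JSON objects written one after another with optional whitespace in between. This layout is taken literally from the schematic line and is an assumption about the actual files; the helper `iter_json_objects` is hypothetical and not part of this repository.

```python
import json


def iter_json_objects(text):
    """Yield consecutive JSON objects from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Placeholder samples in the context/target layout described above.
sample = '{"context": "question1", "target": "answer1"}{"context": "question2", "target": "answer2"}\n'
records = list(iter_json_objects(sample))
print(len(records))
```

Because whitespace between objects is skipped, the same helper also reads files with one object per line.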

Files for testing

The file format is as follows:

history_facts:

history1.1
history1.2
history1.3
...
query1

history2.1
history2.2
history2.3
...
query2

...
...

test_ans:

query_answer1
query_answer2
query_answer3
...
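
The two layouts above can be paired into context/target samples along the following lines. This is a minimal sketch, not the logic of create_json_train.py. It assumes that blocks in history_facts are separated by blank lines, that the last line of each block is the query, and that the i-th line of test_ans answers the i-th query. The function names are hypothetical.

```python
def parse_history_blocks(text):
    """Split history_facts text into (history_lines, query) pairs."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    # The last line of every block is the query, the rest is its history.
    return [(block[:-1], block[-1]) for block in blocks]


def build_records(history_text, answers_text):
    """Pair every query block with its answer as a context/target dictionary."""
    samples = parse_history_blocks(history_text)
    answers = [a.strip() for a in answers_text.splitlines() if a.strip()]
    if len(samples) != len(answers):
        raise ValueError("number of queries and answers differ")
    return [
        {"context": "\n".join(history + [query]), "target": answer}
        for (history, query), answer in zip(samples, answers)
    ]


# Placeholder content following the schematic layout above.
history_text = "history1.1\nhistory1.2\nquery1\n\nhistory2.1\nquery2\n"
answers_text = "query_answer1\nquery_answer2\n"
records = build_records(history_text, answers_text)
print(records[0]["context"])
```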

Reference

liu-yushan/TLogic: Temporal Logical Rules for Explainable Link Forecasting on Temporal Knowledge Graphs (github.com)

Fine_Tuning_LLama | Kaggle

FreddyBanana/ChatGLM2-LoRA-Trainer: Simple 4-bit/8-bit LoRA fine-tuning for ChatGLM2 with peft and transformers.Trainer. (github.com)

mymusise/ChatGLM-Tuning: An affordable chatgpt implementation solution, based on ChatGLM-6B + LoRA (github.com)

Citation

Please cite our work as follows if you find it helpful.

@inproceedings{liao2024gentkg,
  title={GenTKG: Generative Forecasting on Temporal Knowledge Graph with Large Language Models},
  author={Liao, Ruotong and Jia, Xu and Li, Yangzhe and Ma, Yunpu and Tresp, Volker},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},
  pages={4303--4317},
  year={2024}
}