wang-research-lab / deepstruct

Code repo for ACL22 paper "DeepStruct: Pretraining of Language Models for Structure Prediction"
https://arxiv.org/pdf/2205.10475.pdf
Apache License 2.0
80 stars 4 forks source link

DeepStruct: Pretraining of Language Models for Structure Prediction

Source code repo for paper DeepStruct: Pretraining of Language Models for Structure Prediction, ACL 2022.

Setup Environment

DeepStruct is based on GLM dependency. Please use GLM's docker as follow to setup the basic GPU environment (zxdu20/glm-cuda112 for Ampere GPUs and zxdu20/glm-cuda102 for older version GPUs such as Tesla V100).

git clone --recursive git@github.com:cgraywang/deepstruct.git
cd ./deepstruct

docker run --net=host --privileged --pid=host --gpus all --rm -it --ipc=host -v ./deepstruct:/workspace/deepstruct zxdu20/glm-cuda112
cd /workspace/deepstruct

and install the dependency via setup.sh:

bash setup.sh

The final directory structure should be as follows:

workspace/
├─ deepstruct/
├─ data/
├─ ckpt/

Download Checkpoints

Most of our experiments are based on 10-billion-parameter DeepStruct checkpoint. Run the following shell scripts to download all multi-task trained DeepStruct checkpoints from huggingface hub (might take a while).

bash download_ckpt.sh

Data Preparation & Reproduce

To run following experiments on DeepStruct-10B, our experiments adopt batch_size_per_gpu=1 and require at least 32 GB GPU memory to run. The scripts default use --num-gpus-per-node=1 in src/tasks/mt/*.sh, and if you want to use multiple gpu for acceleration, please customize it in src/tasks/mt/*.sh.

Notice that CoNLL12, CoNLL05 for semantic role labeling, ACE2005 for event extraction require manual download from LDC (LDC2006T06, LDC2013T19, PTB-3).

Task Dataset Data preparation Multi-task Result
Joint entity and relation extraction CoNLL04 bash run_scripts/conll04.sh Ent. 88.4/Rel. 72.8
Joint entity and relation extraction ADE bash run_scripts/ade.sh Ent. 90.5/Rel. 83.6
Joint entity and relation extraction NYT bash run_scripts/nyt.sh Ent. 95.4/Rel. 93.7
Joint entity and relation extraction ACE2005 bash run_scripts/ace2005_jer.sh <abs_path_to_LDC2006T06> Ent. 90.2/Rel. 58.9
Semantic role labeling CoNLL05 WSJ bash run_scripts/conll05_srl_wsj.sh <abs_path_to_PTB_3> 95.5
Semantic role labeling CoNLL05 Brown bash run_scripts/conll05_srl_brown.sh <abs_path_to_PTB_3> 92.0
Semantic role labeling CoNLL12 bash run_scripts/conll12_srl.sh <abs_path_to_LDC2013T19> 97.2
Event extraction ACE2005 bash run_scripts/ace2005event.sh <abs_path_to_LDC2006T06> Trigger: Id-72.7/Cl-69.2 Argument: Id-67.5/Cl-63.9
Intent detection ATIS bash run_scripts/atis.sh 97.3
Intent detection SNIPS bash run_scripts/snips.sh 97.4
Dialogue state tracking MultiWOZ 2.1 bash run_scripts/multi_woz.sh 53.5

Arguments in running scripts

The arguments in src/tasks/mt/*.sh configure the training and inference of DeepStruct. Here are their meanings:

Scripts for Pretraining

Following the commands below to prepare pretraining data and run training.

# prepare pretraining data
bash data_scripts/PRETRAIN.sh

# run pretraining
cd ./glm/
bash scripts/ds_finetune_seq2seq_pretrain.sh config_tasks/<MODEL_TYPE>.sh config_tasks/pretrain.sh cnn_dm_original

Currently <MODEL_TYPE> supports model_blocklm_10B_pretrain, which refers to the 10 billion pretrained model as backbone.

Please customize NUM_GPUS_PER_WORKER in glm/scripts/ds_finetune_seq2seq_pretrain.sh and train_micro_batch_size_per_gpu in glm/config_tasks/config.json according to your environment, as fine-tuning a 10B language model requires quite sufficient GPU memory. The data preprocessing for pretraining may require over 600G main memory, as the current dataloader implementation preloads all tokenized data into main memory in pretraining.

Citation

@inproceedings{wang-etal-2022-deepstruct,
    title = "{D}eep{S}truct: Pretraining of Language Models for Structure Prediction",
    author = "Wang, Chenguang  and
      Liu, Xiao  and
      Chen, Zui  and
      Hong, Haoyun  and
      Tang, Jie  and
      Song, Dawn",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    year = "2022",
}