DeepKPG

News

Introduction

We provide support for a range of Deep Keyphrase Generation and Extraction methods with Pre-trained Language Models (PLMs). This repository contains the code for two papers:

The methods and models we cover are as follows:

For semantic-based evaluation, please refer to KPEval.

If you find this work helpful, please consider citing

@article{wu2022pretrainedKPG,
  title   = {Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study},
  author  = {Wu, Di and Ahmad, Wasi Uddin and Chang, Kai-Wei},
  journal = {arXiv preprint arXiv:2212.10233},
  year    = {2022},
  doi     = {10.48550/arXiv.2212.10233},
  url     = {https://arxiv.org/abs/2212.10233}
}

Getting Started

This project requires a GPU environment with CUDA. We recommend following the steps below to set up the environment.

Set up a conda environment

conda create --name deepkpg python==3.8.13
conda activate deepkpg

Install the packages

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install --upgrade -r requirements.txt

Install apex (optional):

git clone https://github.com/NVIDIA/apex
cd apex
export CXX=g++
export CUDA_HOME=/usr/local/cuda-11.3
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
cd ..
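
To verify the installation, a quick optional sanity check (not part of the original setup instructions) confirms that PyTorch was built with CUDA support and can see a GPU:

# Optional sanity check for the GPU environment.
import torch

print("PyTorch version:", torch.__version__)         # expected: 1.12.1
print("CUDA available:", torch.cuda.is_available())   # should be True on a GPU machine
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))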

Data Preparation

We support a range of keyphrase datasets. Each dataset can be prepared by running run.sh in its folder. Detailed instructions and an introduction to the datasets can be found here.

Keyphrase Extraction via Sequence Tagging

Dependencies

After setting up the PyTorch environment as described above, you can set up the environment for sequence tagging by running the command below. The main difference is that it pins transformers==3.0.2.

pip install -r sequence_tagging/requirements.txt

Training a PLM-based keyphrase extraction model
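
For intuition, the sketch below (not the repository's training script) shows the underlying formulation: keyphrase extraction is cast as token-level tagging with a token-classification head on an encoder-only PLM. The BIO label set and the bert-base-uncased checkpoint are assumptions for illustration, and a recent transformers version (4.x) is assumed:

# Minimal sketch: keyphrase extraction as sequence tagging (BIO scheme).
# The label set and checkpoint are illustrative, not the repository's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-KP", "I-KP"]                      # assumed BIO label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

text = "Deep keyphrase generation with pre-trained language models."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0]                 # one tag per subword token
print([labels[i] for i in pred_ids.tolist()])       # untrained head: tags are random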

BERT2BERT

Dependencies

After setting up the PyTorch environment as described above, you can set up the environment for BERT2BERT-style sequence generation by running the command below. The main differences are transformers==4.2.1, datasets==1.1.1, and accelerate==0.10.0.

pip install -r sequence_generation/bert2bert/requirements.txt

Training a BERT2BERT style keyphrase generation model
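
As background, a BERT2BERT model pairs two encoder-only checkpoints into an encoder-decoder through transformers' EncoderDecoderModel. The sketch below only illustrates model construction and generation with a recent transformers version; the checkpoint choice is an assumption, and the output is not meaningful before fine-tuning:

# Sketch: build a BERT2BERT encoder-decoder from two BERT checkpoints.
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Generation needs the decoder's special token ids.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("An input document about neural keyphrase generation.",
                   return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_length=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))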

UniLM

Dependencies

After setting up the PyTorch environment as described above, you can set up the environment for UniLM-style sequence generation by running the command below. The main difference is that it pins transformers==3.0.2.

pip install -r sequence_generation/unilm/requirements.txt

Training a UniLM style keyphrase generation model with encoder-only PLMs
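
Conceptually, UniLM-style generation reuses a single encoder-only PLM for sequence-to-sequence learning by applying a partially causal self-attention mask: source tokens attend bidirectionally within the source, while target tokens attend to the whole source and only to earlier target tokens. The sketch below builds such a mask in plain PyTorch and is independent of the repository's implementation:

# Sketch: UniLM-style seq2seq self-attention mask (True = attention allowed).
import torch

def unilm_seq2seq_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :src_len] = True                        # every position sees the source
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
    mask[src_len:, src_len:] = causal               # causal attention within the target
    return mask

# Example: 3 source tokens, 2 target tokens.
print(unilm_seq2seq_mask(3, 2).int())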

Keyphrase Generation with sequence-to-sequence PLMs

Dependencies

After setting up the PyTorch environment as described in the steps above, you can directly run sequence generation experiments in the sequence_generation/seq2seq folder.

Training, Inference, and Evaluation
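
As a rough outline of what inference looks like with a seq2seq PLM (the repository's scripts handle the actual training, inference, and evaluation), keyphrase generation is cast as generating a delimited string of keyphrases from the input document. The facebook/bart-base checkpoint and the ";" separator below are assumptions for illustration, not necessarily the repository's conventions:

# Sketch: keyphrase generation as seq2seq text generation with BART.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

document = ("Title of a scientific paper. Abstract text describing "
            "the proposed method and the experimental results.")
inputs = tokenizer(document, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=64, num_beams=4)

# A fine-tuned model would emit keyphrases joined by a separator such as ";".
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
keyphrases = [kp.strip() for kp in prediction.split(";") if kp.strip()]
print(keyphrases)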

SciBART

We pre-train BART-base and BART-large from scratch using paper titles and abstracts from the scientific corpus S2ORC. The pre-training was done with fairseq, and the models were converted to the Hugging Face format and released as uclanlp/scibart-base and uclanlp/scibart-large.

As we train a new vocabulary from scratch on the S2ORC corpus using SentencePiece, SciBART is incompatible with the original BartTokenizer. We are submitting a pull request to Hugging Face to include our new tokenizer. For now, to use SciBART, you can clone and install transformers from our own branch:

git clone https://github.com/xiaowu0162/transformers.git -b scibart-integration
cd transformers
pip install -e .

Then, you may use the model as usual:

from transformers import BartForConditionalGeneration, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('uclanlp/scibart-large')
model = BartForConditionalGeneration.from_pretrained('uclanlp/scibart-large')
print(tokenizer.batch_decode(model.generate(**tokenizer('This is an example of a <mask> computer.', return_tensors='pt'))))

NewsBART

We continue pre-training Facebook's BART-base on the RealNews dataset without changing the vocabulary. More details regarding the pre-training can be found in our paper. The model is released on the Hugging Face model hub. Fine-tuning it for keyphrase generation is supported in sequence_generation/seq2seq.

NewsBERT

We continue pre-training bert-base-uncased on the RealNews dataset without changing the vocabulary. More details regarding the pre-training can be found in our paper. The model is released on the Hugging Face model hub. Fine-tuning it for keyphrase extraction or generation is fully supported in sequence_tagging, sequence_generation/unilm, and sequence_generation/bert2bert.