heraclex12 / NLP2SPARQL

Translate natural language questions to SPARQL queries and vice versa


SPBERT: A Pre-trained Model for SPARQL Query Language


This repository provides the code for reproducing the experiments in our paper. SPBERT is a BERT-based language model pre-trained on massive SPARQL query logs. It learns general-purpose representations of both natural language and the SPARQL query language, and exploits the sequential order of words, which is crucial for a structured language like SPARQL.

Prerequisites

To reproduce our experiments, please install the dependencies listed in requirements.txt:
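
pip install -r requirements.txt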

Pre-trained models

We release three versions of pre-trained weights. Pre-training was based on the original BERT code provided by Google, and training details are described in our paper. You can download all versions from the table below:

Pre-training objective   Model                       Steps   Link
MLM                      SPBERT (scratch)            200k    🤗 razent/spbert-mlm-zero
MLM                      SPBERT (BERT-initialized)   200k    🤗 razent/spbert-mlm-base
MLM+WSO                  SPBERT (BERT-initialized)   200k    🤗 razent/spbert-mlm-wso-base
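
The checkpoints can also be loaded directly from the Hugging Face Hub. Below is a minimal sketch using the transformers library (the model name is taken from the table above; the toy SPARQL query is only an illustration):

from transformers import AutoTokenizer, AutoModel

# Load the SPBERT tokenizer and weights released above.
tokenizer = AutoTokenizer.from_pretrained("razent/spbert-mlm-zero")
model = AutoModel.from_pretrained("razent/spbert-mlm-zero")

# Encode a toy SPARQL query and inspect the contextual representations.
query = "select distinct ?uri where { ?uri rdf:type dbo:Book }"
inputs = tokenizer(query, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)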

Datasets

All evaluation datasets can be downloaded here.

Example

To fine-tune models:

python run.py \
        --do_train \
        --do_eval \
        --model_type bert \
        --model_architecture bert2bert \
        --encoder_model_name_or_path bert-base-cased \
        --decoder_model_name_or_path sparql-mlm-zero \
        --source en \
        --target sparql \
        --train_filename ./LCQUAD/train \
        --dev_filename ./LCQUAD/dev \
        --output_dir ./ \
        --max_source_length 64 \
        --weight_decay 0.01 \
        --max_target_length 128 \
        --beam_size 10 \
        --train_batch_size 32 \
        --eval_batch_size 32 \
        --learning_rate 5e-5 \
        --save_interval 10 \
        --num_train_epochs 150

To evaluate models:

python run.py \
        --do_test \
        --model_type bert \
        --model_architecture bert2bert \
        --encoder_model_name_or_path bert-base-cased \
        --decoder_model_name_or_path sparql-mlm-zero \
        --source en \
        --target sparql \
        --load_model_path ./checkpoint-best-bleu/pytorch_model.bin \
        --dev_filename ./LCQUAD/dev \
        --test_filename ./LCQUAD/test \
        --output_dir ./ \
        --max_source_length 64 \
        --max_target_length 128 \
        --beam_size 10 \
        --eval_batch_size 32
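
For reference, the bert2bert architecture used above pairs a BERT encoder for the natural language question with SPBERT as the decoder for SPARQL. The sketch below only illustrates that pairing with Hugging Face's generic EncoderDecoderModel; it is not the project's run.py implementation, and the question, token settings, and generation parameters are illustrative assumptions:

import torch
from transformers import AutoTokenizer, EncoderDecoderModel

# Pair a BERT encoder (natural language) with an SPBERT decoder (SPARQL),
# mirroring the bert2bert configuration used in the commands above.
enc_name, dec_name = "bert-base-cased", "razent/spbert-mlm-zero"
encoder_tokenizer = AutoTokenizer.from_pretrained(enc_name)
decoder_tokenizer = AutoTokenizer.from_pretrained(dec_name)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(enc_name, dec_name)

# Generation settings required for a bert2bert model.
model.config.decoder_start_token_id = decoder_tokenizer.cls_token_id
model.config.pad_token_id = decoder_tokenizer.pad_token_id
model.config.eos_token_id = decoder_tokenizer.sep_token_id

# Before fine-tuning the output is meaningless; after training, generate() produces SPARQL.
question = "Who wrote the novel Dune?"
inputs = encoder_tokenizer(question, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    output_ids = model.generate(**inputs, num_beams=10, max_length=128)
print(decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True))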

Contact

Email: heraclex12@gmail.com - Hieu Tran

Citation

@inproceedings{Tran2021SPBERTAE,
  title={SPBERT: An Efficient Pre-training BERT on SPARQL Queries for Question Answering over Knowledge Graphs},
  author={Hieu Tran and Long Phan and James T. Anibal and Binh Thanh Nguyen and Truong-Son Nguyen},
  booktitle={ICONIP},
  year={2021}
}