This is the repository for the BART summarization tool discussed in the paper Question-Driven Summarization of Answers to Consumer Health Questions.
These installation instructions assume Python >= 3.6 and that Anaconda or Miniconda is installed.
To install anaconda, see instructions here https://docs.anaconda.com/anaconda/install/
EZ-BART will run on either Windows or Linux, but the installation for anaconda varies per platform.
For Windows, you can download the installer here and follow the prompts.
Note that this install works only for running inference on a CPU. For training on a GPU, see below. Inference can also be run on a GPU, which is faster; see the training instructions for the GPU install.
git clone https://github.com/saverymax/EZ-BART.git
cd EZ-BART
conda create --name ez_bart python=3.7 pytorch torchvision cpuonly -c pytorch
conda activate ez_bart
pip install -r requirements.txt
cd bart
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
You can use BART to generate summaries with or without a question guiding the content. For question-driven summarization, you have two options. If you have a single question you want to use for a set of documents, you can provide your data as shown below, and specify your question with the --question flag (see flag descriptions).
{
"<UNIQUE ARTICLE ID 1>": "This is the text of the article to be summarized",
...
"<UNIQUE ARTICLE ID n>": "This is the text of another article to be summarized",
}
data_processing/data/sample_data.json contains an example of this data structure.
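As a sketch, a file in this format can be written with Python's standard json module; the article IDs, texts, and the output file name below are placeholders:

```python
import json

# Placeholder article IDs and texts; replace with your own documents.
articles = {
    "article_001": "This is the text of the article to be summarized",
    "article_002": "This is the text of another article to be summarized",
}

with open("my_data.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=4)
```

The resulting file can then be passed to the --data flag.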
Alternatively, if you have question-document pairs in which each question is unique to its document, you can use the --q-per-doc option instead and provide your data in the format below:
{
"<UNIQUE ARTICLE ID 1>":
{
"query": "Text of question for a single document",
"document": "This is the text of the article to be summarized"
},
...
"<UNIQUE ARTICLE ID n>":
{
"query": "Text of question for another document",
"document": "This is the text of another article to be summarized"
}
}
The keys in the json you provide must be 'query' and 'document'. See the provided datasets in the directories in data_processing/data for examples.
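Before running inference, it can help to verify that every entry has the expected keys. A minimal check using only the standard library (the function name and file path are ours, not part of the tool):

```python
import json

def check_q_per_doc(path):
    """Verify every entry has the 'query' and 'document' keys
    expected by the --q-per-doc option."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for article_id, entry in data.items():
        missing = {"query", "document"} - set(entry)
        if missing:
            raise ValueError(f"{article_id} is missing keys: {missing}")
    return len(data)

# e.g. check_q_per_doc("data_processing/data/my_qpd_data.json")
```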
To run the model, first download the fine-tuned BART weights from https://bionlp.nlm.nih.gov/bart_finetuned_checkpoint.zip. Once they are unzipped, you can specify the path to them as shown in the example below. Once your data is in the correct format for your use-case and you have downloaded the model, you are ready to summarize! Activate your environment if it is not already activated
conda activate ez_bart
And then from the base directory of this repository (EZ-BART), you can try running
python -m bart.run_inference --question="Do I have COVID-19?" --prediction_file=bart/predictions/bart_summs.json --model_path=bart/checkpoints_bioasq_with_question --data=data_processing/data/sample_data.json --model_config=bart/bart_config/with_question/bart-bin
Or for question-document pairs:
python -m bart.run_inference --q-per-doc --prediction_file=bart/predictions/bart_summs.json --model_path=bart/checkpoints_bioasq_with_question --data=data_processing/data/sample_data.json --model_config=bart/bart_config/with_question/bart-bin
This example assumes you have stored the downloaded weights (checkpoints_bioasq_with_question/checkpoint_best.pt) in the EZ-BART/bart directory, but you can store the weights wherever is convenient for your system.
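After inference, the predictions file can be inspected with standard json tooling. The sketch below assumes the output is a JSON object mapping article IDs to summary strings; check your actual file, since the exact structure is determined by the run_inference script:

```python
import json

def load_summaries(prediction_path):
    # Assumption: the prediction file is a JSON object keyed by
    # article ID; adjust if your run_inference output differs.
    with open(prediction_path, encoding="utf-8") as f:
        return json.load(f)

# e.g.
# summaries = load_summaries("bart/predictions/bart_summs.json")
# for article_id, summary in summaries.items():
#     print(article_id, summary[:80])
```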
--question
The question you would like to drive the content of the summary.
--q-per-doc
The --question flag above sets a single question for all source documents you provide. If you instead have question-document pairs, provide the data as specified above and use this option. Note that the --question flag has no effect when --q-per-doc is used.
--prediction_file
The path of the file you would like the summaries saved to.
--model_path
The path to the downloaded BART weights. Include only the path to the directory the model is stored in, not the .pt file itself.
--data
The path to the json with the data you would like to be summarized.
--model_config
The path to the configuration for the model. If using the fine-tuned weights, this is optional to include since the default specifies the correct path. The path will need to be adjusted if you are fine-tuning the weights yourself.
Should you want to use weights fine-tuned on data other than BioASQ, the instructions here demonstrate how to retrain the BART model with your own data, or on a selection of biomedical datasets included in this repository, described in the datasets section.
A GPU with 32gb of VRAM is required to train BART. This memory requirement assumes a maximum source-document sequence length of 1024 subword tokens and a batch size of 8. You can feasibly use a 16gb GPU with a smaller maximum sequence length instead; with a maximum sequence length of 512, performance degradation should be limited.
Let's create a new environment for training:
conda create --name ez_bart_train python=3.7 pytorch torchvision cudatoolkit -c pytorch
conda activate ez_bart_train
pip install -r requirements.txt
cd bart
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
Note that the GPU version of pytorch will be installed, along with the CUDA toolkit. It is also possible to install Nvidia's Apex library for mixed-precision training, but as this install depends heavily on your individual system, we do not include it in the instructions here. See https://github.com/saverymax/qdriven-chiqa-summarization or https://github.com/pytorch/fairseq for instructions that might work for your system.
You will also need to download the original (i.e., not fine-tuned on BioASQ) BART large weights
wget -P bart/ https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz
tar -C bart/ -xzvf bart/bart.large.tar.gz
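If wget or tar is unavailable (e.g. on Windows), the same download-and-unpack step can be sketched in Python with only the standard library; the function name is ours, and the URL to pass is the one from the commands above:

```python
import tarfile
import urllib.request
from pathlib import Path

def fetch_bart_weights(url, dest_dir="bart"):
    """Download a .tar.gz archive of model weights and unpack it
    into dest_dir (a stdlib alternative to wget + tar)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / Path(url).name
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return dest

# e.g. fetch_bart_weights("https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz")
```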
Using your own training data is encouraged, but because of the unique content and structure of each dataset, it may require custom processing on your end. Once your dataset has been formatted in the following structure, you can train with it after one more fairseq-specific processing step.
First make sure your data is in the correct json format, with "query", "document", and "summary" keys.
{
"example_id_1": {"query": "query text", "document": "document to be summarized", "summary": "summary of the document"},
...
"example_id_n": {"query": "query text", "document": "document to be summarized", "summary": "summary of the document"}
}
It is not necessary for the value of the query key to be a question per se; you may use a topic or keyword to focus the content of the summary. It is also possible to omit the query key from the data, but in that case make sure to leave out the --add_q option in the processing command. Additionally, you will need to remove all newlines ("\n") from your data (for the tokenization step using fairseq).
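A small helper can enforce both requirements at once, checking the keys and stripping newlines; this is a sketch (the function name and file paths are ours), with key names following the example above:

```python
import json

REQUIRED_KEYS = {"query", "document", "summary"}

def clean_training_data(in_path, out_path, require_query=True):
    """Verify required keys and strip newlines from all text fields.

    Set require_query=False if your data omits the query key (and you
    drop the --add_q flag during processing).
    """
    required = REQUIRED_KEYS if require_query else REQUIRED_KEYS - {"query"}
    with open(in_path, encoding="utf-8") as f:
        data = json.load(f)
    for example_id, entry in data.items():
        missing = required - set(entry)
        if missing:
            raise ValueError(f"{example_id} missing keys: {missing}")
        for key, value in entry.items():
            # fairseq's tokenization step requires newline-free text.
            entry[key] = value.replace("\n", " ")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
```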
For further information regarding processing your dataset, a working example of processing the BioASQ and MedInfo data into the correct format is included in the data_processing directory, in process_bioasq.py and process_medinfo.py.
To prepare data files for BART, run the following. This formats the data so that fairseq can process it efficiently during training. Note that we have provided a few pre-formatted datasets that you may use; see the datasets section below for their descriptions. For example, to train with the BioASQ data (the data used to fine-tune the weights we provide), you can run the following command. To use your own data, replace the --train_path and --val_path arguments with your training and validation data, assuming they have been formatted correctly, and specify a config path. The prepare_training_data.py script will format the data for BART and place the processed data into the expected directories.
python -m data_processing.prepare_training_data --train_path=data_processing/data/bioasq/bioasq_collection.json --val_path=data_processing/data/medinfo/medinfo_section2answer_validation_data.json --config_path=/path/to/config --add_q
Specify the directory you want the config to be saved in; the script will create the path if the directory does not exist. After running the python command, one more processing step is necessary for tokenization, which is handled by the fairseq library. Using bash, run
bash data_processing/make_bart_data.sh /path/to/config/used/above
This will tokenize and prepare the training and validation data for the model to efficiently process. The script will download encoder.json, vocab.bpe, and dict.txt from a Facebook cloud service. It will also create the bart-bin directory within your config directory, and you will have to include the bart-bin directory when you specify the config for training and inference.
Now you are ready to train! Using bash,
BART_CONFIG=path/to/your/config
CHECKPOINT_DIR=path/to/save/checkpoints
CUDA_VISIBLE_DEVICES=0 python -m bart.train $BART_CONFIG/bart-bin \
--save-dir=$CHECKPOINT_DIR \
--restore-file bart/bart.large/model.pt \
--max-tokens 1024 \
--truncate-source \
--task translation \
--source-lang source --target-lang target \
--layernorm-embedding \
--share-all-embeddings \
--share-decoder-input-output-embed \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--arch bart_large \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
--clip-norm 0.1 \
--lr-scheduler polynomial_decay --lr 3e-05 --total-num-update 20000 --warmup-updates 500 \
--update-freq 16 \
--skip-invalid-size-inputs-valid-test \
--ddp-backend=no_c10d \
--keep-last-epochs=1 \
--find-unused-parameters;
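One non-obvious interaction in the command above: fairseq builds each batch by token budget (--max-tokens 1024), so the number of sequences per forward pass varies with document length, and --update-freq 16 then accumulates gradients over 16 such batches before each optimizer step. A quick sketch of the resulting effective batch size:

```python
# fairseq batches dynamically by token count (--max-tokens), so the number
# of sequences per forward pass varies; --update-freq multiplies it via
# gradient accumulation.
def effective_batch(sequences_per_step, update_freq=16):
    """Sequences contributing to each optimizer update."""
    return sequences_per_step * update_freq

# e.g. if token-based batching yields 4 sequences per forward pass:
print(effective_batch(4))  # -> 64 sequences per parameter update
```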
If you have an environment suitable for mixed precision training, include the --fp16 option as well.
Once your model is trained, you can use it for inference as described in the first section. Just make sure to specify the path to your new weights and config, including the /bart-bin directory as the final directory in the config path. Also note that inference is done on the CPU by default; in a GPU environment, you can add bart.cuda() on the line before bart.eval() in the inference script to run on the GPU.
The datasets included in this repository are described below. They can be used to train medical summarization systems.
Happy summarization!