A well-structured summarization dataset for the Persian language, consisting of 93,207 records. It is prepared for Abstractive/Extractive summarization tasks (like cnn_dailymail for English) and can also be used for other tasks such as Text Generation, Title Generation, and News Category Classification. Moreover, we have tested this dataset with novel models and techniques.
Follow the rest of the repo for more detail.
Paper link: arXiv:2012.11204
Natural Language Processing (NLP) is a field of AI that focuses on processing textual information in order to make it comprehensible to computers. With the emergence of Deep Learning (DL), numerous DL-based models and architectures have been proposed for different NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis (SA), and Question Answering (QA). One of the most recent and most popular approaches to these tasks is the use of pre-trained language models. Pre-trained language models are essentially huge neural networks, typically based on the Transformer architecture, that have been trained on an enormous text corpus. A few examples include the BERT and T5 models. BERT is an encoder-only model that uses a Masked Language Model (MLM) objective to jointly condition on both the left and right context. T5 is a Sequence-to-Sequence (Seq2Seq) framework that casts NLP tasks into a text-to-text format. However, regardless of the architecture, any pre-trained model has to be fine-tuned for a given NLP task using an appropriate dataset.
There are numerous NLP datasets available for different tasks, especially for the English language. Some tasks, however, have been less fortunate regarding the amount of textual data available, and this shortage is even more tangible in languages other than English. One of the NLP tasks that could highly benefit from more comprehensive and well-structured datasets is text summarization. Text summarization is a text generation problem that can be viewed as a Seq2Seq mapping. The most challenging issue in text summarization is to retain as much information as possible while compressing the original text into a very compact form. Decoder-only or encoder-decoder models can be used to address this task, provided that the necessary dataset is available to either train or fine-tune them.
In this repository, a novel and well-structured dataset for Persian text summarization (pn-summary) is presented. In the next few sections, we first introduce the statistical features of this dataset. Then, we present the evaluation metrics that can be used to measure the performance of any model trained on this dataset. The results obtained from two different models in terms of these metrics are also presented.
The pn-summary dataset comprises numerous articles from various categories that have been crawled from six news agency websites. Each document (article) includes the long original text as well as a human-generated summary. The number of articles per news agency is depicted in the figure below. The total number of cleaned articles is 93,207 (out of 200,000 crawled news articles).
Figure 1: Number of articles per news agency.
This dataset includes 18 different article categories, from economy to tourism. The categories with the highest and lowest numbers of articles are oil-energy and tourism, respectively. The top five categories are oil-energy, local, economy, international, and society. The distribution of these categories is shown in figure 2.
Figure 2: Category distribution of the pn-summary dataset.
Summaries included in each article have variable lengths. As shown in the next figure, most articles have summaries of around 27 tokens. Summaries of 75 tokens are rare, and almost none reach 100 or more tokens. This shows that the summaries included in this dataset are sufficiently short. The distribution of summary token lengths is depicted in figure 3.
Figure 3: Summary token length distribution of the pn-summary dataset.
A word-cloud of the most frequent words inside pn-summary dataset can be seen in figure 4.
Figure 4: A word-cloud of the most frequent words inside the pn-summary dataset.
You can download the pn-summary dataset from the following table.
In the following table, you can see a few examples from our dataset.
Before getting into this part, please install the gdown and pandas packages.
pip install -qU gdown
pip install -qU pandas
Downloading: Type in your terminal.
# train.csv
gdown https://drive.google.com/uc?id=10tJIalmf6hWRBbQxZeOUJ0SrvN-Pm12N
# dev.csv
gdown https://drive.google.com/uc?id=1_5pejIDMx6O2-HsWceg8zA5A8HvrYctI
# test.csv
gdown https://drive.google.com/uc?id=1D8icpwL9Oti-3EVrlCUnbPJivYGd4J5D
Loading: Type in your notebook or script.
import pandas as pd

# Columns are tab-separated; newlines inside articles and summaries are
# stored as the "[n]" placeholder, so restore real line breaks after loading.
train = pd.read_csv('pn-summary-train.csv', sep="\t")
train["article"] = train["article"].apply(lambda t: t.replace("[n]", "\n"))
train["summary"] = train["summary"].apply(lambda t: t.replace("[n]", "\n"))
print(train.shape)
dev = pd.read_csv('pn-summary-dev.csv', sep="\t")
dev["article"] = dev["article"].apply(lambda t: t.replace("[n]", "\n"))
dev["summary"] = dev["summary"].apply(lambda t: t.replace("[n]", "\n"))
print(dev.shape)
test = pd.read_csv('pn-summary-test.csv', sep="\t")
test["article"] = test["article"].apply(lambda t: t.replace("[n]", "\n"))
test["summary"] = test["summary"].apply(lambda t: t.replace("[n]", "\n"))
print(test.shape)
>>> (82022, 8)
>>> (5592, 8)
>>> (5593, 8)
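As a quick sanity check, the summary-length and category statistics discussed above (figures 2 and 3) can be approximated from the train DataFrame loaded in the snippet above. This is only a minimal sketch: it assumes whitespace tokenization, and the category column name is an assumption that should be verified against train.columns.

# Approximate figure 3: distribution of summary lengths in whitespace tokens.
summary_lengths = train["summary"].str.split().str.len()
print(summary_lengths.describe())

# Approximate figure 2: number of articles per category. The "category"
# column name is an assumption; check train.columns for the exact field name.
print(train.columns.tolist())
if "category" in train.columns:
    print(train["category"].value_counts())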
Downloading: Type in your terminal.
# download pn_summary.zip
gdown https://drive.google.com/uc?id=16OgJ_OrfzUF_i3ftLjFn9kpcyoi7UJeO
# extract pn_summary
unzip pn_summary.zip
Loading: Type in your notebook or script.
import pandas as pd
train = pd.read_csv('pn_summary/train.csv', sep="\t")
train["article"] = train["article"].apply(lambda t: t.replace("[n]", "\n"))
train["summary"] = train["summary"].apply(lambda t: t.replace("[n]", "\n"))
print(train.shape)
dev = pd.read_csv('pn_summary/dev.csv', sep="\t")
dev["article"] = dev["article"].apply(lambda t: t.replace("[n]", "\n"))
dev["summary"] = dev["summary"].apply(lambda t: t.replace("[n]", "\n"))
print(dev.shape)
test = pd.read_csv('pn_summary/test.csv', sep="\t")
test["article"] = test["article"].apply(lambda t: t.replace("[n]", "\n"))
test["summary"] = test["summary"].apply(lambda t: t.replace("[n]", "\n"))
print(test.shape)
>>> (82022, 8)
>>> (5592, 8)
>>> (5593, 8)
First, you need to install the datasets package; use this command in your terminal:
pip install -qU datasets
Then, load the pn_summary dataset using load_dataset:
from datasets import load_dataset
data = load_dataset("pn_summary")
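A short usage sketch for the loaded dataset follows. The split and field names are assumptions based on the dataset card and the CSV columns used above; printing the dataset object shows the exact names.

# Show the available splits and their sizes.
print(data)

# Access one record from the training split; the "article" and "summary"
# fields mirror the CSV columns used earlier in this README.
sample = data["train"][0]
print(sample["summary"])
print(sample["article"][:300])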
Or you can access the whole demonstration using this notebook:
To evaluate the performance of any model trained on the pn-summary dataset, we suggest Google's ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric package. ROUGE is a widely used automatic evaluation metric for text summarization and machine translation. Its metrics compare the generated summary with the reference summary included in the article (document). Therefore, to establish the performance of any text summarization model, one can calculate the scores for these metrics.
In our most recent work, arXiv:2012.11204, which is the first to address Persian text summarization from an abstractive point of view, we have reported the results of fine-tuning two models on this dataset in terms of three ROUGE metrics:
ROUGE does not support the Persian language by default. Therefore, we have also created an extension of these metrics that supports Persian. This extension is available here.
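For illustration, the snippet below shows the general shape of computing ROUGE scores in Python with the rouge-score package (installable via pip install rouge-score). This is only a minimal sketch; as noted above, the stock package does not handle Persian text out of the box, so the Persian extension linked above should be used for actual evaluation.

from rouge_score import rouge_scorer

# Compare a generated summary against the reference summary from the dataset.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

reference = "..."   # human-written summary taken from pn-summary
prediction = "..."  # summary generated by the fine-tuned model

scores = scorer.score(reference, prediction)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.4f} recall={score.recall:.4f} f1={score.fmeasure:.4f}")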
The models proposed for Persian summary generation in our work are mT5 (a multilingual version of the T5 model) and a BERT2BERT structure warm-started with the ParsBERT model's weights. This is the first work to use the pn-summary dataset, so the results reported here can serve as a baseline for any future work in this field that uses the dataset. The results obtained by these models on the pn-summary dataset are presented in the table below:
Version 1.0
Version 2.0
As can be seen from the table above, the ParsBERT-based BERT2BERT model outperforms the mT5 model. This may be because ParsBERT, unlike mT5, is a monolingual BERT model trained exclusively on a vast Persian text corpus, which makes it better able to capture Persian textual information.
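For reference, warm-starting a BERT2BERT model from ParsBERT weights can be sketched with the Hugging Face transformers library as shown below. The checkpoint name and configuration values here are illustrative assumptions, not necessarily the exact setup used in the paper.

from transformers import AutoTokenizer, EncoderDecoderModel

# ParsBERT checkpoint name is an assumption; substitute the checkpoint you use.
checkpoint = "HooshvareLab/bert-base-parsbert-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Warm-start both the encoder and the decoder from the same BERT weights.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)

# Set the special tokens that Seq2Seq generation expects.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id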
After the models are fine-tuned on the pn-summary dataset, a decoding strategy must be applied to actually generate summaries with them. There are different decoding techniques, such as greedy search and beam search. In our work, we have used the beam search method to generate summaries after fine-tuning our models.
The beam search method tries to maximize the overall sequence probability by keeping multiple candidate sequences (beams) and choosing the one with the highest product of conditional next-word probabilities. This prevents highly probable words from being neglected only because they are stuck behind a low-probability word. To prevent beam search from generating sequences with repetitive words, we have used n-gram penalties. The overall beam search configuration used in our work is outlined in the table below. In this table, early stopping indicates whether the beam search algorithm should stop when all beams reach the EOS token.
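Continuing from the model and tokenizer in the previous sketch, beam search decoding with an n-gram repetition penalty and early stopping can be expressed through the generate method of transformers models. The numeric values below are illustrative placeholders, not the exact configuration from the paper.

article_text = "..."  # a Persian news article to summarize (after fine-tuning the model)

inputs = tokenizer(article_text, return_tensors="pt", truncation=True, max_length=512)

summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=4,               # number of candidate sequences (beams) kept at each step
    no_repeat_ngram_size=3,    # n-gram penalty: never repeat the same 3-gram
    early_stopping=True,       # stop once all beams have produced the EOS token
    max_length=128,            # upper bound on the generated summary length
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))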
In this section, we have included a few examples of the results produced by the models presented in our paper. To make these examples more comprehensible, the table below contains both the Persian and English versions of the example texts.
As shown in the table above, the summaries given by the ParsBERT-driven BERT2BERT model are considerably closer to the actual summaries in terms of both meaning and lexical choice.
Please cite the following paper in your publication if you are using our dataset and architectures in your research:
@article{pnSummary,
title={Leveraging ParsBERT and Pretrained mT5 for Persian Abstractive Text Summarization},
author={Mehrdad Farahani and Mohammad Gharachorloo and M. Manthouri},
journal={2021 26th International Computer Conference, Computer Society of Iran (CSICC)},
year={2021},
pages={1-6},
doi={10.1109/CSICC52343.2021.9420563},
}