There are many achievements in natural language processing for English and even for other languages, but in the case of Persian, as you can see, there are not many. The lack of data resources, the non-disclosure of sources by research groups, and a decentralized research community are likely the main reasons for Persian's current state.
It should be noted that some research groups do care about this and share their results, ideas, and resources with others, but we still need more.
NLI (also known as recognizing textual entailment) resources are vital for any Persian semantic, extraction, or inference system. I was excited when FarsTail, the first NLI dataset for Persian, was released. I used this dataset to train a Sentence-Transformer model (based on ParsBERT) as a foundation for other applications like semantic search, clustering, information extraction, summarization, topic modeling, and more. Although the model achieved remarkable results on recognizing entailment (81.71% accuracy, compared to the 78.13% reported in the FarsTail paper), it is still not adequate for NLI applications.
I dug into the official Sentence-Transformer paper (Reimers and Gurevych, 2019) and found that it uses the Wikipedia-Triplet-Sections dataset, introduced by Dor et al., 2018, to train SBERT on the recognizing-entailment task. Dor et al., 2018, assume that sentences in the same section are thematically closer than sentences in different sections. They take the anchor (a) and the positive example (p) from the same section, while the negative example (n) comes from a separate section of the same article. They defined the following steps to generate this sentence-triplet dataset (for each rule, I will note whether I apply it or not):
Reimers and Gurevych, 2019 use this dataset with a Triplet Objective to train SBERT.
Eq 1: Triplet Objective Function. Training minimizes the loss max(‖s_a − s_p‖ − ‖s_a − s_n‖ + ε, 0), where s_a, s_p, and s_n are the sentence embeddings of the anchor, positive, and negative examples, ‖·‖ is a distance metric (Euclidean in SBERT), and ε is the margin.
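The triplet objective can be sketched in plain Python; this is a minimal illustration of Eq 1, not the SBERT implementation (which operates on batched tensors):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet objective: push the anchor closer to the positive than to
    the negative by at least `margin` (the epsilon in Eq 1). The loss is
    zero once the separation already exceeds the margin."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# A well-separated triplet incurs zero loss:
triplet_loss([0.0, 0.0], [0.1, 0.0], [5.0, 5.0])  # → 0.0
```

SBERT uses a margin of 1 by default; training over many triplets pulls same-section sentences together and pushes different-section sentences apart.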
Tip: SBERT adds a pooling operation to the output of BERT/RoBERTa to derive a fixed-size sentence embedding. The authors experimented with three pooling strategies: using the output of the CLS token, computing the mean of all output vectors (MEAN), and computing a max-over-time of the output vectors (MAX).
In this case, I use the MEAN strategy.
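Mean pooling averages the token embeddings while ignoring padded positions; a minimal pure-Python sketch (the real model does this with masked tensor operations):

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors where attention_mask == 1 (real tokens),
    skipping padded positions, to obtain one fixed-size sentence vector."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            for i in range(dim):
                total[i] += vec[i]
            count += 1
    return [t / count for t in total]

# Two real tokens and one padded position:
mean_pooling([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])  # → [2.0, 3.0]
```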
In the following parts, I will show how to apply these rules step by step. Before going any further, I noticed that some Wikipedia articles are written entirely in English or other languages, such as "اف_شارپ" (F#) and "سی_شارپ" (C#); these must be removed. So, I added a set of preprocessing steps on top of the rules above.
The preprocessing steps are as follows:
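One of these steps, filtering out non-Persian articles, could be sketched with a script-ratio heuristic. The 0.5 threshold here is my assumption for illustration, not necessarily the exact rule used in the pipeline:

```python
import re

# The Arabic Unicode block covers the Persian alphabet.
PERSIAN_CHAR = re.compile(r'[\u0600-\u06FF]')

def is_mostly_persian(text, threshold=0.5):
    """Heuristic filter: keep an article only if at least `threshold` of its
    non-whitespace characters belong to the Persian/Arabic script. Articles
    like the F#/C# pages, written almost entirely in Latin script, fail."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    persian = sum(1 for c in chars if PERSIAN_CHAR.match(c))
    return persian / len(chars) >= threshold

is_mostly_persian("اف شارپ یک زبان برنامه نویسی است")   # → True
is_mostly_persian("F# is a functional-first language")   # → False
```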
A sample Wikipedia article is shown in Fig 1. The red boxes mark content removed according to the Dor et al., 2018 rules.
Fig 1: Wikipedia article sample, "جان میلینگتون سینگ" (John Millington Synge)
The following figure (Fig 2) shows the article after applying the modified Dor et al., 2018 rules and the preprocessing steps; the result is known as Wikipedia-Section-Paragraphs.
Fig 2: Wikipedia-Section-Paragraphs.
Then, we need to convert the section-paragraphs into section-sentences in order to build a recognizing-entailment dataset. The following steps replace some of the rules defined by Dor et al., 2018.
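Splitting paragraphs into sentences can be sketched with a simple punctuation-based rule. This is a naive illustration; a real Persian pipeline would more likely use hazm's sentence tokenizer:

```python
import re

# Persian uses '؟' as its question mark; '.', '!' and '?' also end sentences.
_SENT_END = re.compile(r'(?<=[.!?؟])\s+')

def split_sentences(paragraph):
    """Naive sentence segmentation: split on whitespace that follows
    terminal punctuation. Abbreviations and numbers are not handled."""
    return [s.strip() for s in _SENT_END.split(paragraph) if s.strip()]

split_sentences("این جمله اول است. این جمله دوم است؟ بله.")  # → 3 sentences
```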
The following figure (Fig 3) presents the article after applying the modified rules; the result is known as Wikipedia-Section-Sentences.
Fig 3: Wikipedia-Section-Sentences.
Then, I compose pairs of sections within an article that are at least two positions apart in order. Suppose an article has four sections; the outcome of this composition is shown below:
sections = ['Section 1', 'Section 2', 'Section 3', 'Section 4']
composition = [['Section 1', 'Section 4'], ['Section 1', 'Section 3'], ['Section 2', 'Section 4']]
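The composition above (all section pairs at least two positions apart) can be generated as follows; a minimal sketch of the pairing rule:

```python
from itertools import combinations

def compose_pairs(sections, min_distance=2):
    """All section pairs whose positions differ by at least `min_distance`.
    The first section of each pair supplies the anchor/positive examples;
    the second supplies the negative example."""
    return [[a, b] for (i, a), (j, b) in combinations(enumerate(sections), 2)
            if j - i >= min_distance]

sections = ['Section 1', 'Section 2', 'Section 3', 'Section 4']
compose_pairs(sections)
# → [['Section 1', 'Section 3'], ['Section 1', 'Section 4'], ['Section 2', 'Section 4']]
```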
Each section pair specifies the order of sentence extraction. For example, the pair ['Section 1', 'Section 4'] indicates that the anchor and positive examples must be chosen from Section 1 and the negative example from Section 4. Also, note that the anchor and positive examples chosen from Section 1 should come from paragraphs at most two positions apart within that section, as shown in Fig 4.
Figure 4: Wikipedia-Triplet-Sentences.
Examples
Sentence1 (a) | Sentence2 (p) | Sentence3 (n) |
---|---|---|
جنبش های اجتمای دیگر ، از جمله موج اول فمینیسم ، رفرم اخلاقی و جنبش های میانه رو نیز در توسعه ونکوور مؤثر بودند . | ادغام پوینت گری و ونکوور جنوبی به شهر ونکوور ، آخرین مرزبندی های شهری را رقم زد و مدتی بعد آن را به سومین کلان شهر کانادا تبدیل کرد . | در سال ۲۰۰۸ ، در میان ۲۷ کلان شهر کانادا ، ونکوور هفتمین آمار جرم و جنایت را داشت که از سال ۲۰۰۵ ، سه پله سقوط کرده بود . |
یکی از ویژگی های مهم سیستم های مالیاتی ، درصد بار مالیاتی مرتبط با درآمد یا مصرف است . | یک مالیات صعودی ، مالیاتی است که به گونه ای اعمال می شود که وقتی مبلغی که به آن مالیات اعمال می شود افزایش می یابد ، نرخ مالیات مؤثر نیز افزایش می یابد . | اضافه رفاه از دست رفته باعث تنظیم مالیات ها در تراز کردن (فرصت ها در) زمین بازی تجاری نمی شود . |
Since this modified method captures thematic similarity, a similar procedure can be used to extract the D/Similar dataset, shown in Fig 5.
Figure 5: Wikipedia-D/Similar.
Examples
Sentence1 | Sentence2 | Label |
---|---|---|
در جریان انقلاب آلمان در سال های ۱۹۱۸ و ۱۹۱۹ او به برپایی تشکیلات فرایکورپس که سازمانی شبه نظامی برای سرکوب تحرکات انقلابی کمونیستی در اروپای مرکزی بود ، کمک کرد . | کاناریس بعد از جنگ در ارتش باقی ماند ، اول به عنوان عضو فرایکورپس و سپس در نیروی دریایی رایش.در ۱۹۳۱ به درجه سروانی رسیده بود . | similar |
در جریان انقلاب آلمان در سال های ۱۹۱۸ و ۱۹۱۹ او به برپایی تشکیلات فرایکورپس که سازمانی شبه نظامی برای سرکوب تحرکات انقلابی کمونیستی در اروپای مرکزی بود ، کمک کرد . | پسر سرهنگ وسل فرییتاگ لورینگوون به نام نیکی در مورد ارتباط کاناریس با بهم خوردن توطئه هیتلر برای اجرای آدمربایی و ترور پاپ پیوس دوازدهم در ایتالیا در ۱۹۷۲ در مونیخ شهادت داده است . | dissimilar |
شهر شیراز در بین سال های ۱۳۴۷ تا ۱۳۵۷ محل برگزاری جشن هنر شیراز بود . | جشنواره ای از هنر نمایشی و موسیقی بود که از سال ۱۳۴۶ تا ۱۳۵۶ در پایان تابستان هر سال در شهر شیراز و تخت جمشید برگزار می شد . | similar |
شهر شیراز در بین سال های ۱۳۴۷ تا ۱۳۵۷ محل برگزاری جشن هنر شیراز بود . | ورزشگاه پارس با ظرفیت ۵۰ هزار تن که در جنوب شیراز واقع شده است . | dissimilar |
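The pair construction behind D/Similar can be sketched as follows. This is a simplified illustration, not the repository's exact sampling code: "similar" pairs come from the same section, "dissimilar" pairs from sections at least two positions apart, mirroring the triplet composition rule:

```python
import random

def make_dsimilar_pairs(article_sections, seed=0):
    """article_sections: list of sections, each a list of sentences.
    Returns (sentence1, sentence2, label) tuples: consecutive sentences of
    one section are labeled 'similar'; a sentence paired with one from a
    section at least two positions away is labeled 'dissimilar'."""
    rng = random.Random(seed)
    pairs = []
    for i, section in enumerate(article_sections):
        # similar: adjacent sentences within the same section
        for s1, s2 in zip(section, section[1:]):
            pairs.append((s1, s2, 'similar'))
        # dissimilar: pair with a sentence from a distant section
        for j in range(i + 2, len(article_sections)):
            if section and article_sections[j]:
                pairs.append((rng.choice(section),
                              rng.choice(article_sections[j]),
                              'dissimilar'))
    return pairs
```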
Version 1.0.0
Version | Examples | Titles | Sections |
---|---|---|---|
1.0.0 | 205,768 | 21,515 | 34,298 |
Version | Train | Dev | Test |
---|---|---|---|
1.0.0 | 180,585 | 5,586 | 5,758 |
Version | Train | Dev | Test |
---|---|---|---|
1.0.0 | 126,628 | 5,277 | 5,497 |
The following table summarizes the scores obtained by each dataset and model.
Model | Dataset | Metrics (%) |
---|---|---|
parsbert-base-wikinli-mean-tokens | wiki-d-similar | Accuracy: 76.20 |
parsbert-base-wikinli | wiki-d-similar | F1: 77.84, Accuracy: 77.84 |
parsbert-base-wikitriplet-mean-tokens | wikitriplet | Cosine Accuracy: 93.33, Manhattan Accuracy: 94.40, Euclidean Accuracy: 93.31 |
parsbert-base-uncased-farstail | farstail | F1: 81.65, Accuracy: 81.71 |
bert-fa-base-uncased-farstail-mean-tokens | farstail | Accuracy: 56.45 |
Application | Notebook |
---|---|
Semantic Search | |
Clustering | |
Text Summarization | |
Information Retrieval | |
Topic Modeling | |
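Most of these applications reduce to ranking sentence embeddings by similarity. A minimal semantic-search sketch over precomputed vectors is shown below; in practice the vectors would come from the trained model via sentence-transformers' `model.encode(...)`:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_search(query_vec, corpus_vecs, top_k=3):
    """Return the indices of the `top_k` corpus embeddings most similar
    to the query embedding, best match first."""
    scored = sorted(enumerate(corpus_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:top_k]]

semantic_search([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.9, 0.9]], top_k=2)
# → [1, 2]
```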
2.0.0: New Version 🆕 !
1.0.0: Hello World!
Please cite this repository in publications as the following:
@misc{PersianSentenceTransformers,
author = {Mehrdad Farahani},
title = {Persian - Sentence Transformers},
month = dec,
year = 2020,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.4850057},
url = {https://doi.org/10.5281/zenodo.4850057}
}