phsfr / UD_Persian-PerDT

Universal Persian Dependency Treebank (UPerDT) is the result of automatic coversion of Persian Dependency Treebank (PerDT) with extensive manual corrections.
Creative Commons Attribution Share Alike 4.0 International
5 stars 6 forks source link

The Persian Universal Dependency Treebank (PerUDT) (v1.0)

Summary

The Persian Universal Dependency Treebank (PerUDT) is the result of automatic coversion of Persian Dependency Treebank (PerDT) with extensive manual corrections. Please refer to the follwoing work, if you use this data:

Introduction

The Persian Universal Dependency Treebank (PerUDT) is based on Persian Dependency Treebank (PerDT) (Rasooli et al.,2013). The original Treebank consists of 29K sentences sampled from contemporary Persian text in different genres including: news, academic papers, magazine articles and fictions.

This treebank was annotated based on a language-specific schema and its automatic conversion involved three main steps: revising tokenization, POS mapping and dependency mapping.

In tokenization step, in order to separate multiword inflections of simple verbs grouped as one token in PerDT, we followed the guidelines in (Rasooli et al., 2013, Table 3) to automatically find the main verbs. Also we automatically separated pronominal clitics.

In POS conversion step, we used the state of the art BERT-based Persian NER tagger (Taher et al.,2020) with manual corrections to extend recall. Through seven different entities detected by tagger, we used Person and Location to mark PROPN tags.

PerDT contains 43 syntactic relations with no straightforward mapping for most of them, conjunctions arranged from the beginning of the sentence to the end and more importantly, prepositions regarded as the head of prepositional phrases and auxiliary verbs as the head of sentences. So we rearranged the order of conjunctions from end to the beginning through a script and tailored rules to convert each kind of relation to its UD version properly. Through the whole process and at the end of each step, we investigated the results and applied manual corrections if it was needed.

Acknowledgements

Thanks to Morteza Rezaei-Sharifabadi for helping with the copyright of this data.

Statistics of PerUDT

Split #Sent. #Tok. #word #Type Lemma #Verbs
Train 26196 459K 34.9K 20.7K 5275
Dev 1456 26K 7.0K 5.2K 1427
Test 1455 24K 6.7K 5.1K 1671
All 29107 509K 36.7K 21.6K 5413

Data Split

We followed the same split used in PerDT: 89% of data is dedicated to train while 5% of it is used for each dev and test sections.

Feedback and Bug Reports

For feedback and bug reports, please contact rasooli@seas.upenn.edu and pegh.safari@gmail.com.

References

Metadata

=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.7
License: CC BY-SA 4.0
Includes text: yes
Genre: news fiction nonfiction academic web blog
Lemmas: converted with corrections
UPOS: converted from manual
XPOS: manual native
Features: converted with corrections
Relations: converted with corrections
Contributors: Rasooli, Mohammad Sadegh; Safari, Pegah; Moloodi, Amirsaeid; Nourian, Alireza
Contributing: here source
Contact: rasooli@seas.upenn.edu, pegh.safari@gmail.com
===============================================================================