liyongsea / parallel_corpus_mnbvc

parallel corpus dataset from the mnbvc project
Apache License 2.0

[Alignment] Propose alignment algorithm draft #3

Closed liyongsea closed 1 year ago

liyongsea commented 1 year ago

Given a UN article in 6 languages, propose a method to align paragraphs/sentences. Some ideas:

One can draw inspiration from

liyongsea commented 1 year ago

@Wzixiao to share a small sample dataset with PDFs and converted text files

voidf commented 1 year ago

Some findings:

Rules:

voidf commented 1 year ago

Symbol tables:

import re
import string

regular_exp = {
    # Except for Chinese, sentences in every language contain spaces
    'ar': re.compile(r'[\u0600-\u06ff ]+'),
    'zh': re.compile(r'[\u3006\u3007\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002ebef\U00030000-\U0003134f]+'),
    'fr': re.compile(r'[a-zA-ZÀ-Ÿ ]+'),
    'es': re.compile(r'[a-zA-ZáéíóúñÁÉÍÓÚÑüÜ ]+'),
    # Letters and spaces only; a comma inside the class would be matched literally
    'ru': re.compile(r'[А-яЁё ]+'),
    'en': re.compile(r'[A-Za-z ]+'),
}

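period = {
    'zh': '。',
    # All of the following use the half-width ASCII period.
    # Arabic belongs here too: '،' (U+060C) is the Arabic comma, not a full stop.
    'ar': '.',
    'en': '.',
    'es': '.',
    'fr': '.',
    'ru': '.',
}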

punctuations = {
    'ar': {
        '.': '.',  # full stop
        '!': '!',  # exclamation mark
        '؟': '?',  # question mark
        '،': ',',  # Arabic comma (U+060C)
        '؛': ';',  # semicolon
        ':': ':',  # colon
        '“': '"',  # left quotation mark
        '”': '"',  # right quotation mark
    },
    'zh': {
        # Full-width keys map to their half-width ASCII equivalents
        ',': ',',
        '。': '.',
        ':': ':',
        '?': '?',
        '!': '!',
        ';': ';',
        '“': '"',
        '”': '"',
        '(': '(',
        ')': ')',
    },
}

# ASCII punctuation plus every language-specific mark listed above
all_punctuation_set = set(string.punctuation)
for mapping in punctuations.values():
    all_punctuation_set.update(mapping.keys())

digits = {
    'ar': {
        '٠': 0,
        '١': 1,
        '٢': 2,
        '٣': 3,
        '٤': 4,
        '٥': 5,
        '٦': 6,
        '٧': 7,
        '٨': 8,
        '٩': 9,
    },
    'zh': {
        '零': 0,
        '一': 1,
        '二': 2,
        '三': 3,
        '四': 4,
        '五': 5,
        '六': 6,
        '七': 7,
        '八': 8,
        '九': 9,
        '十': 10,
    }
}
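
As a rough illustration of how tables like these could be applied, here is a minimal sketch that folds a punctuation map and a digit map into one `str.translate` table. The names `AR_PUNCT`, `AR_DIGITS`, `build_translation` and `normalize` are made up for this example, and the tables are trimmed copies so the snippet runs on its own. It only covers positional digits such as the Arabic-Indic ones; Chinese numerals like 十 cannot be converted character by character this way.

```python
# Hypothetical, trimmed-down copies of the tables above (assumption: same shape).
AR_PUNCT = {
    '\u061f': '?',  # Arabic question mark
    '\u060c': ',',  # Arabic comma
    '\u061b': ';',  # Arabic semicolon
}
# Arabic-Indic digits occupy U+0660..U+0669 in order.
AR_DIGITS = {chr(0x0660 + value): value for value in range(10)}


def build_translation(punct, digit_map):
    """Merge a punctuation map and a digit map into one str.translate table."""
    table = dict(punct)
    table.update({ch: str(value) for ch, value in digit_map.items()})
    return str.maketrans(table)


def normalize(text, table):
    """Replace language-specific punctuation and digits with ASCII ones."""
    return text.translate(table)


AR_TABLE = build_translation(AR_PUNCT, AR_DIGITS)
# Arabic-Indic "2003" followed by an Arabic question mark.
normalized = normalize('\u0662\u0660\u0660\u0663\u061f', AR_TABLE)
```

Normalizing digits and punctuation before alignment may make numbers and sentence boundaries comparable across the six languages, which is presumably what these tables are for.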
voidf commented 1 year ago

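Pages containing __ have a footnote block; pages with many ....... runs are contents pages.

For a contents page, each line can be split on the ....... run and the title on the left extracted. These titles almost always reappear in the body text as a standalone line or broken across a few lines, so we can remove them to keep them from interfering with sentence reconstruction.

An example follows. Contents page: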

Contents
Paragraphs Page
 I. Introduction .......................................................... 15
 II. Visit to Myanmar and related activities .................................... 2–8 5
 III. Human rights-related developments ....................................... 9–26 6
 IV. Proposed independent assessment of allegations of human rights violations in
ethnic areas of Myanmar ................................................ 27–63 11
 V. Concluding observations and recommendations ............................. 64–69 18
Annexes
 I. List of persons interviewed by the Special Rapporteur during his visit to Insein prison ....... 20
 II. Independent assessment of allegations of human rights violations in Shan State by the
Special Rapporteur on the situation of human rights in Myanmar ........................ 21
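
A minimal sketch of the contents-page handling described above, using a few of the lines just shown. The names `DOT_LEADER`, `ENTRY_START`, `is_contents_page` and `extract_titles`, the five-dot threshold and the three-line minimum are all assumptions for illustration, not the code used in the project. A title wrapped over two lines is rejoined only when its first line starts with a Roman-numeral marker; longer wraps are not handled.

```python
import re

# A dot leader: a run of at least five periods (threshold is an assumption).
DOT_LEADER = re.compile(r'\.{5,}')
# Start of a numbered entry such as " IV. " (Roman numerals only, by assumption).
ENTRY_START = re.compile(r'^\s*[IVXLC]+\s?\.\s')


def is_contents_page(lines, min_leader_lines=3):
    """Heuristic: a page with several dot-leader lines is a contents page."""
    return sum(1 for line in lines if DOT_LEADER.search(line)) >= min_leader_lines


def extract_titles(lines):
    """Return the text to the left of each dot leader, rejoining two-line titles."""
    titles, pending = [], ''
    for line in lines:
        match = DOT_LEADER.search(line)
        if match is None:
            # Remember the line only if it looks like the first half of an entry.
            pending = line.strip() if ENTRY_START.match(line) else ''
            continue
        left = line[:match.start()].strip()
        titles.append(f'{pending} {left}'.strip())
        pending = ''
    return titles


sample = [
    'Contents',
    'Paragraphs Page',
    ' I. Introduction .......................................................... 15',
    ' IV. Proposed independent assessment of allegations of human rights violations in',
    'ethnic areas of Myanmar ................................................ 27–63 11',
    'Annexes',
    ' I. List of persons interviewed by the Special Rapporteur during his visit to Insein prison ....... 20',
]
titles = extract_titles(sample)
```

Header lines such as "Contents" and "Annexes" carry no leader and no numbering, so they are skipped rather than glued onto the next title.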

Body page:

change in this country. A number of international bodies and Member States are
assessing the scale and scope of their cooperation with the Myanmar Government.Of particular importance, Myanmar’s friends and neighbours in the Association ofSouth-East Asian Nations collectively called for the release of Daw Aung San SuuKyi during their annual Foreign Ministers meeting in June.
26. The economic and humanitarian situation remains precarious. Massive
inflation has pervaded the country as prices for commodities affecting the basiclivelihood of people had jumped. Since the beginning of the year, publictransportation fares (for airplanes, trains and buses) and telephone rates are reportedto have increased at least three times. Late-year floods have also reportedlycontributed to the increase in consumer prices for rice and other edible goods. Noofficial announcement was made on the price hikes, nor have there been any reportson a pay adjustment for civil servants. As people have sought ways and means tocope with inflation, many have reportedly lost a substantial amount of their savingsby investing in private companies lured by promises of high interest. These so-called “investment companies” have collapsed, taking with them the savings of localresidents. As the Government has lost its credibility, owing to its track record, itsattempts to restore confidence through any announcements are met with scepticismand further panic, exacerbated by a shortage of money and rumours ofdemonetization. This, in turn, has prompted a crisis of confidence in the privatebanking system, leading to a paralysis of the economy with significant consequencesfor the future prosperity of the country. Many businesses have suffered as a result.All of this throws into doubt the ability of the economy to generate the capitalessential for the successful implementation of the new and to be welcomedliberalized market in paddy and rice.
IV . Proposed independent assessment of allegations of human
rights violations in ethnic areas of Myanmar
A. Follow-up process
27. During the reporting period, the Special Rapporteur continued to pursue his
efforts to obtain access to ethnic minority areas to investigate allegations of serioushuman rights violations (on earlier efforts, see E/CN.4/2003/41, paras. 35-46).
28. In response to his communications sent to the Myanmar authorities in
November and December 2002, the Special Rapporteur received, on 22 January2003, informal suggestions from the Myanmar Mission in Geneva regarding thepossibility of the proposed independent assessment. It was suggested that theassessment be combined with his regular mission, that his team not comprise morethan five experts and that he be present in the country throughout the whole durationof the assessment mission, which could be up to three weeks.
29. In his letter of 24 January, the Special Rapporteur advised the Myanmar
Ambassador in Geneva that, owing to time constraints, in view of his othercommitments, as well as operational considerations involving the need to agree onthe detailed terms of reference for the mission, secure funding and logistics, and toidentify and recruit the experts and interpreters, he would be unable to undertake acombined mission at such short notice in March 2003. The Special Rapporteurindicated that the most appropriate option would be to take the advantage of his visit

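Titles spanning three lines are very rare, so for now the plan is to scan each pair of consecutive lines and compute their edit distance against every contents entry to decide whether they form a title line. If they do, remove them.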
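
The check described above could look roughly like the following sketch. It is not the filtering code from this thread: `edit_distance`, `squash`, `is_title`, `drop_title_lines` and the 10% threshold are assumptions chosen for illustration. Whitespace is removed before comparing because the extracted text loses and gains spaces (compare "IV ." in the body with "IV." in the contents).

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def squash(text):
    """Lowercase and drop all whitespace before comparing."""
    return ''.join(text.split()).lower()


def is_title(candidate, titles, max_ratio=0.1):
    """True if candidate is within max_ratio relative edit distance of a title."""
    cand = squash(candidate)
    for title in titles:
        ref = squash(title)
        if ref and edit_distance(cand, ref) <= max_ratio * len(ref):
            return True
    return False


def drop_title_lines(lines, titles, max_ratio=0.1):
    """Remove lines matching a contents title, alone or joined with the next line."""
    kept, i = [], 0
    while i < len(lines):
        if is_title(lines[i], titles, max_ratio):
            i += 1  # heading on a single line
        elif i + 1 < len(lines) and is_title(lines[i] + lines[i + 1], titles, max_ratio):
            i += 2  # heading broken across two lines
        else:
            kept.append(lines[i])
            i += 1
    return kept


toc_titles = ['IV. Proposed independent assessment of allegations of human '
              'rights violations in ethnic areas of Myanmar']
body = [
    'liberalized market in paddy and rice.',
    'IV . Proposed independent assessment of allegations of human',
    'rights violations in ethnic areas of Myanmar',
    'A. Follow-up process',
    '27. During the reporting period, the Special Rapporteur continued to pursue his',
]
cleaned = drop_title_lines(body, toc_titles)
```

The pure-Python distance is quadratic in line length and is run against every contents entry, so a faster implementation would probably be needed at the scale of the full corpus.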

voidf commented 1 year ago

6 Current filtering code

liyongsea commented 1 year ago

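Document segmentation and alignment: 夜夜 is fully responsible for this part. I have the following requirements: 1) The solution has to be realistic. Since we currently have no large compute budget, a relatively simple model is preferable. Ideally all 120,000 documents can be processed within a week. 2) After alignment, do at least a small-scale evaluation. You can build a small dataset yourself or use the original UN dataset.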
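
One possible shape for the small-scale evaluation mentioned in point 2 is strict precision/recall/F1 over alignment links, sketched below. The function name `alignment_prf` and the link representation (a pair of source and target paragraph index lists) are assumptions for illustration; the thread does not specify a metric.

```python
def alignment_prf(predicted, gold):
    """Strict precision, recall and F1 over alignment links.

    Each link is a pair (source_ids, target_ids). The id lists are turned
    into tuples so that whole links can be compared as set members.
    """
    pred = {(tuple(src), tuple(tgt)) for src, tgt in predicted}
    ref = {(tuple(src), tuple(tgt)) for src, tgt in gold}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example: the gold data merges source paragraphs 1 and 2 into target 1,
# while the prediction splits them, so only two of its four links are correct.
gold = [([0], [0]), ([1, 2], [1]), ([3], [2])]
predicted = [([0], [0]), ([1], [1]), ([2], []), ([3], [2])]
precision, recall, f1 = alignment_prf(predicted, gold)
```

A link counts as correct only if both sides match the gold link exactly, which is a deliberately harsh choice; a more lenient variant could give partial credit for overlapping links.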

voidf commented 1 year ago

Notes on bertalign:

voidf commented 1 year ago

Progress update, May 7

Some issues

Other directions

voidf commented 1 year ago

Tasks that (may) need extra hands

liyongsea commented 1 year ago