7.11——GEC相关数据库整理

1. NUCLE（非官方提供，2.2版本）

文件夹1：data 【包含预处理后的NUCLE，预处理步骤分为4步】

nucle.sorted.relabeled.sgml

NUCLE语料库，character offsets
conll13st-preprocessed.conll

前三步的结果，包含在CoNLL样式列格式中的预处理数据
conll13st-preprocessed.conll.ann

第四步的结果，包含token级别的注释
conll13st-preprocessed.m2

Scorer的标准格式，token offsets

文件夹2：data_5types 【data中5个重点错误标签的数据】

文件夹3：m2scorer 【一种评价方法】

相关论文：https://www.aclweb.org/anthology/N12-1067
Github：https://github.com/nusnlp/m2scorer

文件夹4：scripts【处理脚本】

2. Lang-8（非官方提供）

文件夹1：lang-8-20111007-2.0 【原始数据，包含1个文件】

lang-8-20111007-L1-v2.dat

其数据结构如下：

["journal_id",
  "sentence_id",
  "learning_language",
  "native_language",
  ["learner_sentence1","learner_sentence2",...],
  [["correction1_to_sentence1","correction2_to_sentence1",...],
   ["correction1_to_sentence2","correction2_to_sentence2",...],
   ...],
]

文件夹2：lang-8-en-1.0 【英文数据，去除了矫正数据，包含3种文件】

entries.(train|test)

train和test文件分别有100,000和1,000个条目，用空格分开，每行是每个句子的信息，每列信息如下：

Column            Description
-------------------------------------------------
1                 Number of correction
2                 Serial number
3                 the URL of the entry
4                 Sentence number. 0 is the title
5                 Sentence written by a learner of English
anything after 6  Corrected Sentences (If exists)

tense_asp.(train|test)

使用了Stanford Parser来标记出动词短语的tense和aspect，每列信息如下：

Column          Description
-------------------------------------------------
1               Serial number (corresponding to that of entries.(train|test))
2               Start point of the verb phrase (word offset)
3               End point of the verb phrase (word offset)
4               Head word of the verb phrase
5               Tense & aspect before correction
                Tense (PST=past, PRS=present, FTR=future, INF=infinitive, -)
                Perfect (PFT=perfect, PTP=participle, -)
                Progressive (PRG=progressive, PTP=participle, -)
6               Tense & aspect after correction
                (Same as 5)

how_to_ext_vp.txt

该文件说明了如何抽取候选的动词短语

文件夹3：lang-8-url-201012.txt 【数据链接，包含1个文件】

lang-8-url-201012.txt

该文件包含334,379个多语言条目的url链接，由59,455个活跃用户编写

3. JFLEG（官方提供）

Github：https://github.com/keisks/jfleg

文件夹1：dev 【系统输入数据】

示例：

So I think we can not live if old people could not find siences and tecnologies and they did not developped . 
For not use car . 
Here was no promise of morning except that we looked up through the trees we saw how low the forest had swung . 
Thus even today sex is considered as the least important topic in many parts of India .

文件夹2：test 【系统输入数据】

示例：

New and new technology has been introduced to the society .
One possible outcome is that an environmentally-induced reduction in motorization levels in the richer countries will outweigh any rise in motorization levels in the poorer countries .
Every person needs to know a bit about math , sciences , arts , literature and history in order to stand out in society .

文件夹3：EACL_exp 【该论文的实验】

文件夹4：eval 【评估实验的脚本】

li-aolong / li-aolong.github.io

7.11——GEC相关数据库整理 #2

1. NUCLE（非官方提供，2.2版本）

2. Lang-8（非官方提供）

3. JFLEG（官方提供）

4. enwiki