Column Description
-------------------------------------------------
1 Number of correction
2 Serial number
3 the URL of the entry
4 Sentence number. 0 is the title
5 Sentence written by a learner of English
anything after 6 Corrected Sentences (If exists)
tense_asp.(train|test)
使用了Stanford Parser来标记出动词短语的tense和aspect,每列信息如下:
Column Description
-------------------------------------------------
1 Serial number (corresponding to that of entries.(train|test))
2 Start point of the verb phrase (word offset)
3 End point of the verb phrase (word offset)
4 Head word of the verb phrase
5 Tense & aspect before correction
Tense (PST=past, PRS=present, FTR=future, INF=infinitive, -)
Perfect (PFT=perfect, PTP=participle, -)
Progressive (PRG=progressive, PTP=participle, -)
6 Tense & aspect after correction
(Same as 5)
So I think we can not live if old people could not find siences and tecnologies and they did not developped .
For not use car .
Here was no promise of morning except that we looked up through the trees we saw how low the forest had swung .
Thus even today sex is considered as the least important topic in many parts of India .
文件夹2:test 【系统输入数据】
示例:
New and new technology has been introduced to the society .
One possible outcome is that an environmentally-induced reduction in motorization levels in the richer countries will outweigh any rise in motorization levels in the poorer countries .
Every person needs to know a bit about math , sciences , arts , literature and history in order to stand out in society .
1. NUCLE(非官方提供,2.2版本)
文件夹1:data 【包含预处理后的NUCLE,预处理步骤分为4步】
nucle.sorted.relabeled.sgml
NUCLE语料库,character offsets
conll13st-preprocessed.conll
前三步的结果,包含在CoNLL样式列格式中的预处理数据
conll13st-preprocessed.conll.ann
第四步的结果,包含token级别的注释
conll13st-preprocessed.m2
Scorer的标准格式,token offsets
文件夹2:data_5types 【data中5个重点错误标签的数据】
文件夹3:m2scorer 【一种评价方法】
文件夹4:scripts【处理脚本】
2. Lang-8(非官方提供)
文件夹1:lang-8-20111007-2.0 【原始数据,包含1个文件】
lang-8-20111007-L1-v2.dat
其数据结构如下:
文件夹2:lang-8-en-1.0 【英文数据,去除了矫正数据,包含3种文件】
entries.(train|test)
train和test文件分别有100,000和1,000个条目,用空格分开,每行是每个句子的信息,每列信息如下:
tense_asp.(train|test)
使用了Stanford Parser来标记出动词短语的tense和aspect,每列信息如下:
how_to_ext_vp.txt
该文件说明了如何抽取候选的动词短语
文件夹3:lang-8-url-201012.txt 【数据链接,包含1个文件】
lang-8-url-201012.txt
该文件包含334,379个多语言条目的url链接,由59,455个活跃用户编写
3. JFLEG(官方提供)
相关论文:https://aclweb.org/anthology/E17-2037
Github:https://github.com/keisks/jfleg
文件夹1:dev 【系统输入数据】
示例:
文件夹2:test 【系统输入数据】
示例:
文件夹3:EACL_exp 【该论文的实验】
文件夹4:eval 【评估实验的脚本】
4. enwiki