li-aolong / li-aolong.github.io

李傲龍的博客
https://aolong.me
82 stars 16 forks source link

7.11——GEC相关数据库整理 #2

Open li-aolong opened 5 years ago

li-aolong commented 5 years ago

1. NUCLE(非官方提供,2.2版本)

文件夹1:data 【包含预处理后的NUCLE,预处理步骤分为4步】

文件夹2:data_5types 【data中5个重点错误标签的数据】

文件夹3:m2scorer 【一种评价方法】

文件夹4:scripts【处理脚本】

2. Lang-8(非官方提供)

文件夹1:lang-8-20111007-2.0 【原始数据,包含1个文件】

["journal_id",
  "sentence_id",
  "learning_language",
  "native_language",
  ["learner_sentence1","learner_sentence2",...],
  [["correction1_to_sentence1","correction2_to_sentence1",...],
   ["correction1_to_sentence2","correction2_to_sentence2",...],
   ...],
]

文件夹2:lang-8-en-1.0 【英文数据,去除了矫正数据,包含3种文件】

Column            Description
-------------------------------------------------
1                 Number of correction
2                 Serial number
3                 the URL of the entry
4                 Sentence number. 0 is the title
5                 Sentence written by a learner of English
anything after 6  Corrected Sentences (If exists)
Column          Description
-------------------------------------------------
1               Serial number (corresponding to that of entries.(train|test))
2               Start point of the verb phrase (word offset)
3               End point of the verb phrase (word offset)
4               Head word of the verb phrase
5               Tense & aspect before correction
                Tense (PST=past, PRS=present, FTR=future, INF=infinitive, -)
                Perfect (PFT=perfect, PTP=participle, -)
                Progressive (PRG=progressive, PTP=participle, -)
6               Tense & aspect after correction
                (Same as 5)

文件夹3:lang-8-url-201012.txt 【数据链接,包含1个文件】

3. JFLEG(官方提供)

相关论文:https://aclweb.org/anthology/E17-2037

Github:https://github.com/keisks/jfleg

文件夹1:dev 【系统输入数据】

​ 示例:

So I think we can not live if old people could not find siences and tecnologies and they did not developped . 
For not use car . 
Here was no promise of morning except that we looked up through the trees we saw how low the forest had swung . 
Thus even today sex is considered as the least important topic in many parts of India . 

文件夹2:test 【系统输入数据】

​ 示例:

New and new technology has been introduced to the society .
One possible outcome is that an environmentally-induced reduction in motorization levels in the richer countries will outweigh any rise in motorization levels in the poorer countries .
Every person needs to know a bit about math , sciences , arts , literature and history in order to stand out in society .

文件夹3:EACL_exp 【该论文的实验】

文件夹4:eval 【评估实验的脚本】

4. enwiki