nishkalavallabhi/OneStopEnglishCorpus

This repository hosts the dataset described in the following paper:

OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification
Sowmya Vajjala and Ivana Lučić
2018
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 297–304. Association for Computational Linguistics.
url. bib file

Please cite the above paper if you use this corpus in your research.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Description of this repo:

Texts-SeparatedByReadingLevel/: This is the actual corpus folder, containing three sub-folders, one per reading level. A text keeps the same base name across levels, followed by the suffix -ele.txt, -int.txt, or -adv.txt according to the sub-folder it is in.
Texts-Together-OneCSVperFile/: This folder has one CSV file per text, with three columns, one per reading level. Paragraph breaks are preserved.
Sentence-Aligned/: This folder contains three text files with pairwise sentence alignments (adv-int, int-ele, adv-ele). Sentences were aligned using cosine similarity.
Processed-AllLevels-AllFiles/: This folder contains sub-folders with output files from the Stanford Parser, Stanford CoreNLP, and UPenn's Discourse Connectives Tagger.
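
As a rough illustration of how the layout above might be used, the following Python sketch groups the per-level text files by their shared base name and reads one of the per-text CSV files. Only the folder names and the -ele.txt/-int.txt/-adv.txt suffixes come from this README. The function names (group_texts_by_level, read_csv_rows), the UTF-8 encoding, and the recursive directory walk are assumptions made for the example, and the sketch does not assume whether the CSV files carry a header row.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Suffixes as listed in this README; the short level keys are our own labels.
LEVEL_SUFFIXES = {"ele": "-ele.txt", "int": "-int.txt", "adv": "-adv.txt"}


def group_texts_by_level(corpus_dir):
    """Map each base file name to a {level: path} dict.

    Walks corpus_dir recursively, so the names of the three
    sub-folders do not need to be known in advance.
    """
    grouped = defaultdict(dict)
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        for level, suffix in LEVEL_SUFFIXES.items():
            if path.name.endswith(suffix):
                base = path.name[: -len(suffix)]
                grouped[base][level] = path
                break
    return dict(grouped)


def read_csv_rows(csv_path, encoding="utf-8"):
    """Return every row of one per-text CSV as a list of cell strings.

    The encoding is an assumption. The csv module keeps line breaks
    inside quoted cells, so paragraph breaks within a cell survive.
    """
    with open(csv_path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle))
```

A call such as group_texts_by_level("Texts-SeparatedByReadingLevel") would then return one entry per text, each holding up to three paths. Check the first row returned by read_csv_rows before treating it as data, since it may be a header.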

For enquiries, contact: sowmya@iastate.edu
