Closed shlomihod closed 6 years ago
Reading.
1) Objective :
A) Identification of sentences understandable by second language learners of Swedish, which can be used in automatically generated exercises based on corpora.
B) How to exploit existing Natural Language Processing (NLP) tools to assess the suitability of the available corpus.
2) Findings :
A) Out of a number of deep linguistic indicators explored, mainly lexical-morphological and semantic features are found informative for second language sentence-level readability.
B) Classification accuracy of 71%.
C) Top 10 informative features:
Rank Feature-ID Weight
1 DiffW% 0.576
2 Sense/W 0.438
3 DiffWs 0.422
4 SentLen 0.258
5 Mod 0.223
6 KellyFr 0.215
7 NomR 0.132
8 AdvVar 0.114
9 Ddep/SentLen 0.08
10 DeepDep 0.08
3) Features considered : Refer to figure 2 in the paper.
4) Dataset :
Level Source Nr. sentences
A) Within B1 B1 (CEFR) texts 2358
B) Above B1 B2 (CEFR) texts 795
C) Above B1 Korp corpora 1528
D) Total size of dataset 4681
5) Method : Supervised classification with a linear Support Vector Machine (SVM) classifier.
6) Evaluation : Stratified 10-fold cross-validation, i.e. the proportion of labels in each fold was kept the same as that in the whole training set during the ten iterations of training and testing.
7) Results :
A) On all the 28 features
Classifier Acc F1 B1-Prec B1-Recall
Baseline 0.50 0.66 0.50 1.00
SVM 0.71 0.70 0.73 0.68
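The baseline row (B1 precision 0.50, B1 recall 1.00) looks like what you get from labelling every sentence as "within B1". The notes do not say how the baseline was defined, so treat that as an assumption. A minimal Python sketch that recomputes the metrics from the dataset sizes under that assumption (the function name `binary_metrics` is made up for illustration):

```python
# Class sizes taken from the dataset table in these notes.
within_b1 = 2358
above_b1 = 795 + 1528  # B2 (CEFR) texts + Korp corpora


def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Assumption: the baseline predicts "within B1" for every sentence,
# so all B1 sentences are true positives and all others false positives.
acc, prec, rec, f1 = binary_metrics(tp=within_b1, fp=above_b1, fn=0, tn=0)
print(f"acc={acc:.2f} prec={prec:.2f} rec={rec:.2f} f1={f1:.2f}")
```

Under this assumption accuracy and precision come out at about 0.50 and recall at 1.00, matching the table. F1 comes out at roughly 0.67, close to the reported 0.66; the small gap may be down to rounding versus truncation.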
B) On separate feature groups
Feature group Acc F1 (Nr of features)
Traditional 0.59 0.55
Syntactic 0.59 0.54
Lexical 0.70 0.70
Semantic 0.61 0.55
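The evaluation in 6) keeps the label proportions of each fold the same as in the whole set. A rough pure-Python sketch of one way to build such folds, using only the class sizes from 4). The helper `stratified_folds`, the label strings and the seed are hypothetical, and this is not necessarily how the authors split their data:

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each index to one of k folds, keeping label proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Labels rebuilt from the dataset sizes: 2358 within B1, 795 + 1528 above B1.
labels = ["within_B1"] * 2358 + ["above_B1"] * (795 + 1528)
folds = stratified_folds(labels)

for fold in folds:
    share = sum(labels[i] == "within_B1" for i in fold) / len(fold)
    print(len(fold), round(share, 3))
```

Each fold serves once as the test set while the other nine are used for training. If scikit-learn is available, `StratifiedKFold` together with `LinearSVC` would be the more usual route; the notes do not say which toolkit was actually used.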
http://www.aclweb.org/anthology/W14-1821