shlomihod / deep-text-eval

Differentiable Readability Measure Regularizer for Neural Network Automatic Text Simplification

Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts #9

Closed shlomihod closed 6 years ago

shlomihod commented 6 years ago

http://www.aclweb.org/anthology/N07-1058

vageeshSaxena commented 6 years ago

Reading..

vageeshSaxena commented 6 years ago

1) Objective: show that pedagogically motivated grammatical features (e.g., passive voice, rather than the number of words per sentence) can improve readability measures based on lexical features alone.
2) The prediction system is based on both vocabulary and grammatical features.
3) The grammar-based predictions are combined, using confidence scores, with the vocabulary-based predictions to produce more accurate predictions of reading difficulty for both first (L1) and second (L2) language texts.
4) Language-model readability prediction for first language texts (based on linear regression using 2 or 3 variables):
   - To build a statistical model of text, training examples are used to collect statistics such as word frequency and order.
   - Each training example has a label that tells the model the "true" category of the example.
   - Advantages:
     - A language-modeling approach generally gives much better accuracy for Web documents and short passages.
     - It provides a probability distribution across all grade models, not just a single prediction.
     - It provides more data on the relative difficulty of each word in the document.
5) Grammatical-construction readability prediction for second language texts. Features for grammar-based prediction:
   - Step 1: syntactically parse the document.
     - Stanford Parser, trained on the Penn Treebank.
     - PCFG scores from the parser were also used to filter out some of the ill-formed text present in the test corpora.
   - Step 2: use the TGrep2 tool (a tree-structure search tool) to identify instances of the target patterns, such as dominance, sisterhood, precedence, and other relationships between nodes in a sentence's parse tree.
   - Step 3: calculate the rate of occurrence of the constructions on a per-word basis, so that the complexity measure is independent of sentence length.
   - Step 4: a second feature set was defined, consisting of 12 grammatical features that can be identified without computationally intensive syntactic parsing. These include sentence length, the various English verb forms (present, progressive, past, perfect, and continuous tenses), and part-of-speech labels for words. The goal of the second feature set was to examine how dependent prediction quality is on a specific set of features, and to test the extent to which the output of syntactic parsing improves prediction accuracy.
6) Algorithm for grammatical-feature-based classification: kNN classification.
7) Results:
   - On both the first and second language corpora, the language-modeling approach alone produced more accurate predictions than the grammar-based approach alone: the mean squared error values were lower, and the correlation coefficients were higher, for the LM predictor than for the grammar-based predictor.
   - The interpolated predictions, combined using the kNN confidence measure, were slightly, and in most tests significantly, more accurate in terms of mean squared error than the predictions from either single measure. Interpolation using the first set of grammatical features led to 7% and 22% reductions in mean squared error on the L1 and L2 corpora, respectively.
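To make the language-modeling idea in point 4 concrete, here is a minimal sketch (not the paper's implementation): train one smoothed unigram model per grade level, then score a new text under every model and normalize the likelihoods into a probability distribution over grades. The toy texts, the add-one smoothing, and the assumed vocabulary size are all illustrative choices, not details from the paper.

```python
import math
from collections import Counter

def train_unigram(texts):
    """Collect word counts for one grade level's training texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def log_prob(counts, text, vocab_size=10000):
    """Log-likelihood of a text under a unigram model with
    add-one (Laplace) smoothing, so unseen words get non-zero mass."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in text.lower().split()
    )

# One toy model per grade level (illustrative data only).
grade_models = {
    3: train_unigram(["the cat sat on the mat", "we like to play"]),
    9: train_unigram(["photosynthesis converts light energy",
                      "the constitution establishes federal law"]),
}

def grade_distribution(text):
    """Normalize the per-grade likelihoods into a distribution over
    all grade models, not just a single prediction."""
    scores = {g: log_prob(m, text) for g, m in grade_models.items()}
    mx = max(scores.values())
    exps = {g: math.exp(s - mx) for g, s in scores.items()}
    z = sum(exps.values())
    return {g: e / z for g, e in exps.items()}

dist = grade_distribution("light energy converts")
```

Because the query shares vocabulary with the grade-9 toy texts, the distribution leans toward grade 9 while still assigning some mass to grade 3, which is the advantage the summary highlights over a single hard prediction.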
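The kNN classification and confidence-based interpolation in points 6-7 can also be sketched. The following is an assumed, simplified reading of that scheme: the kNN confidence is taken to be the fraction of the k nearest neighbours agreeing with the majority vote, and the combination is a confidence-weighted linear interpolation of the two grade predictions. The feature vectors (per-word rates of grammatical constructions) and all values are made up for illustration.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, grade) pairs.
    Returns (grade, confidence), where confidence is the fraction of
    the k nearest neighbours that agree with the majority label."""
    by_dist = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(grade for _, grade in by_dist[:k])
    grade, n_agree = votes.most_common(1)[0]
    return grade, n_agree / k

def interpolate(lm_grade, knn_grade, confidence):
    """Confidence-weighted interpolation of the two predictors."""
    return confidence * knn_grade + (1 - confidence) * lm_grade

# Toy per-word grammatical feature vectors,
# e.g. (passive-voice rate, past-tense rate) per word.
train = [
    ((0.01, 0.02), 3),
    ((0.02, 0.03), 3),
    ((0.08, 0.10), 9),
    ((0.09, 0.12), 9),
]

grade, conf = knn_predict(train, (0.085, 0.11), k=3)
combined = interpolate(lm_grade=8.0, knn_grade=grade, confidence=conf)
```

With these toy numbers the three nearest neighbours split 2-to-1 for grade 9, so the kNN prediction gets weight 2/3 and the LM prediction weight 1/3, pulling the combined estimate between the two single-measure predictions.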