Inflections and the Dale-Chall-Formula

LKirst commented 3 years ago

The textstat implementation of the Dale-Chall-Formula classifies several words as difficult words that the original Dale-Chall-Formula would not. For example, Scotland, returned, giants, giant's, strongest are returned as part of textstat.difficult_words_list(text), even though the base forms return, giant, strong are all part of the easy words list.

Dale and Chall (1948, pp. 38-49) suggest that the following word forms should be considered familiar:

names of persons and places
regular plurals and possessives of words on the list
the third-person, singular forms (s or ies from y), present-participle forms (ing), past-participle forms (n), and past-tense forms (ed or ied from y), when these are added to verbs appearing on the list
comparatives and superlatives of adjectives appearing on the list
adverbs formed by adding ly to a word on the list

The complete list of rules can be found in Dale & Chall (1948).
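
To make the suffix rules above concrete, here is a minimal sketch of how a word could be mapped back to candidate base forms before the list lookup. It is not how textstat works and not a full implementation of the Dale & Chall instructions: the names EASY_WORDS, candidate_bases and is_familiar are hypothetical, the word set is a tiny illustrative subset, spelling changes such as doubled consonants or a dropped e are ignored, and names of persons and places are not handled at all.

```python
# Tiny illustrative subset; the real easy-word list is much longer.
EASY_WORDS = {"return", "giant", "strong", "carry", "quick", "take"}

# (suffix, replacement) pairs for the regular inflections listed above.
SUFFIX_RULES = [
    ("ies", "y"),  # carries -> carry
    ("ied", "y"),  # carried -> carry
    ("s", ""),     # giants -> giant
    ("ing", ""),   # returning -> return
    ("ed", ""),    # returned -> return
    ("n", ""),     # taken -> take
    ("er", ""),    # stronger -> strong
    ("est", ""),   # strongest -> strong
    ("ly", ""),    # quickly -> quick
]


def candidate_bases(word):
    """Return the set of possible base forms for a word."""
    w = word.lower()
    candidates = {w}
    # Strip a possessive marker first: giant's -> giant, giants' -> giants.
    if w.endswith("'s"):
        w = w[:-2]
    elif w.endswith("'"):
        w = w[:-1]
    candidates.add(w)
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain before the suffix.
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            candidates.add(w[: -len(suffix)] + replacement)
    return candidates


def is_familiar(word, easy_words=EASY_WORDS):
    """A word counts as familiar if any candidate base is on the list."""
    return any(base in easy_words for base in candidate_bases(word))
```

With this sketch, returned, giants, giant's and strongest are all traced back to a listed base form, while Scotland is still not recognised, because proper names cannot be detected from suffixes alone.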

I understand that most of these rules are not easy to implement for the textstat package, but to avoid confusion and maybe prompt users to check the list returned by textstat.difficult_words_list(text), the README could point out the deviation from the original Dale & Chall formula?

Source: Dale, E., & Chall, J. (1948). A Formula for Predicting Readability: Instructions. Educational Research Bulletin, 27(2), 37-54. Retrieved August 11, 2021, from http://www.jstor.org/stable/1473669

alxwrd commented 3 years ago

Hi @LKirst, thank you for raising this!

We have an open issue (#73) touching on difficult word usage. Four methods/metrics currently use difficult_words:

dale_chall_readability_score
gunning_fog
spache_readability
dale_chall_readability_score_v2

Maybe this area could do with a revisit; it may not make sense to use the same difficult_words method for everything.

dogweather commented 2 years ago

I believe this is a problem that stemming solves. E.g.:

The Dale and Chall wordlist is converted to a set of the stems of the words.
An input text's words are each mapped to their stem.
Each word is then judged to be simple if its stem is in the Dale and Chall stem list (as opposed to the word itself being present in the Dale and Chall word list).
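
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: toy_stem is a deliberately crude stand-in for a real stemmer, difficult_words is a hypothetical helper rather than textstat's function of a similar name, and the easy-word set is a tiny made-up subset of the real list.

```python
import re


def toy_stem(word):
    """Crude suffix stripper, standing in for a real stemmer."""
    w = word.lower()
    if w.endswith("'s"):
        w = w[:-2]
    for suffix in ("est", "ing", "ed", "ly", "er", "s"):
        # Keep at least three characters so short words stay intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def difficult_words(text, easy_words, stem=toy_stem):
    """Return the words in text whose stem is not in the easy stem set."""
    # Step 1: convert the word list to a set of stems.
    easy_stems = {stem(w) for w in easy_words}
    # Step 2: split the text into words (optionally with an apostrophe part).
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Step 3: a word is simple if its stem is in the easy stem set.
    return [w for w in words if stem(w) not in easy_stems]


easy = {"the", "to", "return", "giant", "strong"}  # illustrative subset
result = difficult_words("The strongest giants returned to Scotland.", easy)
```

Here result contains only Scotland. A real stemmer could be swapped in through the stem parameter, for example NLTK's PorterStemmer().stem. One caveat: a stemmer may also merge derived forms that the Dale & Chall rules would not count as familiar, so the outcome can differ from the original formula in the other direction.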

LKirst commented 2 years ago

Great idea. Could we separate regular inflection from irregular word formation using an NLTK stemmer? Could you implement your solution?

dogweather commented 2 years ago

Totally — I'll start a PR. I'll look into what NLTK supports. I can imagine providing options for the kinds of inflections accepted.

dogweather commented 2 years ago

I found a good discussion of a similar idea implemented in JavaScript:
