textstat / textstat

Python package to calculate readability statistics of a text object: paragraphs, sentences, articles.
https://textstat.org
MIT License

Inflections and the Dale-Chall-Formula #150

Open LKirst opened 3 years ago

LKirst commented 3 years ago

The textstat implementation of the Dale-Chall formula classifies as difficult several words that the original Dale-Chall formula would not. For example, Scotland, returned, giants, giant's, and strongest are all returned by textstat.difficult_words_list(text), even though the base forms return, giant, and strong are on the easy words list.

Dale and Chall (1948, p. 38-49) suggest that the following word forms should be considered familiar:

The complete list of rules can be found in Dale & Chall (1948).

I understand that most of these rules are not easy to implement in the textstat package. But to avoid confusion, and perhaps to prompt users to check the list returned by textstat.difficult_words_list(text), could the README point out the deviation from the original Dale & Chall formula?

Source: Dale, E., & Chall, J. (1948). A Formula for Predicting Readability: Instructions. Educational Research Bulletin, 27(2), 37-54. Retrieved August 11, 2021, from http://www.jstor.org/stable/1473669

alxwrd commented 3 years ago

Hi @LKirst, thank you for raising this!

We have an open issue (#73) touching on difficult word usage. Four methods/metrics currently use difficult_words:

Maybe this area could do with a revisit; it may not make sense to use the same difficult_words method for everything.

dogweather commented 2 years ago

I believe this is a problem that stemming solves. E.g.:

  1. The Dale and Chall wordlist is converted to a set of the stems of the words.
  2. An input text's words are each mapped to their stem.
  3. Each word is then judged to be simple if its stem is in the Dale and Chall stem list (as opposed to the word itself being present in the Dale and Chall word list).
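
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not textstat's implementation: `EASY_WORDS` is a tiny hypothetical stand-in for the real Dale-Chall list, and `crude_stem` is a toy suffix stripper used only so the example runs without extra dependencies. The names `difficult_words` and `crude_stem` are made up for this sketch.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical stand-in for the real Dale-Chall easy-word list.
EASY_WORDS = {"return", "giant", "strong", "the", "of"}


def crude_stem(word: str) -> str:
    """Toy suffix stripper for illustration; NOT a real stemming algorithm."""
    word = word.lower()
    for suffix in ("'s", "est", "ing", "ed", "er", "s"):
        # Strip the first matching suffix, keeping at least three characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def difficult_words(
    text: str,
    easy_words: Iterable[str],
    stem: Callable[[str], str] = crude_stem,
) -> List[str]:
    """Return the sorted, lowercased words whose stem is not an easy stem."""
    # Step 1: convert the easy-word list to a set of stems.
    easy_stems = {stem(w) for w in easy_words}
    # Step 2: tokenize the text (keeping possessives such as "giant's").
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Step 3: a word is difficult if its stem is not among the easy stems.
    return sorted({t.lower() for t in tokens if stem(t) not in easy_stems})


text = "The giants returned. The giant's strongest castle."
print(difficult_words(text, EASY_WORDS))                  # stem matching
print(difficult_words(text, EASY_WORDS, stem=str.lower))  # exact matching
```

Because the stemmer is passed in as a parameter, a real one could be swapped in, for example the `stem` method of NLTK's `PorterStemmer`. Note that stemming alone would not cover proper names such as Scotland from the original report.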

LKirst commented 2 years ago

Great idea. Could we separate regular inflection from irregular word formation using an NLTK stemmer? Could you implement your solution?

dogweather commented 2 years ago

> Great idea. Could we separate regular inflection from irregular word formation using an NLTK stemmer? Could you implement your solution?

Totally — I'll start a PR. I'll look into what NLTK supports. I can imagine providing options for the kinds of inflections accepted.
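
As a rough idea of what such options might look like, here is a small sketch. Everything in it is an assumption for illustration: the option names (`possessive`, `plural`, `verb`, `comparison`), the `INFLECTION_SUFFIXES` table, and the `is_easy` helper are hypothetical and not part of textstat or NLTK. It only handles regular suffixes; irregular forms are deliberately left unmatched.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical option names mapped to the regular suffixes they strip.
INFLECTION_SUFFIXES: Dict[str, Tuple[str, ...]] = {
    "possessive": ("'s",),
    "plural": ("es", "s"),
    "verb": ("ed", "ing"),
    "comparison": ("er", "est"),
}


def is_easy(word: str, easy_words: Set[str], accept: Iterable[str] = ()) -> bool:
    """Return True if the word itself, or a base form reached by stripping
    one suffix of an accepted inflection kind, is in easy_words."""
    word = word.lower()
    if word in easy_words:
        return True
    for kind in accept:
        for suffix in INFLECTION_SUFFIXES[kind]:
            # Only regular suffix stripping; no spelling changes are undone.
            if word.endswith(suffix) and word[: -len(suffix)] in easy_words:
                return True
    return False


easy = {"return", "giant", "strong"}
print(is_easy("giants", easy))                             # no inflections accepted
print(is_easy("giants", easy, accept=["plural"]))          # plural accepted
print(is_easy("strongest", easy, accept=["comparison"]))   # comparison accepted
```

With this shape, callers choose which inflection kinds count as familiar, and an irregular form such as "went" stays difficult unless it appears in the list itself.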

dogweather commented 2 years ago

I found a good discussion of a similar idea implemented in JavaScript: