hugolpz opened 1 year ago
Used the Textstat library in Python to calculate phrase difficulties.
Textstat has many metrics, most tailored for English. However, the Flesch Reading Ease metric supports other languages, as shown below.
Understanding the Flesch Reading Ease scores:
Formula used:
With this library, we expect to be able to calculate the difficulty level of French questions as well.
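For illustration, here is a minimal pure-Python sketch of the English Flesch Reading Ease computation (FRE = 206.835 − 1.015 × words/sentence − 84.6 × syllables/word). The syllable counter below is a crude vowel-group heuristic, not Textstat's; Textstat also adjusts the weights per language.

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """English Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease("The cat sat on the mat. The dog ran fast.")
# Short words and short sentences yield a high (= easy) score.
```

Note how the formula only sees word and sentence lengths, never word rarity; that limitation is discussed further down.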
Most supported languages are Latin-based; however, Textstat does support Arabic via the OSMAN metric.
OSMAN score for text (designed for Arabic; an adaptation of the Flesch and Fog formulas that introduces a new factor called "Faseeh"), based on this academic paper.
With the help of @AmrMohamed226, we can collaborate to do the same for Arabic using the Textstat library.
It is harder to find a unified method for all languages, especially low-resource languages, but this is still a work in progress.
The current demo (code & CSV files uploaded) shows the Flesch Reading Ease score. For English, since many more metrics are available, we could switch to others; the code is ready-made and adaptable to do so. (Note: some pre-processing was required to calculate the English difficulty score, since the raw text data had to be adapted according to its objective: vocabulary, grammar, or verb conjugation.)
Next steps:
References:
After discussion, the Flesch Reading Ease was deemed too simple and unsuitable for our needs, for the following reasons:

A sentence such as "The nucleus wall is thin." will be judged simple by Flesch, since the average word length and sentence length are short. The syntax is indeed very easy, but the formula does not take into account that "nucleus" is a rare, technical word learned in science class, so most people won't have a clue. The Flesch formula assumes understanding of individual words, which is precisely the weak point of our exercises and of our language-learning participants.

Mathematically, the Flesch formula works for native speakers because it assumes all basic components (words, syllables) are known to the reader. In our case, it can be used to assess syntactic complexity; but because our target users are non-native, we need to add a lexical assessment based on frequency, which correlates strongly with the learning path. The number of syllables per word is a measure of lexical complexity, but it is not enough for non-native speakers with a strongly limited vocabulary.
There are other metrics, such as Gunning Fog:
Source: https://en.wikipedia.org/wiki/Gunning_fog_index
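A minimal sketch of the original Gunning Fog index, where "complex" is the classic definition (three or more syllables); the syllable counter is the same crude vowel-group heuristic used above, not a dictionary-based one:

```python
import re

def count_syllables(word):
    # Crude heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Gunning Fog index: 0.4 * (avg sentence length + % of complex words),
    where 'complex' means three or more syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_complex = sum(count_syllables(w) >= 3 for w in words)
    return 0.4 * (len(words) / len(sentences) + 100 * n_complex / len(words))

fog = gunning_fog("The cat sat on the mat.")  # no complex words
```

The 0.4 weight and the syllable-based notion of "complex" are exactly the two parts we end up adjusting below.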
However, we decided to make some adjustments to the formula: we implemented our own definition of complex words using word frequency instead of the original syllable-based approach. After some further research, word frequency can be used as a proxy for the difficulty level of a word. Modern text-assessment services such as the Twinword API use word-frequency approaches. Why is that?
“Research show that frequency of a word might be correlated to its difficulty. The observation that the words that occurs in texts less often can be considered more difficult is one of the best and most widely used methods of estimating a word's difficulty. Such observations lead to creation of statistical word assessment, defining difficulty as a measure how often the word occurs in the given domain or in everyday life, meaning how easy it is to be seen. If the word is often seen in text than it is treated as easy for most of the population and vice versa” Source: Jagoda & Boiński (2019) https://www.researchgate.net/publication/322996917_Assessing_Word_Difficulty_for_Quiz-Like_Game
Jagoda & Boiński (2019) implemented 3 algorithms assessing word difficulty.
Algorithm 3 returns the best results; however, it is not ideal for our multilingual app, since it combines algorithm 2 with the Flesch formula. The Flesch formula is mostly used for English, and hence its weights are tuned for English. The Python Textstat library does provide weights adjusted for French (and other languages) for the Flesch Reading Ease, plus the OSMAN metric for Arabic (an adaptation of the Flesch and Fog formulas that introduces a new factor called "Faseeh"). But it still means that we won't be able to use it for Chinese, Japanese, Hindi, and other future languages that we might implement.
Algorithm 2 seems a better fit for our context and approach by using the length of the word and word frequency. We decided to create an initial difficulty level using Algorithm 2, which goes as follows:
Source: Jagoda & Boiński (2019).
The algorithm's general premise is that difficult words are those that are both uncommon and lengthy, while simple words tend to be short or frequently used. The final set of words W is again sorted from the easiest word to the most difficult. The score is determined by multiplying the word's length by the number of times it appears in the text; under this method, the word with the lowest score is the one that combines length and rarity. If two words have similar frequency, the longer one is considered more challenging.
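As a toy illustration of this premise (our own sketch, not the paper's code), scoring a tiny corpus by length × occurrence count:

```python
from collections import Counter

# Toy corpus; in practice T would be a large reference corpus.
corpus = "the small cell wall is thin and the nucleus is small".split()
counts = Counter(corpus)

# Score each word as its length multiplied by its occurrence count,
# then sort ascending (per the description above, low score = difficult).
scores = {w: len(w) * counts[w] for w in counts}
ranked = sorted(scores, key=scores.get)
```

On this toy corpus, short one-off words like "and" end up ranked as hardest, which hints at the counter-intuitive behaviour addressed by the adaptations below.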
Further adaptations and assumptions were made:
For the calculation of s: for a difficult word, the frequency should be low and the length should be large. According to the original formula, suppose lmax = 18 and that the words 'the' (length 3) and 'international' (length 13) have the same frequency. The relative length would then be 3/18 for 'the' and 13/18 for 'international', so the score would be lower for 'the' and higher for 'international', translating to 'the' being the harder word (a lower score meaning harder). That is counter-intuitive, hence the final formula was changed to: s = (1/lw) * (c/oMax).
Note: we could also use lmax/lw; the goal here is to have an inverse relationship between score and length.
The second change is the use of the Zipf frequency from the Python library wordfreq instead of the raw number of occurrences, simply because it is less memory-intensive: we don't have to read a whole corpus T, since wordfreq can serve as a lookup table.
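Combining the inverted length term with the Zipf frequency gives a minimal sketch of the adapted score s = (1/lw) * (c/oMax). The Zipf values below are hypothetical placeholders; real code would call wordfreq.zipf_frequency(word, 'en') instead of the dictionary lookup, and the normalisation constant is an assumption.

```python
# Hypothetical Zipf values standing in for wordfreq.zipf_frequency(word, "en");
# the real library returns values on a roughly 0-8 scale.
ZIPF = {"the": 7.7, "international": 5.0, "polyglot": 2.9}
ZIPF_MAX = 8.0  # assumed normalisation constant for the frequency term

def adapted_score(word):
    """Adapted Algorithm 2 score: s = (1 / word_length) * (zipf / zipf_max).
    A low score now means long and/or rare, i.e. difficult."""
    return (1 / len(word)) * (ZIPF[word] / ZIPF_MAX)
```

With these placeholder values, 'polyglot' scores below 'international', which scores below 'the', matching the intended ordering.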
Once we have all words sorted according to their difficulty score, we implement an adapted version of the Gunning Fog formula by dropping the 0.4 weight (to be re-evaluated), that is:
Final difficulty = average sentence length + percentage of complex words
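A minimal sketch of this adapted formula; the difficult-word set here is a hypothetical stand-in for the percentile-based list described further down:

```python
import re

# Hypothetical difficult-word list; the real one comes from the word scores.
DIFFICULT = {"nucleus", "polyglot"}

def final_difficulty(text):
    """Adapted Gunning Fog with the 0.4 weight dropped:
    average sentence length + percentage of difficult words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    per_diff_words = 100 * sum(w in DIFFICULT for w in words) / len(words)
    avg_sentence_length = len(words) / len(sentences)
    return avg_sentence_length + per_diff_words

d = final_difficulty("The nucleus wall is thin.")  # 5 + 20 = 25
```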
Here lies another issue: even if we have a table of words and their associated difficulties, what would be the cutoff score to determine whether a word is easy or difficult?
Potential solutions could use unsupervised machine-learning clustering techniques to identify clusters, and therefore groups, of words. But for a simple way to distinguish easy from hard, we were instead inspired by the ideas of Edgar Dale and Jeanne Chall, who used a list of 763 words that 80% of fourth-grade students were familiar with ("no", "yes", and other such very basic words) to determine which words were difficult.
Source: Dale-Chall readability formula
We used the notion that a word is easy if its score lies in the 80th percentile and difficult if it lies in the 20th percentile (lower scores mean difficult words, higher scores mean easy ones). But in order to classify words based on percentiles, we had to make sure that the scores from our database follow a Gaussian distribution, which they did not:
A log feature transformation would not work due to the presence of scores of 0, and the Box-Cox feature transformation requires the scores to be strictly positive. The final choice was a Box-Cox transformation applied after shifting the scores by 0.1 (the command would look something like boxcox(df_word_diff['score'] + 0.1)); the distribution would then look like this:
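A sketch of the shift-then-Box-Cox step with SciPy (the scores below are toy values; the shift of 0.1 is as in the text):

```python
import numpy as np
from scipy.stats import boxcox

# Toy difficulty scores containing zeros, which rule out a plain log transform.
scores = np.array([0.0, 0.01, 0.02, 0.05, 0.1, 0.3, 0.32, 0.5])

# Shift by 0.1 so every value is strictly positive, as Box-Cox requires,
# then let SciPy estimate the lambda parameter by maximum likelihood.
transformed, lam = boxcox(scores + 0.1)
```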
Once this is done, a list of difficult words is created from the scores <= the 20th percentile. The final formula works well with sentences, but not so well with single words. For vocabulary exercises it can only yield a score of 101 or 1: 101 when the word is in the difficult-word list, 1 otherwise, simply as a consequence of the formula.
For example, "polyglot" is a difficult word, so per_diff_words = 100 and avg_sentence_length = 1, giving 100 + 1 = 101.
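The percentile cutoffs themselves can be sketched with the standard-library statistics module (toy scores below; the real pipeline would use the transformed database scores):

```python
import statistics

# Toy transformed scores, sorted here for readability; lower = harder.
scores = [0.05, 0.10, 0.12, 0.30, 0.40, 0.55, 0.70, 0.80, 0.90, 0.95]

# quantiles(n=10) returns the nine decile cut points;
# index 1 is the 20th percentile, index 7 the 80th.
deciles = statistics.quantiles(scores, n=10)
hard_cutoff, easy_cutoff = deciles[1], deciles[7]

difficult = [s for s in scores if s <= hard_cutoff]
easy = [s for s in scores if s >= easy_cutoff]
```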
The global idea seems like something that can work for other languages as well using the word-frequency approach (however, this still needs to be tested).
Phrase/sentence complexity assessment is a typical NLP task with a dedicated academic literature.
Discussion
A difficulty of the project is its multilingualism.
Input data
For a language L:
Processing
This code should be versioned within this repository as well.
Output
Frequency lists
Frequency lists will be fundamental to your algorithms. This field is known for data of very variable quality, so you may want code that can easily swap input data, as cleaner lists may be found later on the web. Some key projects to create your demonstrators:
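Since cleaner lists may replace the initial data, here is a loader sketch in which the source file, column layout, and delimiter are all parameters (function and parameter names are hypothetical):

```python
import csv

def load_frequency_list(path, word_col=0, freq_col=1, delimiter="\t"):
    """Load a word -> frequency table from a delimited text file.
    Keeping columns and delimiter configurable lets a cleaner list
    be swapped in later without touching the rest of the pipeline."""
    table = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter=delimiter):
            if len(row) > max(word_col, freq_col):
                table[row[word_col]] = float(row[freq_col])
    return table
```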
See also
For discussion:
Progress
AR: @AmrMohamed226 → [input outcome comments here]
HI: @NehaDShakya →
EN: Sarvesh →
JA: @Zamanax →
(Optional:)
FR: Matthieu →
CN: @hugolpz →