hugolpz opened 1 year ago
Used the Textstat library in Python to calculate phrase difficulties.
Textstat has many metrics, most tailored for English. However, the Flesch Reading Ease metric supports other languages, as shown below.
Understanding the Flesch Reading Ease scores:
Formula used:
With this library, we expect to be able to calculate the difficulty level of French questions as well.
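For illustration, here is a minimal pure-Python sketch of the English Flesch Reading Ease computation (FRE = 206.835 − 1.015 × words/sentence − 84.6 × syllables/word). The syllable counter below is a crude vowel-group heuristic, not Textstat's; Textstat also adjusts the weights per language.

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """English Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease("The cat sat on the mat. The dog ran fast.")
# Short words and short sentences yield a high (= easy) score.
```

Note how the formula only sees word and sentence lengths, never word rarity; that limitation is discussed further down.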
Most supported languages are Latin-based; however, Textstat does support Arabic via the OSMAN metric.
OSMAN score for text (designed for Arabic; an adaptation of the Flesch and Fog formulas that introduces a new factor called "Faseeh"), based on this academic paper.
With the help of @AmrMohamed226, we can collaborate to do the same for Arabic using the Textstat library.
It is harder to find a unified method for all languages, especially low-resource languages, but this is still a work in progress.
The current demo (code & CSV files uploaded) shows the Flesch Reading Ease score. For English, since many more metrics are available, we could switch to others; the code is ready-made and adaptable to do so. (Note: some pre-processing was required to calculate the English difficulty score, since the raw text data had to be adapted according to its objective: vocabulary, grammar, or verb conjugation.)
Next steps:
References:
After discussion, the Flesch Reading Ease was deemed too simple and unsuitable for our needs, for the following reasons:

A sentence such as "The nucleus wall is thin." will be judged simple by Flesch, since the average word length and sentence length are short. The syntax is indeed very easy, but the formula does not take into account that "nucleus" is a rare, technical word learned in science class, so most people won't have a clue. The Flesch formula assumes understanding of individual words, which is precisely the weak point of our exercises and of our language-learning participants.

Mathematically, the Flesch formula works for native speakers because it assumes all basic components (words, syllables) are known to the reader. In our case, it can be used to assess syntactic complexity; but because our target users are non-native, we need to add a lexical assessment based on frequency, which correlates strongly with the learning path. The number of syllables per word is a measure of lexical complexity, but it is not enough for non-native speakers with a strongly limited vocabulary.
There are other metrics, such as Gunning Fog:
Source: https://en.wikipedia.org/wiki/Gunning_fog_index
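A minimal sketch of the original Gunning Fog index, where "complex" is the classic definition (three or more syllables); the syllable counter is the same crude vowel-group heuristic used above, not a dictionary-based one:

```python
import re

def count_syllables(word):
    # Crude heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Gunning Fog index: 0.4 * (avg sentence length + % of complex words),
    where 'complex' means three or more syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_complex = sum(count_syllables(w) >= 3 for w in words)
    return 0.4 * (len(words) / len(sentences) + 100 * n_complex / len(words))

fog = gunning_fog("The cat sat on the mat.")  # no complex words
```

The 0.4 weight and the syllable-based notion of "complex" are exactly the two parts we end up adjusting below.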
However, we decided to make some adjustments to the formula: we implemented our own definition of complex words using word frequency instead of the original syllable-based approach. After some further research, word frequency can be used as a proxy for the difficulty level of a word. Modern text-assessment services such as the Twinword API use word-frequency approaches. Why is that?
“Research show that frequency of a word might be correlated to its difficulty. The observation that the words that occurs in texts less often can be considered more difficult is one of the best and most widely used methods of estimating a word's difficulty. Such observations lead to creation of statistical word assessment, defining difficulty as a measure how often the word occurs in the given domain or in everyday life, meaning how easy it is to be seen. If the word is often seen in text than it is treated as easy for most of the population and vice versa” Source: Jagoda & Boiński (2019) https://www.researchgate.net/publication/322996917_Assessing_Word_Difficulty_for_Quiz-Like_Game
Jagoda & Boiński (2019) implemented 3 algorithms assessing word difficulty.
Algorithm 3 returns the best results; however, it is not ideal for our multilingual app, since it combines algorithm 2 with the Flesch formula. The Flesch formula is mostly used for English, and hence its weights are tuned for English. The Python Textstat library does provide weights adjusted for French (and other languages) for the Flesch Reading Ease, plus the OSMAN metric for Arabic (an adaptation of the Flesch and Fog formulas that introduces a new factor called "Faseeh"). But it still means that we won't be able to use it for Chinese, Japanese, Hindi, and other future languages that we might implement.
Algorithm 2 seems a better fit for our context and approach by using the length of the word and word frequency. We decided to create an initial difficulty level using Algorithm 2, which goes as follows:
Source: Jagoda & Boiński (2019).
The algorithm's general premise is that difficult words are those that are both uncommon and lengthy, while simple words tend to be short or frequently used. The final set of words W is again sorted from the easiest word to the most difficult. The score is determined by multiplying the word's length by the number of times it appears in the text; under this method, the word with the lowest score is the one that combines length and rarity. If two words have similar frequency, the longer one is considered more challenging.
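As a toy illustration of this premise (our own sketch, not the paper's code), scoring a tiny corpus by length × occurrence count:

```python
from collections import Counter

# Toy corpus; in practice T would be a large reference corpus.
corpus = "the small cell wall is thin and the nucleus is small".split()
counts = Counter(corpus)

# Score each word as its length multiplied by its occurrence count,
# then sort ascending (per the description above, low score = difficult).
scores = {w: len(w) * counts[w] for w in counts}
ranked = sorted(scores, key=scores.get)
```

On this toy corpus, short one-off words like "and" end up ranked as hardest, which hints at the counter-intuitive behaviour addressed by the adaptations below.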
Further adaptations and assumptions were made:
For the calculation of s: for a difficult word, the frequency should be low and the length should be large. According to the original formula, suppose lmax = 18 and that the words 'the' (length 3) and 'international' (length 13) have the same frequency. The relative length would then be 3/18 for 'the' and 13/18 for 'international', so the score would be lower for 'the' and higher for 'international', translating to 'the' being the harder word (a lower score meaning harder). That is counter-intuitive, hence the final formula was changed to: s = (1/lw) * (c/oMax).
Note: we could also use lmax/lw; the goal here is to have an inverse relationship between score and length.
The second change is the use of the Zipf frequency from the Python library wordfreq instead of the raw number of occurrences, simply because it is less memory-intensive: we don't have to read a whole corpus T, since wordfreq can serve as a lookup table.
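Combining the inverted length term with the Zipf frequency gives a minimal sketch of the adapted score s = (1/lw) * (c/oMax). The Zipf values below are hypothetical placeholders; real code would call wordfreq.zipf_frequency(word, 'en') instead of the dictionary lookup, and the normalisation constant is an assumption.

```python
# Hypothetical Zipf values standing in for wordfreq.zipf_frequency(word, "en");
# the real library returns values on a roughly 0-8 scale.
ZIPF = {"the": 7.7, "international": 5.0, "polyglot": 2.9}
ZIPF_MAX = 8.0  # assumed normalisation constant for the frequency term

def adapted_score(word):
    """Adapted Algorithm 2 score: s = (1 / word_length) * (zipf / zipf_max).
    A low score now means long and/or rare, i.e. difficult."""
    return (1 / len(word)) * (ZIPF[word] / ZIPF_MAX)
```

With these placeholder values, 'polyglot' scores below 'international', which scores below 'the', matching the intended ordering.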
Once we have all words sorted according to their difficulty score, we implement an adapted version of the Gunning Fog formula by dropping the 0.4 weight (to be re-evaluated), that is:
Final difficulty = average sentence length + percentage of complex words
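A minimal sketch of this adapted formula; the difficult-word set here is a hypothetical stand-in for the percentile-based list described further down:

```python
import re

# Hypothetical difficult-word list; the real one comes from the word scores.
DIFFICULT = {"nucleus", "polyglot"}

def final_difficulty(text):
    """Adapted Gunning Fog with the 0.4 weight dropped:
    average sentence length + percentage of difficult words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    per_diff_words = 100 * sum(w in DIFFICULT for w in words) / len(words)
    avg_sentence_length = len(words) / len(sentences)
    return avg_sentence_length + per_diff_words

d = final_difficulty("The nucleus wall is thin.")  # 5 + 20 = 25
```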
Here lies another issue: even if we have a table of words and their associated difficulties, what would be the cutoff score to determine whether a word is easy or difficult?
Potential solutions could use unsupervised machine-learning clustering techniques to identify clusters, and therefore groups, of words. But for a simple way to distinguish easy from hard, we were instead inspired by the ideas of Edgar Dale and Jeanne Chall, who used a list of 763 words that 80% of fourth-grade students were familiar with ("no", "yes", and other such very basic words) to determine which words were difficult.
Source: Dale-Chall readability formula
We used the notion that a word is easy if its score lies in the 80th percentile and difficult if it lies in the 20th percentile (lower scores mean difficult words, higher scores mean easy ones). But in order to classify words based on percentiles, we had to make sure that the scores from our database follow a Gaussian distribution, which they did not:
A log feature transformation would not work due to the presence of scores of 0, and the Box-Cox feature transformation requires the scores to be strictly positive. The final choice was a Box-Cox transformation applied after shifting the scores by 0.1 (the command would look something like boxcox(df_word_diff['score'] + 0.1)); the distribution would then look like this:
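A sketch of the shift-then-Box-Cox step with SciPy (the scores below are toy values; the shift of 0.1 is as in the text):

```python
import numpy as np
from scipy.stats import boxcox

# Toy difficulty scores containing zeros, which rule out a plain log transform.
scores = np.array([0.0, 0.01, 0.02, 0.05, 0.1, 0.3, 0.32, 0.5])

# Shift by 0.1 so every value is strictly positive, as Box-Cox requires,
# then let SciPy estimate the lambda parameter by maximum likelihood.
transformed, lam = boxcox(scores + 0.1)
```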
Once this is done, a list of difficult words is created from the scores <= the 20th percentile. The final formula works well with sentences, but not so well with single words. For vocabulary exercises it can only yield a score of 101 or 1: 101 when the word is in the difficult-word list, 1 otherwise, simply as a consequence of the formula.
For example, "polyglot" is a difficult word, so per_diff_words = 100 and avg_sentence_length = 1, giving 100 + 1 = 101.
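The percentile cutoffs themselves can be sketched with the standard-library statistics module (toy scores below; the real pipeline would use the transformed database scores):

```python
import statistics

# Toy transformed scores, sorted here for readability; lower = harder.
scores = [0.05, 0.10, 0.12, 0.30, 0.40, 0.55, 0.70, 0.80, 0.90, 0.95]

# quantiles(n=10) returns the nine decile cut points;
# index 1 is the 20th percentile, index 7 the 80th.
deciles = statistics.quantiles(scores, n=10)
hard_cutoff, easy_cutoff = deciles[1], deciles[7]

difficult = [s for s in scores if s <= hard_cutoff]
easy = [s for s in scores if s >= easy_cutoff]
```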
The global idea seems like something that can work for other languages as well using the word-frequency approach (however, this still needs to be tested).
Phrase/sentence complexity assessment is a typical NLP task with a dedicated academic literature.
Discussion
A difficulty of the project is its multilingualism.
Input data
For a language L:
Processing
This code should be versioned within this repository as well.
Output
Frequency lists
Frequency lists will be fundamental to your algorithms. This field is known for data of very variable quality, so you may want code that can easily swap input data, as cleaner lists may be found later on the web. Some key projects to create your demonstrators:
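Since cleaner lists may replace the initial data, here is a loader sketch in which the source file, column layout, and delimiter are all parameters (function and parameter names are hypothetical):

```python
import csv

def load_frequency_list(path, word_col=0, freq_col=1, delimiter="\t"):
    """Load a word -> frequency table from a delimited text file.
    Keeping columns and delimiter configurable lets a cleaner list
    be swapped in later without touching the rest of the pipeline."""
    table = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter=delimiter):
            if len(row) > max(word_col, freq_col):
                table[row[word_col]] = float(row[freq_col])
    return table
```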
See also
For discussion:
Progress
AR: @AmrMohamed226 → [input outcome comments here]
HI: @NehaDShakya →
EN: Sarvesh →
JA: @Zamanax →
(Optional:)
FR: Matthieu →
CN: @hugolpz →