hugolpz / WeSpeaxExos

Exercises per language

Phrase's complexity assessment #2

Open hugolpz opened 1 year ago

hugolpz commented 1 year ago

Phrase/sentence complexity assessment is a typical NLP task with a dedicated academic literature.

Discussion

One difficulty of this project is its multilingualism.

Input data

For a language L:

Processing

This code should be versioned within this repository as well.

Output

Frequency lists

Frequency lists will be fundamental to your algorithms. This field is known for data of very variable quality, so you may want code that can easily swap its input data, since cleaner lists may be found later on the web. Some key projects to create your demonstrators:

See also

For discussion:

Progress

(Optional:)

hugolpz commented 1 year ago

NEHA, AMR

Sarveshmeenwa commented 1 year ago

Week 1

Next steps:

References:

Week 2

Reevaluating the formula

After discussion, the Flesch Reading Ease formula was deemed too simple and unsuitable for our needs. A sentence such as "The nucleus wall is thin." will be judged as simple by Flesch because the average word length and sentence length are short, and the syntax is indeed very easy. However, the formula does not take into account that "nucleus" is a rare, technical word learned in science class, so most people won't have a clue what it means. The Flesch formula assumes the reader understands the individual words, which is precisely the weak point of our exercises and our language-learning participants. Given its mathematical form, the Flesch formula works well for native speakers, since it assumes all basic components (words, syllables) are known to the reader. In our case, it can still be used to assess syntactic complexity, but because our target users are non-native speakers, we will need to add a lexical assessment based on word frequency, which correlates strongly with the learning path. The number of syllables per word is one measure of lexical complexity, but it is not enough for non-native speakers with a strongly limited vocabulary.
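For reference, a minimal sketch of the Flesch Reading Ease computation (the syllable counter below is a naive vowel-group approximation, not the formula's official syllable-counting rules), which shows why the example sentence is rated as easy:

```python
import re

def count_syllables(word):
    # Naive approximation: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The nucleus wall is thin."))
# ~100 with this naive counter: "very easy", despite the rare word "nucleus"
```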

Gunning Fog

There are other metrics, such as Gunning Fog:

Gunning fog index = 0.4 × [ (words / sentences) + 100 × (complex words / words) ], where complex words are those with three or more syllables.

Source: https://en.wikipedia.org/wiki/Gunning_fog_index
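A minimal sketch of the original Gunning Fog computation (same naive syllable counter as above), kept here for comparison with the adjusted version that follows:

```python
import re

def count_syllables(word):
    # Naive approximation: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    # Fog index = 0.4 * (words/sentences + 100 * complex_words/words),
    # where a "complex" word has three or more syllables (original definition).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))

print(gunning_fog("The nucleus wall is thin."))  # low index: the sentence is syntactically simple
```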

Adjusted

However, we decided to make some adjustments to the formula, i.e., we implemented our own definition of complex words using word frequency instead of the original syllable-based approach. After some further research, word frequency can be used as a proxy for the difficulty level of a word. Modern text-assessment services like the Twinword API use word-frequency-based approaches. Why is that?

“Research shows that the frequency of a word might be correlated with its difficulty. The observation that words occurring less often in texts can be considered more difficult is one of the best and most widely used methods of estimating a word's difficulty. Such observations led to the creation of statistical word assessment, defining difficulty as a measure of how often the word occurs in a given domain or in everyday life, meaning how easy it is to be seen. If a word is often seen in text, then it is treated as easy for most of the population, and vice versa.” Source: Jagoda & Boiński (2019), https://www.researchgate.net/publication/322996917_Assessing_Word_Difficulty_for_Quiz-Like_Game
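To illustrate the frequency-as-a-proxy idea, here is a minimal sketch using the wordfreq package as one possible source of multilingual frequency data (an assumption: it is not necessarily the frequency list this project will ship with):

```python
# pip install wordfreq
from wordfreq import zipf_frequency

# Zipf scale: roughly 7 for the most common words, down to ~1 for very rare ones.
for word in ["the", "wall", "nucleus", "polyglot"]:
    print(word, zipf_frequency(word, "en"))

# The same call works for other languages, e.g. zipf_frequency("noyau", "fr"),
# which matters for a multilingual project.
```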

Jagoda & Boiński (2019) implemented three algorithms for assessing word difficulty.

Algorithm 3 returns the best results; however, it is not ideal for our multilingual app, since it combines Algorithm 2 with the Flesch formula. The Flesch formula is mostly used for English, and hence its weights are tuned for English. The Python Textstat library does provide adjusted weights for French (and other languages as well) for Flesch Reading Ease, and the Osman metric for Arabic (an adaptation of the Flesch and Fog formulas that introduces a new factor called "Faseeh"). But this still means we would not be able to use it for Chinese, Japanese, Hindi, and other future languages that we might implement.
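For illustration, a minimal sketch of how Textstat's language-adjusted weights are selected (shown only to make the per-language support concrete, not as our final pipeline):

```python
# pip install textstat
import textstat

textstat.set_lang("en")
print(textstat.flesch_reading_ease("The nucleus wall is thin."))

textstat.set_lang("fr")  # switch to the French-adjusted Flesch weights
print(textstat.flesch_reading_ease("Le mur du noyau est mince."))
```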

Algorithm 2 seems a better fit for our context and approach, since it uses word length and word frequency. We decided to create an initial difficulty level using Algorithm 2 (see the pseudocode figure in Jagoda & Boiński, 2019).

The algorithm's general premise is that, with this approach, difficult words can be defined as those that are both uncommon and lengthy, while simple words are more often short or frequently used. The final set of words W is once more ordered from the easiest word to the most difficult one. The score is determined by multiplying the word's length by the number of times it appears in the text; according to this method, the word with the lowest score is the one that combines length and rarity, and of two words with similar frequency, the longer one is considered more challenging.
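A minimal sketch of this scoring idea (hedged: the exact Algorithm 2 is given in the paper's figure, which is not reproduced here; and since this project's input is frequency lists and the later percentile step scores a whole word database, the sketch uses corpus frequencies, with numbers invented purely for illustration):

```python
# Hypothetical per-million frequencies, invented for illustration only.
freq = {"the": 56000, "is": 46000, "wall": 180, "thin": 90, "nucleus": 12, "polyglot": 1}

# Score = word length * frequency; the lower the score, the rarer (and harder) the word.
scores = {word: len(word) * f for word, f in freq.items()}
for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(word, score)
# "polyglot" and "nucleus" get the lowest scores, i.e. they are flagged as the most difficult.
```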

Adapting Algorithm 2

Further adaptations and assumptions were made:

Final difficulty = avg sentence length + percentage of complex words
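A minimal sketch of this adjusted score (the difficult-word set passed in is a placeholder; the real list comes from the frequency scores and the percentile cutoff described further down):

```python
import re

def adjusted_difficulty(text, difficult_words):
    # Final difficulty = average sentence length (in words)
    #                  + percentage of complex words,
    # where "complex" means the word is in the frequency-derived difficult-word list.
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_sentence_length = len(words) / len(sentences)
    pct_difficult = 100 * sum(w in difficult_words for w in words) / len(words)
    return avg_sentence_length + pct_difficult

difficult = {"nucleus", "polyglot"}                                 # toy list
print(adjusted_difficulty("The nucleus wall is thin.", difficult))  # 5 + 20 = 25
print(adjusted_difficulty("polyglot", difficult))                   # 1 + 100 = 101 (the single-word edge case discussed below)
```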

Further discussion

Here lies another issue: even if we have a table of words and their associated difficulties, what cutoff score decides whether a word is easy or difficult? Potential state-of-the-art solutions could use unsupervised machine-learning clustering techniques to identify clusters and therefore groups of words. For a simple easy/hard split, however, we were inspired by the ideas of Edgar Dale and Jeanne Chall, who instead used a list of 763 words that 80% of fourth-grade students were familiar with (such as "no", "yes", and other very basic words) to determine which words were difficult. Source: Dale-Chall readability formula. We used the notion that a word is easy if its score lies in the 80th percentile and difficult if it lies in the 20th percentile (the lower scores refer to difficult words and the higher scores to easy words). But in order to classify words based on percentiles, we had to make sure that the scores from our database follow a Gaussian distribution, which they did not: [histogram of the raw score distribution]

A log feature transformation would not work due to the presence of scores of 0, and the Box-Cox feature transformation requires the scores to be positive. The final choice was a Box-Cox transformation applied after shifting the scores by 0.1 (the command looks something like boxcox(df_word_diff['score'] + 0.1)); the distribution then looks like this: [histogram of the transformed score distribution]
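A minimal sketch of that transformation and cutoff step with SciPy (the DataFrame is a toy stand-in; only the score column name follows the command quoted above):

```python
# pip install numpy pandas scipy
import numpy as np
import pandas as pd
from scipy.stats import boxcox

# Toy stand-in for the real word/score database.
df_word_diff = pd.DataFrame({
    "word":  ["the", "is", "wall", "thin", "nucleus", "polyglot"],
    "score": [168000.0, 92000.0, 720.0, 360.0, 84.0, 0.0],
})

# log fails on 0 and Box-Cox needs strictly positive input, hence the +0.1 shift.
df_word_diff["score_bc"], lmbda = boxcox(df_word_diff["score"] + 0.1)

# Lowest scores = most difficult: keep everything at or below the 20th percentile.
cutoff = np.percentile(df_word_diff["score_bc"], 20)
difficult_words = set(df_word_diff.loc[df_word_diff["score_bc"] <= cutoff, "word"])
print(difficult_words)  # e.g. {'nucleus', 'polyglot'} with this toy data
```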

Once this is done, a list of difficult words is created from the scores that are <= the 20th percentile. The final formula works well with sentences, but not so well with single words. For vocabulary exercises, for example, it can only give a score of 101 or 1: 101 for difficult words (those that exist in the difficult-word list) and 1 otherwise, simply because of the formula.

For example, "polyglot" is a difficult word, so according to the formula per_diff_words = 100 and avg_sentence_length = 1, giving 100 + 1 = 101.

The global idea seems like something that can work for other languages as well, using the word-frequency approach (however, this still needs to be tested).