mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0

Arabic diacritical marks splitting single words into two separate morphs #250

Open buqamura opened 1 week ago

buqamura commented 1 week ago

Describe the bug

When ankimorphs analyzes Arabic sentences that have diacritical marks on them, it splits the word into two separate morphs. It seems like these diacritical characters are treated as "spaces" by the AnkiMorphs: Language w/ Spaces morphemizer.

Steps to reproduce the behavior

(apologies for formatting or other errors; I'm new to this)

I have many sentences in my analyzed fields that have Arabic diacritical markings. For instance, I have the sentence:

أنا بصدّقك وإحنا دايماً منقلكم "I believe you and we always say to you..."

When dividing this into morphs, it takes the diacritical characters as word dividers, in this case ّ and ً. In the latter case the mark is at the end of the word, so it doesn't disrupt too much, but ّ divides its word in half, resulting in the following list of morphs in the am-unknowns field (the now-nonsensical word chunks being بصد and قك):

أنا, بصد, دايما, قك, منقلكم, وإحنا

This is obviously a problem, as now those two word fragments mean nothing.

Here is a more dire example:

تضحك ، وتُغَمِّضُ عينيها ، وتَحْمَرُّ وَجنَتاها She laughs, and closes her eyes, and her cheeks redden

Which results in this list, with only two intact words (تضحك and عينيها) remaining:

تاها, تضحك, جن, ح, ر, ض, عينيها, غ, م, و, وت

Expected behavior

What I expected was that words would be divided only by spaces and punctuation. So in my second example, it would have resulted in:

تضحك, وتُغَمِّضُ, عينيها, وتَحْمَرُّ, وَجنَتاها

This is what I was expecting, knowing that I was just using "spaces" and "collection frequency", since there are (as far as I could find) no morphemizer or lemma resources for Arabic. So I knew that this would be a crude operation in the first place.

Now that I'm writing this, one possible solution to this would be to "clean" the words of any diacritical marks, which would let words with the same base spelling but different diacritical marks -- or none at all -- be grouped as the same morphs (surface forms in this case, since I don't have a morphological analyzer). That solution would result in the following:

تضحك، وتغمض، عينيها، وتحمر، وجنتاها

This solution would result in some combinations of words that are spelled with the same letters but have different pronunciations and/or meanings, so it may not be desirable for all users in all cases. Probably best to leave such things up to a morpheme/lemma document that may be created in the future.
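As a minimal sketch of this "cleaning" idea (not part of AnkiMorphs; `strip_diacritics` is a hypothetical helper), the standard-library `unicodedata` module can drop all combining marks, merging differently-voweled spellings into one surface form:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose, drop every nonspacing combining mark (Unicode category
    # "Mn", which covers the Arabic harakat), then recompose what remains.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("وتُغَمِّضُ"))  # وتغمض
print(strip_diacritics("دايماً"))      # دايما
```

Note this strips marks from every script, not just Arabic, which is exactly the lumping-together tradeoff described above.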

But in any case, it seems like those diacritical markings should not be treated as "spaces" in the first place.

The markings that I would want kept as part of words are:

- shadda: ّ
- fatHa: َ
- Damma: ُ
- kasra: ِ
- tanween al-fatH: ً
- tanween aD-Damm: ٌ
- tanween al-kasr: ٍ
- sukoon: ْ
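For reference, all of the marks listed above are Unicode nonspacing combining marks (general category `Mn`) in the U+064B–U+0652 range of the Arabic block, which a quick check against Python's Unicode database confirms:

```python
import unicodedata

# The eight marks listed above, by codepoint
marks = {
    "shadda": "\u0651",
    "fatHa": "\u064E",
    "Damma": "\u064F",
    "kasra": "\u0650",
    "tanween al-fatH": "\u064B",
    "tanween aD-Damm": "\u064C",
    "tanween al-kasr": "\u064D",
    "sukoon": "\u0652",
}

for name, ch in marks.items():
    # Each line prints e.g. "shadda: U+0651 ARABIC SHADDA (Mn)"
    print(f"{name}: U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")
```

All of them fall inside the `\u064B-\u065F` range suggested further down in this thread.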

Any help with this greatly appreciated.

My AnkiMorphs settings

```json
{
  "filters": [
    { "extra_highlighted": true, "extra_score": true, "extra_unknowns": true, "extra_unknowns_count": true, "field": "Back", "modify": true, "morph_priority": "Collection frequency", "morphemizer_description": "AnkiMorphs: Language w/ Spaces", "note_type": "ArabicArabic-AudioFront", "read": true, "tags": { "exclude": [], "include": [] } },
    { "extra_highlighted": true, "extra_score": true, "extra_unknowns": true, "extra_unknowns_count": true, "field": "Front", "modify": true, "morph_priority": "Collection frequency", "morphemizer_description": "AnkiMorphs: Language w/ Spaces", "note_type": "ArabicArabicFrontBackSimple", "read": true, "tags": { "exclude": [], "include": [] } },
    { "extra_highlighted": true, "extra_score": true, "extra_unknowns": true, "extra_unknowns_count": true, "field": "Text", "modify": true, "morph_priority": "Collection frequency", "morphemizer_description": "AnkiMorphs: Language w/ Spaces", "note_type": "Cloze-RTL", "read": true, "tags": { "exclude": [], "include": [] } },
    { "extra_highlighted": true, "extra_score": true, "extra_unknowns": true, "extra_unknowns_count": true, "field": "Main Arabic Entry", "modify": true, "morph_priority": "Collection frequency", "morphemizer_description": "AnkiMorphs: Language w/ Spaces", "note_type": "Arabic Vocabulary", "read": true, "tags": { "exclude": [], "include": [] } },
    { "extra_highlighted": true, "extra_score": true, "extra_unknowns": true, "extra_unknowns_count": true, "field": "Expression", "modify": true, "morph_priority": "Collection frequency", "morphemizer_description": "AnkiMorphs: Language w/ Spaces", "note_type": "subs2srs", "read": true, "tags": { "exclude": [], "include": [] } }
  ],
  "preprocess_ignore_bracket_contents": false,
  "preprocess_ignore_names_morphemizer": false,
  "preprocess_ignore_names_textfile": false,
  "preprocess_ignore_round_bracket_contents": false,
  "preprocess_ignore_slim_round_bracket_contents": false,
  "preprocess_ignore_suspended_cards_content": false,
  "recalc_due_offset": 500000,
  "recalc_interval_for_known": 21,
  "recalc_move_known_new_cards_to_the_end": false,
  "recalc_number_of_morphs_to_offset": 100,
  "recalc_offset_new_cards": false,
  "recalc_on_sync": true,
  "recalc_read_known_morphs_folder": false,
  "recalc_suspend_known_new_cards": false,
  "recalc_toolbar_stats_use_known": false,
  "recalc_toolbar_stats_use_seen": true,
  "recalc_unknowns_field_shows_inflections": true,
  "recalc_unknowns_field_shows_lemmas": false,
  "shortcut_browse_all_same_unknown": "Shift+L",
  "shortcut_browse_ready_same_unknown": "L",
  "shortcut_browse_ready_same_unknown_lemma": "Ctrl+Shift+L",
  "shortcut_generators": "Ctrl+Shift+G",
  "shortcut_known_morphs_exporter": "Ctrl+Shift+E",
  "shortcut_learn_now": "Ctrl+Alt+N",
  "shortcut_recalc": "Ctrl+M",
  "shortcut_set_known_and_skip": "K",
  "shortcut_settings": "Ctrl+O",
  "shortcut_view_morphemes": "Ctrl+Alt+V",
  "skip_only_known_morphs_cards": true,
  "skip_show_num_of_skipped_cards": true,
  "skip_unknown_morph_seen_today_cards": true,
  "tag_known_automatically": "am-known-automatically",
  "tag_known_manually": "am-known-manually",
  "tag_learn_card_now": "am-learn-card-now",
  "tag_not_ready": "am-not-ready",
  "tag_ready": "am-ready"
}
```

My system

mortii commented 1 week ago

Interesting.

It seems it gets broken by this regex: https://github.com/mortii/anki-morphs/blob/cf081b8de6c92673e1f5a4c89fb5a8feec2f2999/ankimorphs/morphemizer.py#L130-L133

It produces this:

```
expression: أنا بصدّقك وإحنا دايماً منقلكم
word: أنا
word: بصد
word: قك
word: وإحنا
word: دايما
word: منقلكم
```

but if the regex is just replaced with:

```python
word_list = [word.lower() for word in expression.split()]
```

then it looks like it works (to me at least):

```
expression: أنا بصدّقك وإحنا دايماً منقلكم
word: أنا
word: بصدّقك
word: وإحنا
word: دايماً
word: منقلكم
```
does that look right to you?

buqamura commented 1 week ago

Yes that looks like it is working as expected.

Rct567 commented 1 week ago

Could the regex be modified to extend `\w` (word characters) with `\u0610-\u061A\u064B-\u065F` (Arabic diacritical marks)?

mortii commented 1 week ago

> Could the regex be modified to extend `\w` (word characters) with `\u0610-\u061A\u064B-\u065F` (Arabic diacritical marks)?

Absolutely, thanks!

I'll include this in the v3 update, which will probably be released in ~2 weeks.
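The actual pattern lives in `morphemizer.py` (linked above) and may differ in detail, but as a minimal sketch, assuming a `\w`-based `findall`, the suggested extension can be tried in isolation. Python's `re` treats Arabic letters as word characters but excludes combining marks (category `Mn`) from `\w`, which is exactly why the words split:

```python
import re

expression = "أنا بصدّقك وإحنا دايماً منقلكم"

# Plain \w+ splits at every diacritic, because combining marks
# are not word characters in Python's re module:
broken = re.findall(r"\w+", expression)  # fragments like 'بصد', 'قك'

# Adding the Arabic combining-mark ranges to the class keeps words whole:
fixed = re.findall(r"[\w\u0610-\u061A\u064B-\u065F]+", expression)
print(fixed)  # ['أنا', 'بصدّقك', 'وإحنا', 'دايماً', 'منقلكم']
```

The `fixed` output matches the word list that `expression.split()` produced in the earlier comment, while still stripping punctuation the way a `\w`-based scan does.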

buqamura commented 1 week ago

Thanks to all. This addon is amazing, and I'm very grateful for your generosity in creating it.

rwmpelstilzchen commented 5 days ago

Great! 🎉 I think this will solve a similar problem with Tamil, which I didn't have the time to look into. If it is not solved and the problem persists after v3 is released, I will open a new issue.