Open buqamura opened 1 week ago
Interesting.
It seems it gets broken by this regex: https://github.com/mortii/anki-morphs/blob/cf081b8de6c92673e1f5a4c89fb5a8feec2f2999/ankimorphs/morphemizer.py#L130-L133
it produces this:
expression: أنا بصدّقك وإحنا دايماً منقلكم
word: أنا
word: بصد
word: قك
word: وإحنا
word: دايما
word: منقلكم
but if the regex is just replaced with:
word_list = [word.lower() for word in expression.split()]
then it looks like it works (to me at least)
expression: أنا بصدّقك وإحنا دايماً منقلكم
word: أنا
word: بصدّقك
word: وإحنا
word: دايماً
word: منقلكم
does that look right to you?
Yes that looks like it is working as expected.
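The difference between the two outputs above can be reproduced on a single word. This is only a sketch: it assumes the linked lines tokenize with something equivalent to `\w+` (the actual pattern is not quoted in this thread), and the variable names are made up for illustration. In Python's `re`, `\w` matches letters, digits and the underscore, but not combining marks such as the shadda, so the mark ends up acting as a separator.

```python
import re

# "بصدّقك" spelled out code point by code point; the fourth one,
# U+0651, is the shadda (a combining mark, not a letter).
word = "\u0628\u0635\u062f\u0651\u0642\u0643"

# Assumed stand-in for the add-on's pattern: plain \w+ tokenization.
regex_tokens = re.findall(r"\w+", word)

# Whitespace-only splitting, as in the replacement suggested above.
split_tokens = word.split()

print(regex_tokens)  # two fragments, broken where the shadda was
print(split_tokens)  # the word stays in one piece
```

Note that `str.split()` only splits on whitespace, so any punctuation stays attached to the neighbouring word; that may or may not matter for a given deck.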
Could the regex be modified to extend the: \w (word characters) with \u0610-\u061A\u064B-\u065F (Arabic diacritical marks)?
Absolutely, thanks!
I'll include this in the v3 update, which will probably be released in ~2 weeks.
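For reference, the character-class extension suggested above could look roughly like the following. This is a sketch under assumptions, not the code that will ship in v3: `WORD_PATTERN` and `tokenize` are hypothetical names, and the pattern is simply `\w` plus the two ranges from the suggestion.

```python
import re

# Hypothetical pattern: \w extended with the two Arabic ranges from the
# suggestion (U+0610-U+061A and U+064B-U+065F).
WORD_PATTERN = re.compile(r"[\w\u0610-\u061A\u064B-\u065F]+")


def tokenize(expression: str) -> list[str]:
    """Return lowercased words, keeping Arabic diacritics attached."""
    return [word.lower() for word in WORD_PATTERN.findall(expression)]


# The example sentence from this thread, written with escapes so the
# exact code points are unambiguous.
sentence = (
    "\u0623\u0646\u0627 "                    # أنا
    "\u0628\u0635\u062f\u0651\u0642\u0643 "  # بصدّقك (with shadda)
    "\u0648\u0625\u062d\u0646\u0627 "        # وإحنا
    "\u062f\u0627\u064a\u0645\u0627\u064b "  # دايماً (with tanween al-fatH)
    "\u0645\u0646\u0642\u0644\u0643\u0645"   # منقلكم
)

for token in tokenize(sentence):
    print("word:", token)
```

Unlike whitespace-only splitting, this keeps punctuation such as the Arabic comma (U+060C) as a word divider, since it is neither in `\w` nor in the added ranges.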
thanks to all. this addon is amazing and i'm very grateful for your generosity in creating this.
Great! 🎉 I think this will solve a similar problem with Tamil, which I didn’t have the time to look into. If it is not solved and persists after v3 is released, I will open a new issue.
Describe the bug
When ankimorphs analyzes Arabic sentences that have diacritical marks on them, it splits the word into two separate morphs. It seems like these diacritical characters are treated as "spaces" by the AnkiMorphs: Language w/ Spaces morphemizer.
Steps to reproduce the behavior
(apologies for formatting or other errors; I'm new to this)
I have many sentences in my analyzed fields that have Arabic diacritical markings. For instance, I have the sentence:
When this is divided into morphs, the diacritical characters are treated as word dividers, in this case ّ and ً. The latter sits at the end of its word, so it doesn't disrupt too much, but ّ divides its word in half, resulting in the following list of morphs in the am-unknowns field (I have bolded the now nonsensical word chunks):
This is obviously a problem, as now those two word fragments mean nothing.
Here is a more dire example:
Which results in this with only 2 words left:
Expected behavior
What I had expected would happen is that words would be divided only by spaces and punctuation. So in my second example, it would have resulted in:
This is what I was expecting, knowing that I was just using "spaces" and "collection frequency" since there are not (as far as I could find) any morphemizer or lemma resources for Arabic. So I knew that this would be a crude operation in the first place.
Now that I'm writing this, one possible solution to this would be to "clean" the words of any diacritical marks, which would let words with the same base spelling but different diacritical marks -- or none at all -- be grouped as the same morphs (surface forms in this case, since I don't have a morphological analyzer). That solution would result in the following:
This solution would result in some combinations of words that are spelled with the same letters but have different pronunciations and/or meanings, so it may not be desirable for all users in all cases. Probably best to leave such things up to a morpheme/lemma document that may be created in the future.
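The "cleaning" idea described above could be sketched as follows. This is purely illustrative and not something the add-on is known to do: `HARAKAT` and `strip_harakat` are hypothetical names, and the range U+064B to U+0652 is an assumption chosen because it covers exactly the eight marks named in this report.

```python
import re

# Assumption: U+064B..U+0652 covers the tanween forms, fatHa, Damma,
# kasra, shadda and sukoon, i.e. the eight marks listed in this report.
HARAKAT = re.compile(r"[\u064B-\u0652]")


def strip_harakat(word: str) -> str:
    """Remove the short-vowel/shadda/sukoon marks, keeping the base letters."""
    return HARAKAT.sub("", word)


with_shadda = "\u0628\u0635\u062f\u0651\u0642\u0643"  # بصدّقك
without_marks = strip_harakat(with_shadda)            # بصدقك

# Both spellings collapse to the same surface form after cleaning.
print(without_marks == strip_harakat(without_marks))
```

As noted above, this also merges words that differ only in their marks, so it would presumably need to be optional.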
But it seems like initially those diacritical markings should not be treated as "Spaces" to start with.
The markings that I would want kept as part of words are:
shadda: ّ
fatHa: َ
Damma: ُ
kasra: ِ
tanween al-fatH: ً
tanween aD-Damm: ٌ
tanween al-kasr: ٍ
sukoon: ْ
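To make the marks above easier to refer to in a pattern, the snippet below prints each one's code point, Unicode name and general category using the standard `unicodedata` module. The `MARKS` mapping is just an illustrative helper whose labels follow the transliterations used in this report.

```python
import unicodedata

# Labels follow the report; code points are the standard Arabic harakat.
MARKS = {
    "shadda": "\u0651",
    "fatHa": "\u064e",
    "Damma": "\u064f",
    "kasra": "\u0650",
    "tanween al-fatH": "\u064b",
    "tanween aD-Damm": "\u064c",
    "tanween al-kasr": "\u064d",
    "sukoon": "\u0652",
}

for label, mark in MARKS.items():
    print(
        f"{label}: U+{ord(mark):04X} "
        f"{unicodedata.name(mark)} ({unicodedata.category(mark)})"
    )

# All eight fall inside U+064B..U+065F, one of the ranges proposed
# in the discussion of this issue.
in_suggested_range = all(0x064B <= ord(m) <= 0x065F for m in MARKS.values())
print(in_suggested_range)
```

Every one of these is a nonspacing combining mark (category `Mn`), which is why a letters-and-digits pattern does not treat them as part of a word.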
Any help with this greatly appreciated.
My AnkiMorphs settings
{
  "filters": [
    {
      "extra_highlighted": true,
      "extra_score": true,
      "extra_unknowns": true,
      "extra_unknowns_count": true,
      "field": "Back",
      "modify": true,
      "morph_priority": "Collection frequency",
      "morphemizer_description": "AnkiMorphs: Language w/ Spaces",
      "note_type": "ArabicArabic-AudioFront",
      "read": true,
      "tags": { "exclude": [], "include": [] }
    },
    {
      "extra_highlighted": true,
      "extra_score": true,
      "extra_unknowns": true,
      "extra_unknowns_count": true,
      "field": "Front",
      "modify": true,
      "morph_priority": "Collection frequency",
      "morphemizer_description": "AnkiMorphs: Language w/ Spaces",
      "note_type": "ArabicArabicFrontBackSimple",
      "read": true,
      "tags": { "exclude": [], "include": [] }
    },
    {
      "extra_highlighted": true,
      "extra_score": true,
      "extra_unknowns": true,
      "extra_unknowns_count": true,
      "field": "Text",
      "modify": true,
      "morph_priority": "Collection frequency",
      "morphemizer_description": "AnkiMorphs: Language w/ Spaces",
      "note_type": "Cloze-RTL",
      "read": true,
      "tags": { "exclude": [], "include": [] }
    },
    {
      "extra_highlighted": true,
      "extra_score": true,
      "extra_unknowns": true,
      "extra_unknowns_count": true,
      "field": "Main Arabic Entry",
      "modify": true,
      "morph_priority": "Collection frequency",
      "morphemizer_description": "AnkiMorphs: Language w/ Spaces",
      "note_type": "Arabic Vocabulary",
      "read": true,
      "tags": { "exclude": [], "include": [] }
    },
    {
      "extra_highlighted": true,
      "extra_score": true,
      "extra_unknowns": true,
      "extra_unknowns_count": true,
      "field": "Expression",
      "modify": true,
      "morph_priority": "Collection frequency",
      "morphemizer_description": "AnkiMorphs: Language w/ Spaces",
      "note_type": "subs2srs",
      "read": true,
      "tags": { "exclude": [], "include": [] }
    }
  ],
  "preprocess_ignore_bracket_contents": false,
  "preprocess_ignore_names_morphemizer": false,
  "preprocess_ignore_names_textfile": false,
  "preprocess_ignore_round_bracket_contents": false,
  "preprocess_ignore_slim_round_bracket_contents": false,
  "preprocess_ignore_suspended_cards_content": false,
  "recalc_due_offset": 500000,
  "recalc_interval_for_known": 21,
  "recalc_move_known_new_cards_to_the_end": false,
  "recalc_number_of_morphs_to_offset": 100,
  "recalc_offset_new_cards": false,
  "recalc_on_sync": true,
  "recalc_read_known_morphs_folder": false,
  "recalc_suspend_known_new_cards": false,
  "recalc_toolbar_stats_use_known": false,
  "recalc_toolbar_stats_use_seen": true,
  "recalc_unknowns_field_shows_inflections": true,
  "recalc_unknowns_field_shows_lemmas": false,
  "shortcut_browse_all_same_unknown": "Shift+L",
  "shortcut_browse_ready_same_unknown": "L",
  "shortcut_browse_ready_same_unknown_lemma": "Ctrl+Shift+L",
  "shortcut_generators": "Ctrl+Shift+G",
  "shortcut_known_morphs_exporter": "Ctrl+Shift+E",
  "shortcut_learn_now": "Ctrl+Alt+N",
  "shortcut_recalc": "Ctrl+M",
  "shortcut_set_known_and_skip": "K",
  "shortcut_settings": "Ctrl+O",
  "shortcut_view_morphemes": "Ctrl+Alt+V",
  "skip_only_known_morphs_cards": true,
  "skip_show_num_of_skipped_cards": true,
  "skip_unknown_morph_seen_today_cards": true,
  "tag_known_automatically": "am-known-automatically",
  "tag_known_manually": "am-known-manually",
  "tag_learn_card_now": "am-learn-card-now",
  "tag_not_ready": "am-not-ready",
  "tag_ready": "am-ready"
}
My system