Japanese verbs are parsed incorrectly, cutting off after the character っ

mdraves91 commented 1 month ago

Describe the bug

When AnkiMorphs analyzes Japanese sentences, it splits many conjugated verbs into separate morphs at the character っ.

Steps to reproduce the behavior

Analyze a set of Japanese sentences and you will see many verbs (especially the -te form and -ta form) that are broken into two morphs on the character っ.

For example, the sentence:

あたりが静まりかえった。

is parsed with the verb 静まりかえった split into two morphs, 静まりかえっ and た, when it should be just one. This results in a lot of cards where the am-unknowns field has a verb that cuts off at っ.

In my deck of 1383 sentences, about 168 have a verb that is broken like this in the am-unknowns field. (I searched for them with tag:am-* (am-unknowns:*っ,* OR am-unknowns:*っ). There are a small handful of false positives where a valid word does end in っ.)
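
For anyone who wants to run the same check outside the Anki browser, here is a minimal sketch of that search in Python. It assumes the am-unknowns field is a comma-separated list of morphs, which is what the search pattern above implies. The function name and the sample field values are hypothetical and not taken from AnkiMorphs or from my deck.

```python
def has_truncated_morph(am_unknowns: str) -> bool:
    """Return True if any comma-separated morph in the field ends in っ."""
    morphs = [m.strip() for m in am_unknowns.split(",")]
    return any(m.endswith("っ") for m in morphs if m)


# Hypothetical field values, for illustration only.
fields = ["静まりかえっ", "走っ, 猫", "食べる", "あっ"]
flagged = [f for f in fields if has_truncated_morph(f)]
```

Like the browser search, this flags valid words that really do end in っ (the interjection あっ in the sample), so the matches still need a manual check.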

Expected behavior

Verbs should parse as just one morph instead of two. Other tools, such as the jisho.org dictionary, parse the verb correctly: jisho.org/あたりが静まりかえった。
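
To make the expected result concrete, here is a rough sketch of a post-processing step that rejoins a morph ending in っ with a following た or て. This is only an illustration of the output I would expect, not how AnkiMorphs or any morphemizer works. The function name, the suffix set, and the token list are assumptions: the report above only confirms the 静まりかえっ / た split, so the other tokens are a guess.

```python
# Suffixes that commonly follow a stem ending in っ (assumed, not exhaustive).
SUFFIXES = {"た", "て"}


def merge_sokuon_splits(morphs: list[str]) -> list[str]:
    """Rejoin a morph ending in っ with an immediately following た or て."""
    merged: list[str] = []
    for morph in morphs:
        if merged and merged[-1].endswith("っ") and morph in SUFFIXES:
            # Glue the suffix back onto the truncated stem.
            merged[-1] += morph
        else:
            merged.append(morph)
    return merged


# Assumed tokenization of the example sentence, for illustration only.
tokens = ["あたり", "が", "静まりかえっ", "た", "。"]
result = merge_sokuon_splits(tokens)
# result == ["あたり", "が", "静まりかえった", "。"]
```

A rule this simple would also glue together any unrelated word ending in っ that happens to be followed by た or て, so it is a sketch of the desired output rather than a proposed fix.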

My AnkiMorphs settings

{
    "algorithm_all_morphs_target_distance": 1,
    "algorithm_average_priority_all_morphs": 0,
    "algorithm_inflection_priority": false,
    "algorithm_learning_morphs_target_distance": 5,
    "algorithm_lemma_priority": true,
    "algorithm_lower_target_all_morphs": 4,
    "algorithm_lower_target_all_morphs_coefficient_a": 0,
    "algorithm_lower_target_all_morphs_coefficient_b": 1,
    "algorithm_lower_target_all_morphs_coefficient_c": 0,
    "algorithm_lower_target_learning_morphs": 1,
    "algorithm_lower_target_learning_morphs_coefficient_a": 1,
    "algorithm_lower_target_learning_morphs_coefficient_b": 0,
    "algorithm_lower_target_learning_morphs_coefficient_c": 0,
    "algorithm_total_priority_all_morphs": 1,
    "algorithm_total_priority_unknown_morphs": 10,
    "algorithm_upper_target_all_morphs": 6,
    "algorithm_upper_target_all_morphs_coefficient_a": 1,
    "algorithm_upper_target_all_morphs_coefficient_b": 0,
    "algorithm_upper_target_all_morphs_coefficient_c": 0,
    "algorithm_upper_target_learning_morphs": 2,
    "algorithm_upper_target_learning_morphs_coefficient_a": 1,
    "algorithm_upper_target_learning_morphs_coefficient_b": 0,
    "algorithm_upper_target_learning_morphs_coefficient_c": 0,
    "filters": [
        {
            "extra_highlighted": false,
            "extra_score": false,
            "extra_score_terms": false,
            "extra_unknowns": false,
            "extra_unknowns_count": false,
            "field": "Expression",
            "modify": false,
            "morph_priority": "Collection frequency",
            "morphemizer_description": "AnkiMorphs: Japanese",
            "note_type": "Japanese (recognition)",
            "read": true,
            "tags": {
                "exclude": [],
                "include": []
            }
        },
        {
            "extra_highlighted": false,
            "extra_score": false,
            "extra_score_terms": false,
            "extra_unknowns": false,
            "extra_unknowns_count": false,
            "field": "Japanese",
            "modify": false,
            "morph_priority": "Collection frequency",
            "morphemizer_description": "AnkiMorphs: Japanese",
            "note_type": "A Dictionary of Japanese Grammar",
            "read": true,
            "tags": {
                "exclude": [],
                "include": []
            }
        },
        {
            "extra_highlighted": true,
            "extra_score": false,
            "extra_score_terms": false,
            "extra_unknowns": true,
            "extra_unknowns_count": true,
            "field": "Expression",
            "modify": true,
            "morph_priority": "Collection frequency",
            "morphemizer_description": "AnkiMorphs: Japanese",
            "note_type": "Japanese Morphman Audio Ankiweb",
            "read": true,
            "tags": {
                "exclude": [],
                "include": []
            }
        },
        {
            "extra_highlighted": true,
            "extra_score": false,
            "extra_score_terms": false,
            "extra_unknowns": true,
            "extra_unknowns_count": true,
            "field": "Reading",
            "modify": true,
            "morph_priority": "Collection frequency",
            "morphemizer_description": "AnkiMorphs: Japanese",
            "note_type": "Japanese Morphman",
            "read": true,
            "tags": {
                "exclude": [],
                "include": []
            }
        }
    ],
    "preprocess_ignore_bracket_contents": true,
    "preprocess_ignore_names_morphemizer": false,
    "preprocess_ignore_names_textfile": true,
    "preprocess_ignore_round_bracket_contents": false,
    "preprocess_ignore_slim_round_bracket_contents": false,
    "preprocess_ignore_suspended_cards_content": false,
    "recalc_due_offset": 500000,
    "recalc_interval_for_known": 21,
    "recalc_move_known_new_cards_to_the_end": false,
    "recalc_number_of_morphs_to_offset": 100,
    "recalc_offset_new_cards": false,
    "recalc_on_sync": false,
    "recalc_read_known_morphs_folder": false,
    "recalc_suspend_known_new_cards": true,
    "recalc_toolbar_stats_use_known": false,
    "recalc_toolbar_stats_use_seen": true,
    "recalc_unknowns_field_shows_inflections": true,
    "recalc_unknowns_field_shows_lemmas": false,
    "shortcut_browse_all_same_unknown": "Shift+L",
    "shortcut_browse_ready_same_unknown": "L",
    "shortcut_browse_ready_same_unknown_lemma": "Ctrl+Shift+L",
    "shortcut_generators": "Ctrl+Shift+G",
    "shortcut_known_morphs_exporter": "Ctrl+Shift+E",
    "shortcut_learn_now": "Ctrl+Alt+N",
    "shortcut_recalc": "Ctrl+M",
    "shortcut_set_known_and_skip": "K",
    "shortcut_settings": "Ctrl+O",
    "shortcut_view_morphemes": "Ctrl+Alt+V",
    "skip_only_known_morphs_cards": true,
    "skip_show_num_of_skipped_cards": true,
    "skip_unknown_morph_seen_today_cards": true,
    "tag_known_automatically": "am-known-automatically",
    "tag_known_manually": "am-known-manually",
    "tag_learn_card_now": "am-learn-card-now",
    "tag_not_ready": "am-not-ready",
    "tag_ready": "am-ready"
}

My system

Operating System: Windows 10
Anki Version: 24.04
AnkiMorphs Version: ankimorphs-v3-0-0-testing-3

Additional context

I saw this behavior on the stable version of AnkiMorphs as well. I also tried both the MeCab and spaCy morphemizers.

I can provide more examples or upload my deck if that would help.

mortii commented 1 month ago

Thanks for the detailed report!

'た' is technically a morph in this context, so the split is not incorrect per se; the problem is rather that the morph splitting is inconsistent with other verbs.

> I also tried both the MeCab and spaCy morphemizers

I suspect that this is an issue with the morphemizers themselves (which I have no control over), not with how the text is being read from Anki, but I'll check once I'm done with v3 :+1:

mortii commented 1 month ago

I just tested this on the morphemizers directly without using Anki, and it still gives the same result, so this is an upstream problem, sorry :/
