Bug processing sentences that have punctuation with jieba

mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.

https://mortii.github.io/anki-morphs/

Mozilla Public License 2.0

47 stars 6 forks source link

Bug processing sentences that have punctuation with jieba #219

Closed xofm31 closed 2 months ago

xofm31 commented 2 months ago

Describe the bug

Sentences with punctuation are not processed correctly when using jieba morphemizer.

I haven't done a careful evaluation of all of the situations that create the problem, but I have noticed that when there is a comma or a hyphen, characters across these can get put into the same morpheme.

Here is an example sentence: 一，二，三，跳！

Using spacey, it correctly identifies 4 morphemes. Using jieba, I get am-unknowns "一二三", and am-highlighted is "一，二，三，跳"

Morphman also gives a morph of "一二三".

Here are some other problematic sentences: 桑…桑稚 -五 -六 -喂？ -喂？

But it doesn't happen all the time, here is one that looks right: -报名 -报名

If you're not able to duplicate the issue, I can do more investigation.

Operating System: MacOs
Anki Version: ⁨24.04
AnkiMorphs Version: 2.2.2

mortii commented 2 months ago

Oh, that's interesting. Do you get the same problem in morphman?

Edit: Whoops, I didn't read your comment carefully enough. We might have to resort to some hacky thing like we do in the mecab wrapper: https://github.com/mortii/anki-morphs/blob/d5f82a824d7559f40620da424fc29430e23d16a9/ankimorphs/mecab_wrapper.py#L117-L120

I'll investigate after I finish #208.

xofm31 commented 2 months ago

Thanks for the pointer so I knew where to start looking. The IDEOGRAPHS list gets rid of all punctuation. I've added in the punctuation that I think would be needed (the CJK version, the ascii version, and the double-wide of the English punctuation - I see them all the transcripts that I have).

https://github.com/xofm31/anki-morphs/blob/5ec8d5ddf61d490ec40b78ef144e5f250541879d/ankimorphs/jieba_wrapper.py#L19-L32

The new problem that I created is that it now thinks the punctuation is a morph:

<span morph-status="known">一</span>
<span morph-status="known">，</span>
<span morph-status="known">二</span>
<span morph-status="known">，</span>
<span morph-status="known">三</span>
<span morph-status="known">，</span>
<span morph-status="known">跳</span>
<span morph-status="known">！</span>

I suppose for me it might not be that big of a problem, because I've probably already "learned" all the punctuation, but it does seem wrong, and it will make the morph counts off. I'm not sure what the best way to fix that is - maybe you've already encountered this with other morphemizers, so maybe you can easily fix this.

mortii commented 2 months ago

I think the filtering of the expression happens too soon in the current code: https://github.com/mortii/anki-morphs/blob/e52dee404e1852d1acbb7a68ae67908bd7657ef3/ankimorphs/morphemizer.py#L168-L185

It should be something like this instead:

    def _get_morphemes_from_expr(self, expression: str) -> list[Morpheme]:
        assert jieba_wrapper.posseg is not None
        expression_morphs: list[Morpheme] = []

        for jieba_segment_pair in jieba_wrapper.posseg.cut(expression):
            # posseg.Pair:
            #   Pair.word
            #   Pair.flag

            print(f"jieba_segment_pair.word: {jieba_segment_pair.word}")

            found_cjk_ideographs: str = "".join(
                re.findall(
                    f"[{jieba_wrapper.CJK_IDEOGRAPHS}]",
                    jieba_segment_pair.word,
                )
            )

            if len(found_cjk_ideographs) != len(jieba_segment_pair.word):
                print("contains non-cjk-ideographs")
                continue
            else:
                print("valid cjk-ideographs")

            # chinese does not have inflections, so we use the lemma for both
            _morph = Morpheme(
                lemma=jieba_segment_pair.word, inflection=jieba_segment_pair.word
            )
            expression_morphs.append(_morph)

        return expression_morphs

could you try that instead?

The filtering technique is horribly inefficient and we should use something other than re.findall.

xofm31 commented 2 months ago

I can try it, but I will need a better setup for debugging. So far, I've only done things that I don't need to debug very carefully, so I've been loading Anki to run the code. I see that you have print statements, which I don't think I'd see if I just run Anki. Is there a way to run the code either on the command line or using an IDE?

mortii commented 2 months ago

@xofm31 wow, that is shocking given how much complicated stuff you have worked on. If you open anki from a terminal you can see print statements there. Here is how you do it on macos: https://addon-docs.ankiweb.net/console-output.html#macos

EDIT: I'll add a section about it in the setup guide since it's such a crucial, yet non obvious debugging tool. The fact that you managed to do anything without print statements is crazy impressive.

EDIT 2: devleoper debugging section

mortii commented 2 months ago

@xofm31 I pushed a commit (https://github.com/mortii/anki-morphs/commit/b41a50ae7d0825a228d916420bcb6a531c4a48ee) to the jieba-bug branch, could you test it?

xofm31 commented 2 months ago

Yes, this looks fixed. Thank you!

mortii commented 2 months ago

Released now. It wouldn't shock me if this fix introduces other problems, so let me know if you find any.

Thanks!

mortii commented 2 months ago

added a section about redirecting the terminal output to a file, which is extremely useful for debugging something like the algorithm that produces a lot of text/data: https://mortii.github.io/anki-morphs/developer_guide/debugging.html

github-actions[bot] commented 2 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.