Closed xofm31 closed 2 months ago
Oh, that's interesting. Do you get the same problem in morphman?
Edit: Whoops, I didn't read your comment carefully enough. We might have to resort to some hacky thing like we do in the mecab wrapper: https://github.com/mortii/anki-morphs/blob/d5f82a824d7559f40620da424fc29430e23d16a9/ankimorphs/mecab_wrapper.py#L117-L120
I'll investigate after I finish #208.
Thanks for the pointer; it showed me where to start looking. The IDEOGRAPHS list gets rid of all punctuation. I've added in the punctuation that I think would be needed (the CJK version, the ASCII version, and the full-width version of the English punctuation; I see all of them in the transcripts that I have).
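For reference, the three punctuation variants mentioned above really are distinct Unicode codepoints, which is why each one has to be listed explicitly. A stdlib-only illustration (not AnkiMorphs code):

```python
# Illustrative only: the ASCII, full-width, and CJK commas are three
# different codepoints, so a punctuation filter must include all of them.
ascii_comma = ","       # U+002C, plain ASCII
fullwidth_comma = "，"  # U+FF0C, full-width ("double-wide") form
cjk_comma = "、"        # U+3001, the CJK enumeration comma

for ch in (ascii_comma, fullwidth_comma, cjk_comma):
    print(f"{ch!r} -> U+{ord(ch):04X}")
```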
The new problem that I created is that it now thinks the punctuation is a morph:
<span morph-status="known">一</span>
<span morph-status="known">,</span>
<span morph-status="known">二</span>
<span morph-status="known">,</span>
<span morph-status="known">三</span>
<span morph-status="known">,</span>
<span morph-status="known">跳</span>
<span morph-status="known">!</span>
I suppose for me it might not be that big of a problem, since I've probably already "learned" all the punctuation, but it does seem wrong, and it will throw off the morph counts. I'm not sure of the best way to fix it; perhaps you've already encountered this with other morphemizers and can fix it easily.
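One generic way to keep punctuation tokens out of the morph list (a sketch only, not necessarily the fix AnkiMorphs shipped; `is_punctuation_token` is a hypothetical helper) is to check each token's Unicode categories after segmentation and drop tokens made entirely of punctuation or symbols:

```python
import unicodedata

def is_punctuation_token(token: str) -> bool:
    # True if every character is Unicode punctuation (category "P*")
    # or a symbol (category "S*"); such tokens should not become morphs.
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in token)

# Tokens as jieba might produce them for 一,二,三,跳!
tokens = ["一", ",", "二", "，", "三", "跳", "!"]
morph_tokens = [t for t in tokens if not is_punctuation_token(t)]
print(morph_tokens)  # expected: ['一', '二', '三', '跳']
```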
I think the filtering of the expression happens too soon in the current code: https://github.com/mortii/anki-morphs/blob/e52dee404e1852d1acbb7a68ae67908bd7657ef3/ankimorphs/morphemizer.py#L168-L185
It should be something like this instead:
```python
def _get_morphemes_from_expr(self, expression: str) -> list[Morpheme]:
    assert jieba_wrapper.posseg is not None
    expression_morphs: list[Morpheme] = []
    for jieba_segment_pair in jieba_wrapper.posseg.cut(expression):
        # posseg.Pair:
        #   Pair.word
        #   Pair.flag
        print(f"jieba_segment_pair.word: {jieba_segment_pair.word}")
        found_cjk_ideographs: str = "".join(
            re.findall(
                f"[{jieba_wrapper.CJK_IDEOGRAPHS}]",
                jieba_segment_pair.word,
            )
        )
        if len(found_cjk_ideographs) != len(jieba_segment_pair.word):
            print("contains non-cjk-ideographs")
            continue
        else:
            print("valid cjk-ideographs")
            # chinese does not have inflections, so we use the lemma for both
            _morph = Morpheme(
                lemma=jieba_segment_pair.word, inflection=jieba_segment_pair.word
            )
            expression_morphs.append(_morph)
    return expression_morphs
```
Could you try that instead?
The filtering technique is horribly inefficient; we should use something other than re.findall.
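As a sketch of what a cheaper check could look like (illustrative only; `is_cjk_ideograph` is a hypothetical helper covering just the main CJK Unified Ideographs block, while the real `CJK_IDEOGRAPHS` constant spans more ranges), a short-circuiting codepoint test avoids building the intermediate string that `"".join(re.findall(...))` creates:

```python
def is_cjk_ideograph(ch: str) -> bool:
    # Covers only the main CJK Unified Ideographs block (U+4E00..U+9FFF);
    # the real CJK_IDEOGRAPHS constant spans more ranges than this sketch.
    return 0x4E00 <= ord(ch) <= 0x9FFF

def is_all_cjk(token: str) -> bool:
    # all() short-circuits on the first non-CJK character, so no
    # intermediate joined string is ever constructed.
    return all(is_cjk_ideograph(ch) for ch in token)

print(is_all_cjk("一二三跳"))  # True
print(is_all_cjk("一,二"))    # False
```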
I can try it, but I will need a better setup for debugging. So far, I've only done things that I don't need to debug very carefully, so I've been loading Anki to run the code. I see that you have print statements, which I don't think I'd see if I just run Anki. Is there a way to run the code either on the command line or using an IDE?
@xofm31 wow, that is shocking given how much complicated stuff you have worked on. If you open Anki from a terminal you can see print statements there. Here is how you do it on macOS: https://addon-docs.ankiweb.net/console-output.html#macos
EDIT: I'll add a section about it in the setup guide since it's such a crucial, yet non-obvious, debugging tool. The fact that you managed to do anything without print statements is crazy impressive.
EDIT 2: developer debugging section
@xofm31 I pushed a commit (https://github.com/mortii/anki-morphs/commit/b41a50ae7d0825a228d916420bcb6a531c4a48ee) to the jieba-bug branch, could you test it?
Yes, this looks fixed. Thank you!
Released now. It wouldn't shock me if this fix introduces other problems, so let me know if you find any.
Thanks!
Added a section about redirecting the terminal output to a file, which is extremely useful when debugging something like the algorithm that produces a lot of text/data: https://mortii.github.io/anki-morphs/developer_guide/debugging.html
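The linked guide covers shell-level redirection; when that is inconvenient, a stdlib-only Python sketch of the same idea (capturing `print()` output to a file) could look like this. Illustrative only, not part of the add-on:

```python
import contextlib

# Send all print() output inside the block to a file, which is handy
# when an algorithm dumps large amounts of debug text.
with open("debug_output.txt", "w", encoding="utf-8") as f:
    with contextlib.redirect_stdout(f):
        for i in range(3):
            print(f"processing morph {i}")

# Read the captured output back to confirm it landed in the file.
with open("debug_output.txt", encoding="utf-8") as f:
    print(f.read(), end="")
```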
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Describe the bug
Sentences with punctuation are not processed correctly when using jieba morphemizer.
I haven't done a careful evaluation of all of the situations that create the problem, but I have noticed that when there is a comma or a hyphen, characters across these can get put into the same morpheme.
Here is an example sentence: 一,二,三,跳!
Using spaCy, it correctly identifies 4 morphemes. Using jieba, am-unknowns is "一二三" and am-highlighted is "一,二,三,跳". Morphman also gives a morph of "一二三".
Here are some other problematic sentences: 桑…桑稚 -五 -六 -喂? -喂?
But it doesn't happen all the time; here is one that looks right: -报名 -报名
If you're not able to duplicate the issue, I can do more investigation.