mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0

morphs not splitting correctly #125

Closed mortii closed 7 months ago

mortii commented 7 months ago

"Ahnung.Ich" should not be considered a morpheme

Originally posted by @cocowash in https://github.com/mortii/anki-morphs/issues/124#issuecomment-1890433389

mortii commented 7 months ago

spaCy does not split text on periods:

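import spacy

# Assumption: a German spaCy pipeline; the model used in the thread is not stated.
nlp = spacy.load("de_core_news_sm")

expression = "Keine Ahnung.Ich".lower()
doc = nlp(expression)

for w in doc:
    print(f"w.text: {w.text}")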

# output:
# w.text: keine
# w.text: ahnung.ich

This is so that text like "10 a.m." does not get split into -> [10, a, m]

If there is a newline or whitespace after the period, then it works:

expression = "Keine Ahnung.\nIch".lower()

# output:
# w.text: keine
# w.text: ahnung
# w.text: .
# w.text: ich

So this basically means that malformed text will produce the wrong morphs, and unfortunately there is not a whole lot we can do about it...

Vilhelm-Ian commented 7 months ago

~~We can just add instructions to the guide for people to use a regex like this: [image] (there is a space after the dot in the second field)~~

Nope, lookaround/lookbehind is not supported.

A potential solution is to replace all dots with dot+space. Existing dot+space will then turn into dot+space+space, which I don't think is really that bad.

Actually, after that they can run a second regex that replaces every space+space with a single space.
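
The two passes described above can be sketched in Python as follows. This is only an illustration of the idea, not Anki's find-and-replace itself: the function name `space_after_dots` is hypothetical, and the patterns would need to be adapted to whatever regex syntax the tool in question accepts.

```python
import re


def space_after_dots(text: str) -> str:
    """Hypothetical helper: two-pass cleanup that needs no lookaround."""
    # Pass 1: replace every dot with dot + space.
    spaced = re.sub(r"\.", ". ", text)
    # Pass 2: collapse any run of two or more spaces into one space.
    return re.sub(r" {2,}", " ", spaced)


print(repr(space_after_dots("Keine Ahnung.Ich")))   # 'Keine Ahnung. Ich'
print(repr(space_after_dots("Keine Ahnung. Ich")))  # 'Keine Ahnung. Ich'
print(repr(space_after_dots("10 a.m.")))            # '10 a. m. '
```

Note the trade-off in the last line: abbreviations such as "10 a.m." also get a space after each dot (plus a trailing space), which is exactly the kind of text the tokenizer behavior described earlier is meant to keep together.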

mortii commented 7 months ago

@Vilhelm-Ian Interesting! I've never used the find and replace feature in Anki, but it definitely seems like it could be useful here.

Vilhelm-Ian commented 7 months ago

Once I told HQ how to remove all the HTML from his cards that was added by the highlight feature with regex. Couldn't find the thread.

mortii commented 7 months ago

> Once I told HQ how to remove all the HTML from his cards that was added by the highlight feature with regex. Couldn't find the thread.

Ah yes, I remember.

Issue #124 and this one have now converged, so I'll close this one.
