[BUG] Punctuation marks are considered part of the morph

kaegi / MorphMan

Anki plugin that reorders language cards based on the words you know

Other

[BUG] Punctuation marks are considered part of the morph #309

Closed cocowash closed 1 year ago

cocowash commented 1 year ago

Describe the bug
Punctuation marks are taken as part of the morph and agglutinate multiple words into one morph. It seems to happen only when the last word of a sentence is followed by a line break, e.g.:

Ich denke, er ist da unten.
Und warum gehst du dann nicht runter?
Weil es mir hier unheimlich ist.

Morphs formed: unten.und, runter?weil

Expected behavior
Punctuation marks shouldn't agglutinate two different words into one morph. MorphMan shouldn't identify two words as one morph if a line break is used instead of a space.

Screenshots Captura de pantalla 2023-11-02 153629

Environment Anki version: 23.10 Morphman Qt 6 Alpha 4

Vilhelm-Ian commented 1 year ago

Can you share the card/deck? I also study German and haven't had that issue. Maybe the card is using some special type of question mark.

cocowash commented 1 year ago

> Can you share the card/deck? I also study German and haven't had that issue. Maybe the card is using some special type of question mark.

Sure, you can find it at https://anonfiles.me/qaYl/s01e01.apkg. I would bet it's because of the combination of a punctuation mark and a line break, but feel free to try it yourself.

Vilhelm-Ian commented 1 year ago

I figured out what the problem is. It doesn't have to do with line breaks. It has to do with this regex: "\b[^\s\d]+\b". It fails to split when a non-letter character sits between two words with no whitespace, so it matches Das?Hat and the?locomotive as single words. The solution I want to add to MorphMan would be to use \p{L}, but Python's re module doesn't support it; the third-party regex module does. But I can't figure out how to import it.

https://regex101.com/r/zPvSER/1
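For anyone who wants to reproduce this outside Anki, here is a minimal sketch. It assumes the line breaks on the card leave no whitespace between the sentences by the time the text reaches the regex; that is a guess about the card, not something confirmed in this thread. `LETTERS_ONLY` is a hypothetical standard-library pattern that only approximates `\p{L}+`; it is not MorphMan code.

```python
import re

# Assumed input: the reporter's sentences with the line breaks gone,
# so "unten." and "Und" touch without any whitespace.
text = (
    "Ich denke, er ist da unten."
    "Und warum gehst du dann nicht runter?"
    "Weil es mir hier unheimlich ist."
)

OLD_PATTERN = r"\b[^\s\d]+\b"  # pattern quoted in this thread
LETTERS_ONLY = r"[^\W\d_]+"    # hypothetical stdlib approximation of \p{L}+

old = [w.lower() for w in re.findall(OLD_PATTERN, text)]
new = [w.lower() for w in re.findall(LETTERS_ONLY, text)]

print(old)  # includes the glued tokens "unten.und" and "runter?weil"
print(new)  # punctuation no longer joins neighbouring words
```

`[^\W\d_]` means "a word character that is neither a decimal digit nor an underscore", which is close to "any letter" for ordinary text without needing the third-party regex module.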

Vilhelm-Ian commented 1 year ago

@cocowash a workaround for your problem: go to the card browser and press CTRL+ALT+F; a find-and-replace dialog will pop up. In the Find field type ? and in the Replace field type ? followed by a space (press the space key, don't type the word), then deselect the option for only selected cards.

Vilhelm-Ian commented 1 year ago

> I figured out what the problem is. It doesn't have to do with line breaks. It has to do with this regex: "\b[^\s\d]+\b". It fails to split when a non-letter character sits between two words with no whitespace, so it matches Das?Hat and the?locomotive as single words. The solution I want to add to MorphMan would be to use \p{L}, but Python's re module doesn't support it; the third-party regex module does. But I can't figure out how to import it.

Now that I think about it, since this is easily solvable with find and replace, maybe we shouldn't change the regex, because there are cases where a - is put between two words.
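The hyphen concern can be illustrated with a small sketch. Both patterns below are hypothetical and not taken from MorphMan: the first keeps only runs of letters, the second additionally allows single hyphens inside a word. The sample sentence is made up for the example.

```python
import re

# Made-up sample with a hyphenated compound and glued punctuation.
sample = "Die Nord-Süd-Achse?Ja."

LETTERS_ONLY = r"[^\W\d_]+"
# Letter runs optionally joined by single internal hyphens (illustrative only).
WITH_HYPHENS = r"[^\W\d_]+(?:-[^\W\d_]+)*"

split_all = re.findall(LETTERS_ONLY, sample)
keep_hyphens = re.findall(WITH_HYPHENS, sample)

print(split_all)     # the compound is broken into its parts
print(keep_hyphens)  # the compound survives, "?" still separates words
```

Whether a hyphenated compound should count as one morph or several is a judgment call for the language being studied; the sketch only shows that the two goals do not have to conflict.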

aleksejrs commented 1 year ago

> Type in the find section ? and in the replace section add ? (press the space key don't type )

That makes no sense. Use Markdown backticks.

cocowash commented 1 year ago

> Now that I think about it, since this is easily solvable with find and replace, maybe we shouldn't change the regex, because there are cases where a - is put between two words.

Thanks for the support. Sadly, find and replace won't work in all cases. In some cases the problem is solved; in others the word is added together with the punctuation mark and the space.

Captura de pantalla 2023-11-03 140848

Vilhelm-Ian commented 1 year ago

I am going to ask the maintainer of the new addon whether we can import the regex module.

BTW, where did you find Inuyasha in German? I find it hard to find 90s anime with a German dub.

cocowash commented 1 year ago

> I am going to ask the maintainer of the new addon whether we can import the regex module. BTW, where did you find Inuyasha in German? I find it hard to find 90s anime with a German dub.

Thanks for the support. Sure, if you provide an email or an account where I could send a private message, I could say a little more about Inuyasha.

Vilhelm-Ian commented 1 year ago

mariothrowsfieball@gmail.com

Vilhelm-Ian commented 1 year ago

@cocowash the problem has been fixed. To apply the fix, go to your MorphMan folder, open the file morphemizer.py, and change the line "word.lower() for word in re.findall(r"\b[^\s\d]+\b", expression, re.UNICODE)" to "word.lower() for word in re.findall(r"\w+", expression, re.UNICODE)".
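A quick way to check the replacement line in isolation is the sketch below. `split_words` is a hypothetical wrapper written for this example; only the expression inside it comes from the line quoted above, and the second input is made up.

```python
import re

def split_words(expression):
    """Hypothetical wrapper around the patched line from morphemizer.py."""
    return [word.lower() for word in re.findall(r"\w+", expression, re.UNICODE)]

# Reporter's sentences glued together by punctuation only.
fixed = split_words("Und warum gehst du dann nicht runter?Weil es mir hier unheimlich ist.")

# Made-up input showing a side effect of \w: digits and underscores count as word characters.
with_digits = split_words("Staffel 1, Folge_2")

print(fixed)
print(with_digits)
```

One thing to keep in mind: unlike the old pattern, `\w+` also matches digits and underscores, so numbers on a card can show up as morphs, and hyphenated words are split at the hyphen.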

cocowash commented 1 year ago

Thanks, I tried the line replacement and it works fine.