Import regex module for \p{L}

mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.

https://mortii.github.io/anki-morphs/

Mozilla Public License 2.0

55 stars 8 forks source link

Import regex module for \p{L} #28

Closed Vilhelm-Ian closed 10 months ago

Vilhelm-Ian commented 10 months ago

https://github.com/kaegi/MorphMan/issues/309

The current regex fails to pick up on cases where a non letter character is between letters.

mortii commented 10 months ago

Could we not just add a parse option that replaces newlines with a space?

mortii commented 10 months ago

And for punctuation

Vilhelm-Ian commented 10 months ago

yes we can. That's the other route

mortii commented 10 months ago

Maybe it should be bundled with the existing "ignore quotation marks"-option since it basically does the same thing. We could rename it to "ignore special characters" or something.

Vilhelm-Ian commented 10 months ago

That's a good idea

mortii commented 10 months ago

Are you working on it or should I do it?

Vilhelm-Ian commented 10 months ago

I can give it a go. Currently working on the learn card now

Vilhelm-Ian commented 10 months ago

Maybe it should be bundled with the existing "ignore quotation marks"-option since it basically does the same thing. We could rename it to "ignore special characters" or something.

Then if you disable ignoring quotation marks and your multi-line cards are parsed with words from different lines joined, that will just be wrong. Doesn't seem like an option should be needed to remove what is always wrong, especially bundled with something more optional.

morphman currently distinguishesh newline. It dosen't have issue with new lines

Vilhelm-Ian commented 10 months ago

only when a special character is between words

mortii commented 10 months ago

Then if you disable ignoring quotation marks and your multi-line cards are parsed with words from different lines joined, that will just be wrong. Doesn't seem like an option should be needed to remove what is always wrong, especially bundled with something more optional.

only when a special character is between words

Right, so \n doesn't have to be bundled. but the other special characters should still be bundled? There would never be a reason to only ignore quotation marks but not the other problematic characters right?

Vilhelm-Ian commented 10 months ago

Correct

mortii commented 10 months ago

I can give it a go. Currently working on the learn card now

Nice. Making changes to settings and config has a steep learning curve (it took me forever to make), so I can do that part if it's too bothersome. You should definitely be the one who add characters to this regex, you are more familiar with the problem than me:

https://github.com/mortii/anki-morphs/blob/7a635f27534eda800b9d88503683931a55d2fe90/ankimorphs/morph_utils.py#L39-L42

mortii commented 10 months ago

Once this is done I'll create a 0.3.0 release

Vilhelm-Ian commented 10 months ago

I will try to do it today

Vilhelm-Ian commented 10 months ago

Dumb question why don't we just do \w+ see the difference between the current approach:

 m = "hat?Das test?me fruit.loop m>h zz$oo sh»sh öhhä&öö 市市市市 βββΣΣΣΣ-ΣΣΣΣΣΣ"
re.findall(r"\w+", m)
['hat', 'Das', 'test', 'me', 'fruit', 'loop', 'm', 'h', 'zz', 'oo', 'sh', 'sh', 'öhhä', 'öö', '市市市市', 'βββΣΣΣΣ', 'ΣΣΣΣΣΣ']

The current regex fails to notice special characters when they are inside a word

 re.findall(r"\b[^\s\d]+\b", m)
['hat?Das', 'test?me', 'fruit.loop', 'm>h', 'zz$oo', 'sh»sh', 'öhhä&öö', '市市市市', 'βββΣΣΣΣ-ΣΣΣΣΣΣ']

Vilhelm-Ian commented 10 months ago

I suggest the following regex \w+-\w+|\w+ becuse sometimes you have words like Spider-Man E-mail World-wide Up-to-date User-friendly

If there are other edge cases we can add them later on in the same way

mortii commented 10 months ago

Ah, I hadn't considered making changes to only that specific morphemizer, that is a way better solution. We should probably remove the settings option then since it would be redundant and just slows the morphemizing process.

mortii commented 10 months ago

Apostrophes should also be included right?

\w+[-']\w+|\w+

Vilhelm-Ian commented 10 months ago

Yes things like o'clock

mortii commented 10 months ago

To match words with multiple hyphens , e.g. "god-forsaken-man", we need:

\w+([-']\w+)*

Edit: It actually needs to be:

\w+(?:[-']\w+)*

This does not output the grouping like the previous regex.

Vilhelm-Ian commented 10 months ago

@mortii Can you add this as a comment near the regex " "god-forsaken-man"

Vilhelm-Ian commented 10 months ago

@mortii the current version is working perfectly. The change we made with the regex has a great effect on my collection A BUNCH(and I mean a bunch) of cards that were prior skipped are now shown to me :)

mortii commented 10 months ago

@mortii Can you add this as a comment near the regex " "god-forsaken-man"

Sure, I should add some info about the '?:' part too

@mortii the current version is working perfectly. The change we made with the regex has a great effect on my collection A BUNCH(and I mean a bunch) of cards that were prior skipped are now shown to me :)

That is amazing! You did most of the work, well done! :clap:

github-actions[bot] commented 6 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.