Closed Vilhelm-Ian closed 10 months ago
Could we not just add a parse option that replaces newlines with a space?
And for punctuation
yes we can. That's the other route
Maybe it should be bundled with the existing "ignore quotation marks"-option since it basically does the same thing. We could rename it to "ignore special characters" or something.
That's a good idea
Are you working on it or should I do it?
I can give it a go. Currently working on the learn card now
Maybe it should be bundled with the existing "ignore quotation marks"-option since it basically does the same thing. We could rename it to "ignore special characters" or something.
Then if you disable ignoring quotation marks and your multi-line cards are parsed with words from different lines joined, that will just be wrong. Doesn't seem like an option should be needed to remove what is always wrong, especially bundled with something more optional.
morphman currently distinguishesh newline. It dosen't have issue with new lines
only when a special character is between words
Then if you disable ignoring quotation marks and your multi-line cards are parsed with words from different lines joined, that will just be wrong. Doesn't seem like an option should be needed to remove what is always wrong, especially bundled with something more optional.
only when a special character is between words
Right, so \n doesn't have to be bundled. but the other special characters should still be bundled? There would never be a reason to only ignore quotation marks but not the other problematic characters right?
Correct
I can give it a go. Currently working on the learn card now
Nice. Making changes to settings and config has a steep learning curve (it took me forever to make), so I can do that part if it's too bothersome. You should definitely be the one who add characters to this regex, you are more familiar with the problem than me:
Once this is done I'll create a 0.3.0 release
I will try to do it today
Dumb question why don't we just do \w+ see the difference between the current approach:
m = "hat?Das test?me fruit.loop m>h zz$oo sh»sh öhhä&öö 市市市市 βββΣΣΣΣ-ΣΣΣΣΣΣ"
re.findall(r"\w+", m)
['hat', 'Das', 'test', 'me', 'fruit', 'loop', 'm', 'h', 'zz', 'oo', 'sh', 'sh', 'öhhä', 'öö', '市市市市', 'βββΣΣΣΣ', 'ΣΣΣΣΣΣ']
The current regex fails to notice special characters when they are inside a word
re.findall(r"\b[^\s\d]+\b", m)
['hat?Das', 'test?me', 'fruit.loop', 'm>h', 'zz$oo', 'sh»sh', 'öhhä&öö', '市市市市', 'βββΣΣΣΣ-ΣΣΣΣΣΣ']
I suggest the following regex \w+-\w+|\w+
becuse sometimes you have words like
Spider-Man
E-mail
World-wide
Up-to-date
User-friendly
If there are other edge cases we can add them later on in the same way
Ah, I hadn't considered making changes to only that specific morphemizer, that is a way better solution. We should probably remove the settings option then since it would be redundant and just slows the morphemizing process.
Apostrophes should also be included right?
\w+[-']\w+|\w+
Yes things like o'clock
To match words with multiple hyphens , e.g. "god-forsaken-man", we need:
\w+([-']\w+)*
Edit: It actually needs to be:
\w+(?:[-']\w+)*
This does not output the grouping like the previous regex.
@mortii Can you add this as a comment near the regex " "god-forsaken-man"
@mortii the current version is working perfectly. The change we made with the regex has a great effect on my collection A BUNCH(and I mean a bunch) of cards that were prior skipped are now shown to me :)
@mortii Can you add this as a comment near the regex " "god-forsaken-man"
Sure, I should add some info about the '?:' part too
@mortii the current version is working perfectly. The change we made with the regex has a great effect on my collection A BUNCH(and I mean a bunch) of cards that were prior skipped are now shown to me :)
That is amazing! You did most of the work, well done! :clap:
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
https://github.com/kaegi/MorphMan/issues/309
The current regex fails to pick up on cases where a non letter character is between letters.