mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0
52 stars · 7 forks

Morphemes are not recognized as it #124

Closed cocowash closed 7 months ago

cocowash commented 7 months ago

Describe the bug In some cards, for unknown reasons, certain words are not recognized as morphemes.

To Reproduce Steps to reproduce the behavior: It is easier to notice if am-highlighted is active. I don't know what is causing it, so reproducing it is a wild guess. If I select 'view morphemes' in the AnkiMorphs tab, it confirms that no morphs are found for that note.

Expected behavior Any word should be detected as a morpheme.

Screenshots Captura de pantalla 2024-01-12 144738. In this case, the affected words are "bewegen" and "Was".

Captura de pantalla 2024-01-13 101008. This is an example of how the code in am-highlighted is shown.


Additional context I would say that maybe 10% of the words in my collection are not recognized as morphemes. It is not a big quantity, but it bothers me a little bit. I'm willing to share the affected deck if it's useful for solving the error.

mortii commented 7 months ago

@cocowash thanks for the feedback!

I would say that maybe 10% of words in my collection are not recognized as morphemes

Damn, that is a lot...

I actually have some insights into this. In the spaCy testing suite, we found that the German models mistakenly classify 'was' as a proper noun: https://github.com/mortii/anki-morphs/blob/7581c09159146d7128a66fbc5c864d4543591d01/tests/spacy_test.py#L164-L175

If you go to settings -> preprocess, and unselect 'ignore names found by morphemizer', do more morphs show up?

If this turns out to be a universal problem then we should give a warning about it in the guide.

cocowash commented 7 months ago

Turning off 'ignore names found by morphemizer' increased the morphemes found. The ones that are not recognized seem to have the same behavior as I reported in https://github.com/kaegi/MorphMan/issues/309. Captura de pantalla 2024-01-13 125517. "Ahnung.Ich" should not be considered a morpheme.

[By the way: in order to update the information in am-highlighted, a Recalc wasn't sufficient; I had to batch edit the content to blank in order for it to be written again after doing a Recalc. Is this behavior intended (i.e. to populate information only in blank fields)? Is there a way to force AnkiMorphs to repopulate all the information in Anki?]

mortii commented 7 months ago

Turning off 'ignore names found by morphemizer' increased the morphemes found.

Nice. Unfortunate that it's not very accurate for German.

"Ahnung.Ich" should not be considered a morpheme

That should have been fixed, not sure why it's not working.... I'll look into it.

by the way, in order to update the information in am-highlighted a Recalc wasn't sufficient, I had to batch edit the content to blank in order for it to be written again after doing a Recalc, is this behavior intended?

Absolutely not, am-highlighted should update regardless of any previous content. I actually have no idea how that could have happened.... Are you able to reproduce it?

mortii commented 7 months ago

I'm willing to share the affected deck if its useful to solve the error.

@cocowash that would be great, especially for making sure the morph splitting works

mortii commented 7 months ago

@cocowash I added a comment about the morph splitting problem in https://github.com/mortii/anki-morphs/issues/125#issuecomment-1890478274

cocowash commented 7 months ago

Absolutely not, am-highlighted should update regardless of any previous content. I actually have no idea how that could have happened.... Are you able to reproduce it?

It seems that if the field analyzed by AnkiMorphs is empty, it won't change anything in am-highlighted. That's why I thought it didn't work. (I was switching between using the sentence and the specific vocabulary word, and since I have sentences with no specific vocabulary, that led me to think after a Recalc that it was a general behavior.)

My vocab deck is here. The cards that are trickier to detect fully correctly can be found by the tag "MiningReady". They're made from TV subtitles.

mortii commented 7 months ago

My vocab deck is here. The cards that are trickier to detect fully correctly can be found by the tag "MiningReady". They're made from TV subtitles.

Thanks!

The card you showed here (cid:1698862497669):

has a control character, U+000A (line feed), between "Ahnung." and "Ich", hence the splitting problem.

It seems that if the field analyzed by Ankimorph is empty, it won't change anything in am-highlighted. That's why I thought it didn't work.

Weird, I'll take a look.

mortii commented 7 months ago

It seems that if the field analyzed by Ankimorph is empty, it won't change anything in am-highlighted.

This does indeed seem to be a bug.

I'm also getting "U: 4035 A: 12732", even though none of the cards have been studied... crazy bug.

mortii commented 7 months ago

Oooooooh, I realized what happened.

When you accidentally used the empty field, AnkiMorphs determined that you knew all the text on the card, so it was given the tag 'am-known-automatically'. Now it will treat any text on that card as known, even when you switch fields, and it will set all the morphs to known.

You have to remove the tag from the cards for them to work properly again.

cocowash commented 7 months ago

has a weird unicode character called control-000A between "Ahnung." and "Ich", hence the splitting problem.

Well, if I copy the text from the source field to a Word document with 'show everything' activated, spaces are shown as spaces. Only when I copy the text from am-highlighted do I see the control-000A acting as a space instead. Captura de pantalla 2024-01-13 180046

Another thing that looked interesting to me is that the morphs that failed seem to follow the structure word + punctuation mark + <br>

So far I have found four in my library:

besitzt.<br>
Yura.<br>
gepflegt.<br>

Captura de pantalla 2024-01-13 180457. In this case, I also don't know why the model chose to split the verb "Sieh", matching it as the noun "sie" plus "h".

mortii commented 7 months ago

Well, if I copy the text from the source field to a Word document with show everything activated, spaces are shown like spaces. Only if I copy the text from am-highlighted is when I see the control-000A acting as spaces instead.

I'm not completely sure what you are trying to say here.

If you copy the text from the source field into the unicode code converter, you can see it there.

The U+000A character is basically a bad version of a newline character because it gets handled in inconsistent ways by different systems; as we see in Anki, it basically works as a newline, but it is completely ignored by spaCy.

This is only really a problem with ".\<000A>" because all the other non-alphanumeric characters split the text. As you can see in the link above, the same problem does not happen with "an!\<000A>Schnauze!", which also occurs in the same text.


This is another example of how the character causes inconsistent behaviour, apparently being converted into "\n" by the Python regex module.
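A quick way to see the hidden character is to print the code points directly in Python. The sample string below is an assumption based on the screenshots in this thread:

```python
# Inspect the code points of a sample string (assumed from the screenshots);
# the line break that Anki renders corresponds to U+000A (line feed) here
text = "Ahnung.\nIch"

code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)
# The character between "." and "I" prints as "U+000A"
```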

In this case also, I don't know why the model chose to split the verb Sieh to match it as a noun sie h

I wish the german models were better at proper nouns :cry:

cocowash commented 7 months ago

I'm not completely sure what you are trying to say here.

Never mind, your explanation clarified how to look at it properly. Now, regarding the issue, would it be possible to normalize the <000A> as a regular space before spaCy analyzes the data?

mortii commented 7 months ago

@cocowash

Sorry about my wording:

I'm not completely sure what you are trying to say here.

I just realized that it could be interpreted as aggressive and/or condescending, which is the opposite of what I intended. Forgive me for my clumsy English :pray:

Now, regarding the issue, would it be possible to normalize the <000A> as a regular space before spaCy analyze the data?

It is possible; we used to have a preprocess option for adding whitespace after punctuation marks, but we removed it after updating the regex used by the (now inferior) 'Languages \w spaces' morphemizer, which made it obsolete at the time.

@Vilhelm-Ian had the great suggestion of using the find and replace option in Anki, and I did find a setting which worked on this card:

It does just what I described above: adds a whitespace before the U+000A character:

Screenshot from 2024-01-15 13-08-26

The find field uses regex; the "replace with" field does not. It's hard to tell from the picture, but there is a whitespace between "." and "\n" in the "replace with" field.

After I ran that command and recalced, I got these morphs: Screenshot from 2024-01-15 13-27-04

I'm not sure if we should add back the preprocess option or if we should just recommend the search and replace approach instead. Search and replace is a one-time fix, whereas the preprocess option will make every recalc more expensive. Which one would you prefer, @cocowash @Vilhelm-Ian?
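The Anki find-and-replace described above can be reproduced with Python's `re` module. The exact pattern in the screenshot is not visible, so the regex and sample text below are assumptions that match the described fix (insert a space between a period and the line feed):

```python
import re

# Hypothetical version of the find-and-replace:
# add a space between "." and the U+000A line feed
text = "Keine Ahnung.\nIch habe keine Ahnung."
fixed = re.sub(r"\.\n", ". \n", text)
print(fixed)  # "Keine Ahnung. \nIch habe keine Ahnung."
```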

cocowash commented 7 months ago

I just realized that it could be interpreted as aggressive and/or condescending, which is the opposite of what I intended. Forgive me for having clumsy english 🙏

No worries, in fact I thought the contrary: I felt I was being attended to. In general, it's just that written text lacks the intonation and other aspects of social interaction, which makes it feel a little bit colder compared to spoken language.

I'm not sure if we should add back the preprocess option or if we should just recommend the search and replace approach instead. Search and replace is a one-time fix, whereas the preprocess option will make every recalc more expensive. Which one would you guys prefer @cocowash @Vilhelm-Ian ?

Well, I think that AnkiMorphs really shines for mining sentences, which implies having a big collection of cards that possibly has this problem. From my perspective, it's more valuable to do this process by hand and have a lighter Recalc (in terms of time and CPU resources) than the other way around.

What I would recommend is to add this solution to the documentation.

Vilhelm-Ian commented 7 months ago

which one would you guys prefer

Both have the same issue. If you enable it by default, people will complain about their cards being magically changed. If we don't, people will complain that it's not working as expected. If we make it optional, then we are adding something extra to the add-on that Anki already has built in.

I like keeping the code as simple and as small as possible. So I say we just add the above regexes, plus the one about HTML tags that I helped HQ with, to the docs.

mortii commented 7 months ago

I agree, no preprocess option then.

I'll add a description of the problem and the solution to the 'known problems' section in the guide.

A couple of other things I've thought about:

  1. Maybe the German transformer model isn't as terrible as the conventional ones when it comes to nouns. If it is significantly better, then it might be worth supporting them in AnkiMorphs. I'll investigate.
  2. The info icon in the 'view morphemes' dialog is ugly and unnecessary, so I'll remove it. It might be better to use a table widget with a scrollbar too, to prevent the dialog from exploding vertically when there are many morphs.

mortii commented 7 months ago

The German transformer model is just as bad, unfortunately :(

mortii commented 7 months ago

Is this understandable? Any suggestions for making it better?

===================================================================================

Morphs don't split on punctuation marks

Most morphemizers don't split text on punctuation marks because it would split phrases like 10 a.m. into [10, a, m], which would be less than ideal.

This can cause problems if there are line breaks in the Anki text:

Hello.
Goodbye.

The text is actually stored as:

Hello.<br>Goodbye.

Most morphemizers completely ignore the unicode equivalent of <br>, which results in them interpreting the text as:

Hello.Goodbye.

To fix this problem, you can add a whitespace between the punctuation mark and the <br> tag. This can be done in bulk with the find and replace feature in the Anki browser:

find_and_replace_split

===================================================================================
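The behaviour the guide describes can be sketched in Python. Stripping the `<br>` without adding whitespace stands in for a morphemizer that ignores the line break (an assumption for illustration only):

```python
# Text as stored in the Anki field
stored = "Hello.<br>Goodbye."

# A morphemizer that drops the break without adding whitespace sees one blob:
no_space = stored.replace("<br>", "")
print(no_space)  # "Hello.Goodbye."

# Adding a space before the <br> tag (the guide's fix) keeps the words apart:
fixed = stored.replace(".<br>", ". <br>")
print(fixed)  # "Hello. <br>Goodbye."
```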

Vilhelm-Ian commented 7 months ago

LGTM

cocowash commented 7 months ago

I would suggest having a box to just copy and paste the commands, before or after the last image:
find: .<br>
replace with: . <br>

Is this the same as the other picture, just without using regular expressions?

mortii commented 7 months ago

I would suggest having a box to just copy and paste the commands before or after the last image: .<br> / . <br>

great idea

Is this the same as the other picture, but just without using regular expressions?

Yes; since the expression is not variable, it doesn't need to use regex.

mortii commented 7 months ago

Docs are updated: https://mortii.github.io/anki-morphs/user_guide/known-problems.html

Thanks for all the feedback! Let me know if you have any more :)

ashprice commented 7 months ago

@cocowash thanks for the feedback!

I would say that maybe 10% of words in my collection are not recognized as morphemes

Damn, that is a lot...

I actually have some insights into this. In the spaCy testing suite, we found that the German models mistakenly classifies 'was' as a proper noun:

https://github.com/mortii/anki-morphs/blob/7581c09159146d7128a66fbc5c864d4543591d01/tests/spacy_test.py#L164-L175

If you go to settings -> preprocess, and unselect 'ignore names found by morphemizer', do more morphs show up?

If this turns out to be a universal problem then we should give a warning about it in the guide.

Sorry for the slow reply on the other thread, I kinda forgot... Looking through issues for other reasons...

It's not really relevant to the issue, but I felt like pointing out that it isn't saying 'was' is a proper noun, but rather a pronoun (the labels are PROPN vs PRON), specifically an interrogative pronoun (try token.morph.get("PronType")). You can ask it to clarify labels with spacy.explain.

mortii commented 7 months ago

@ashprice great catch! The abbreviations are so similar that I got lost in them. I have to check that I haven't mixed them up in other places; it could have a big impact.

mortii commented 7 months ago

Today I learned that 'what' is an 'interrogative pronoun'. Fascinating.

Pronouns have pos tag 95:

w.text: Was
w.pos: 95
w.pos_: PRON

Proper nouns have pos tag 96:

w.text: Harry
w.pos: 96
w.pos_: PROPN

so the current proper noun filter is correct: https://github.com/mortii/anki-morphs/blob/b33f28074033297ad826c453f46914e4eae5f9d1/ankimorphs/text_preprocessing.py#L24-L26
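The linked filter boils down to discarding tokens tagged as proper nouns. The sketch below mimics that with hard-coded (text, pos_) pairs taken from the quoted output above, since it doesn't run the actual spaCy German model:

```python
# Hypothetical tokens mirroring the spaCy output quoted above: (text, pos_)
tokens = [("Was", "PRON"), ("Harry", "PROPN"), ("Sieh", "VERB")]

# A proper-noun filter keeps PRON and VERB but drops PROPN,
# so 'Was' correctly survives the filter
kept = [text for text, pos in tokens if pos != "PROPN"]
print(kept)  # ['Was', 'Sieh']
```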

In this instance:

'was' is not recognized because of the punctuation mark splitting problem.

This however, seems to actually be an AnkiMorphs bug:

In this case also, I don't know why the model chose to split the verb Sieh to match it as a noun sie h

'Sieh' gets interpreted as:

w.text: Sieh
w.pos: 100
w.pos_: VERB

I have no idea why the 'h' is not included.... I'll investigate.

mortii commented 7 months ago

This however, seems to actually be an AnkiMorphs bug:

In this case also, I don't know why the model chose to split the verb Sieh to match it as a noun sie h

Actually, never mind, I just realized that this is also the same punctuation mark splitting problem.

github-actions[bot] commented 6 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.