languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[de] check lists of German compounds #5775

Open jaumeortola opened 2 years ago

jaumeortola commented 2 years ago

I cannot make a complete build with tests because of errors in GermanCompoundRuleTest(). There are errors in these two lines:

check(1, "Nur im Stand by Betrieb", "Stand-by-Betrieb");
check(1, "Blu ray Brenner", "Blu-ray-Brenner");

I get a spelling error in "Stand-by-Betrieb", but not in "Blu-ray-Brenner". It is very strange.

3603 rules activated for language German (Germany)
1.) Line 1, column 1, Rule ID: GERMAN_SPELLER_RULE prio=-3
Message: Möglicher Tippfehler gefunden.
Suggestion: Stand-Öl-Betrieb
Stand-by-Betrieb 
^^^^^^^^^^^^^^^^

danielnaber commented 2 years ago

Could you post the exact failure message? The tests work for me.

jaumeortola commented 2 years ago

It happens only on my desktop computer (Java 8 and Java 11), not on my laptop.

[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.827 s <<< FAILURE! - in org.languagetool.rules.de.GermanCompoundRuleTest
[ERROR] testRule(org.languagetool.rules.de.GermanCompoundRuleTest)  Time elapsed: 4.795 s  <<< FAILURE!
java.lang.AssertionError: Expected 1 error(s), but got: [] expected:<1> but was:<0>
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.failNotEquals(Assert.java:835)
    at org.junit.Assert.assertEquals(Assert.java:647)
    at org.languagetool.rules.AbstractCompoundRuleTest.check(AbstractCompoundRuleTest.java:59)
    at org.languagetool.rules.de.GermanCompoundRuleTest.runTests(GermanCompoundRuleTest.java:65)
    at org.languagetool.rules.de.GermanCompoundRuleTest.testRule(GermanCompoundRuleTest.java:36)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

danielnaber commented 2 years ago

Is that under Windows or Linux? Any other idea what the difference between both computers might be?

jaumeortola commented 2 years ago

Both computers run Linux.

The problem is probably in the speller rule. The GermanCompoundRule tests the suggestion with GermanSpellerRule, and if the suggestion is misspelled, the rule doesn't match.
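
The interaction described above can be illustrated with a minimal sketch. It is not LanguageTool's actual code: the class `CompoundSuggestionSketch`, the method `suggest`, and the `Predicate<String>` standing in for the speller are all hypothetical names chosen for illustration. The sketch only shows why a speller that rejects the hyphenated form makes the compound rule report zero matches.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from LanguageTool itself.
final class CompoundSuggestionSketch {

    /**
     * Builds the hyphenated form of a space-separated phrase and returns it
     * only if the speller accepts it. An empty result stands for
     * "the compound rule does not match".
     */
    static Optional<String> suggest(String phrase, Predicate<String> spellerAccepts) {
        String candidate = phrase.trim().replaceAll("\\s+", "-");
        return spellerAccepts.test(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for a speller that accepts the full compound.
        Set<String> acceptsCompound = new HashSet<>(Arrays.asList("Stand-by-Betrieb"));
        // Stand-in for a speller that only knows the individual parts.
        Set<String> acceptsPartsOnly = new HashSet<>(Arrays.asList("Stand-by", "Betrieb"));

        System.out.println(suggest("Stand by Betrieb", acceptsCompound::contains));
        System.out.println(suggest("Stand by Betrieb", acceptsPartsOnly::contains));
    }
}
```

With the second stand-in speller the result is empty, which corresponds to the "Expected 1 error(s), but got: []" failure in the test.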

jaumeortola commented 2 years ago

On the computer with the failing test, hunspell.spell("Blu-ray-Brenner") and hunspell.spell("Stand-by-Betrieb") are false. On the other computer, they are true.

What is the expected result? In the German spelling dictionary (de_DE), you can find Blu-ray and Stand-by, but not Blu-ray-Brenner and Stand-by-Betrieb.

Should these words be added to spelling.txt? The question, then, is why they are now accepted if they are not in the spelling dictionary. Because of word compounding?

jaumeortola commented 2 years ago

@udomai There are still some German compounds to revise manually. They could be suggested by the compound rule, but they are not, because the spelling rule treats them as misspellings.

These are the warnings (on both computers): de-compound-warnings-both-computers.txt

And these are the warnings that appear only on the computer with the failing test: de-compound-warnings-one-computer.txt. All of these are three-part compounds whose first two parts form a word in the dictionary: Hi-Fi-Turmes, Know-how-Transfer, No-Name-Produkt, Stand-by-Betrieb...

jaumeortola commented 2 years ago

After updating Ubuntu from 18.04 to 20.04, the problem goes away. The Hunspell library that we use is somewhat unstable. Anyway, I don't know whether the current results for German compounds are the desired ones (in the speller rule and in the compound rule). The words in my previous message have to be revised.

tiff commented 2 years ago

@jaumeortola I'm currently checking this list and making sure that these terms are in our dictionary. However, something is wrong with the logic for deciding whether a word is in the dictionary: e.g. REFA-Fachleute is in the list of German compounds and is not marked as a spelling mistake when you type it, but when you type REFA Fachleute (without a hyphen), no suggestion is made. I guess the logic only checks whether the entire word is in the dictionary, not whether both words exist there individually. At least in German and English, you can connect two words (that are in the dictionary) with a hyphen and no spelling mistake is shown.
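
A minimal sketch of the check proposed here, assuming a plain in-memory word set in place of the real dictionary: a hyphenated word is accepted if it is listed as a whole, or if its hyphen-separated segments can be grouped into listed entries (so Stand-by-Betrieb passes when Stand-by and Betrieb are both listed). The class and method names are hypothetical, and this is neither Hunspell's nor LanguageTool's actual compounding logic.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the dictionary is a plain Set, not a real speller.
final class HyphenCompoundCheck {

    /**
     * Returns true if the word is in the dictionary, or if its hyphen-separated
     * segments can be grouped into consecutive dictionary entries.
     */
    static boolean isAccepted(String word, Set<String> dictionary) {
        if (dictionary.contains(word)) {
            return true;
        }
        // The -1 limit keeps empty trailing segments, so "Betrieb-" is rejected.
        String[] segments = word.split("-", -1);
        int n = segments.length;
        // reachable[i] is true when segments[0..i) can be covered by dictionary entries.
        boolean[] reachable = new boolean[n + 1];
        reachable[0] = true;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end && !reachable[end]; start++) {
                if (!reachable[start]) {
                    continue;
                }
                String part = String.join("-", Arrays.copyOfRange(segments, start, end));
                reachable[end] = dictionary.contains(part);
            }
        }
        return reachable[n];
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("Stand-by", "Betrieb", "REFA", "Fachleute"));
        System.out.println(isAccepted("Stand-by-Betrieb", dictionary)); // true
        System.out.println(isAccepted("REFA-Fachleute", dictionary));   // true
        System.out.println(isAccepted("Stand-Öl-Betrieb", dictionary)); // false
    }
}
```

Grouping segments rather than checking each one separately matters for entries such as Stand-by, where "by" on its own need not be a dictionary word.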

jaumeortola commented 2 years ago

> At least in German and English, you can connect two words (that are in the dictionary) with a hyphen and no spelling mistake is shown.

I see. The speller we use in the rule doesn't expect multi-token words. I can change it if you want. But having the words in the spelling dictionary (especially if they are common words) is better.

tiff commented 2 years ago

I'd prefer having it changed. At the same time, I also prefer having them in the dictionary. Especially in English, we do not have many multi-word tokens in the dictionary. I will take care of adding the missing ones over the next few weeks.