languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

EN_COMPOUNDS: preventing false positives #1318

Open MikeUnwalla opened 5 years ago

MikeUnwalla commented 5 years ago

EN_COMPOUNDS is a Java rule that gets its data from compounds.txt. Some false positives occur because neither the rule nor the data has any part-of-speech (POS) information. Example: 'I think that girl cut her hair to give herself a new look.' The rule says that 'new look' is normally spelled with a hyphen. However, the hyphenated compound adjective 'new-look' is correct only before a noun. In the example sentence, the structure is adjective + noun, so the text is correct. More examples of FP:

Is it possible to improve the Java rule such that compounds.txt also supplies POS information and the Java rule uses that information to prevent false positives?
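The idea in this request can be sketched roughly as follows. This is only an illustration, not the EN_COMPOUNDS implementation: the class name, the method, and the hand-written Penn-Treebank-style tags are assumptions, and a real rule would get its tags from the tagger. The sketch suggests the hyphen only when the two-word phrase is followed by a noun, which is the compound-adjective position.

```java
import java.util.Arrays;
import java.util.List;

public class PosAwareCompoundSketch {

    /**
     * Returns true if the phrase "first second" starts at index i and is
     * directly followed by a noun (a tag that starts with "NN").
     * Only in that position is the hyphenated adjective suggested.
     */
    static boolean shouldSuggestHyphen(List<String> words, List<String> tags,
                                       int i, String first, String second) {
        if (i < 0 || i + 1 >= words.size()) {
            return false;
        }
        boolean phraseMatches = words.get(i).equalsIgnoreCase(first)
                && words.get(i + 1).equalsIgnoreCase(second);
        if (!phraseMatches) {
            return false;
        }
        int next = i + 2;
        return next < tags.size() && tags.get(next).startsWith("NN");
    }

    public static void main(String[] args) {
        // Hypothetical, hand-tagged input: "... a new look ."
        List<String> words1 = Arrays.asList("herself", "a", "new", "look", ".");
        List<String> tags1 = Arrays.asList("PRP", "DT", "JJ", "NN", ".");
        System.out.println(shouldSuggestHyphen(words1, tags1, 2, "new", "look"));

        // Hypothetical, hand-tagged input: "its new look team ."
        List<String> words2 = Arrays.asList("its", "new", "look", "team", ".");
        List<String> tags2 = Arrays.asList("PRP$", "JJ", "NN", "NN", ".");
        System.out.println(shouldSuggestHyphen(words2, tags2, 1, "new", "look"));
    }
}
```

With this check, the example sentence from this issue is not flagged, because a full stop follows 'new look'. A phrase such as 'its new look team' is flagged, because a noun follows.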

(For information about compound adjectives, refer to https://www.grammarbook.com/punctuation/hyphens.asp, 'Hyphens Between Words', Rule 1.)

f-knorr commented 5 years ago

I am not sure whether the Java rule should be extended to use POS information. For this, we already have grammar.xml, which allows for sophisticated rules. The compounds.txt file used by EN_COMPOUNDS is just a quick way of "writing rules" for hyphenated compounds. If there are false positives, I suggest removing the corresponding entries from compounds.txt and writing specific rules for these entries.

MikeUnwalla commented 5 years ago

TODO: well-x.

For each term that I removed from compounds.txt, this list shows a counter-example or a reference to show that the term can be spelled with a space:

Do not let the back drop to the floor.
It's about time the weather started cooling off.
The maniacal murderer cut throats with abandon.
When the trainee soldiers fire arms, they must be very careful.
We use this kiln to fire brick only.
These factories fire bricks of many shapes and sizes.
Are we to let these paramilitary forces fire water cannons at our citizens?
Although he got a good first hand, he lost most of the games of cards.
fox hunting https://en.oxforddictionaries.com/definition/fox_hunting
Apply grease to the free wheel.
Apply grease to the free wheels.
master stroke https://en.oxforddictionaries.com/definition/master_stroke
At this college, we do not let a master work unless he has a...
The games master works very hard to encourage all the students.
billy-goat https://en.oxforddictionaries.com/definition/billy_goat
blood-money https://en.oxforddictionaries.com/definition/blood_money
blood-heat https://en.oxforddictionaries.com/definition/blood_heat
Have you got the blue pencils?
brand-new https://en.oxforddictionaries.com/definition/brand_new
Paint the brick red.
Did you see the buck passing through the glen?
The hungry bug eyed its prey.
...details are stored on a database with the card carrying personal information.
coast-to-coast https://en.oxforddictionaries.com/definition/coast_to_coast
cow-parsley https://en.oxforddictionaries.com/definition/cow_parsley
cut-and-paste https://en.oxforddictionaries.com/definition/cut_and_paste
do-or-die https://en.oxforddictionaries.com/definition/do_or_die
down-and-out https://en.oxforddictionaries.com/definition/down_and_out
down-and-outs
For the dual purpose of testing and evaluation.
... and duty bound him to serve his country.
... and let your duty free you from having to choose.
The eagle eyed its prey.
... his eye catching a glimpse of the burglar.
fellow-traveller https://en.oxforddictionaries.com/definition/fellow_traveller
fellow-travellers
These pens have felt tips.
We will fly by night to America and will arrive early morning.
The force fed its new recruits well.
She is a great aunt. [=She is a wonderful aunt]
I love my great aunts. [=I love my wonderful aunts.]
great-grandaunt
great-grandchild
great-grandchildren
great-granddaughter
great-grandfather
great-grandmother
great-grandmothers
great-grandnephew
great-grandniece
great-grandparent
great-grandparents
great-grandson
great-granduncle
great-nephew
great-niece
great-uncle
When the moon is half light and half dark...
Do we ever think about how the other half lives, and about how they might live?
The trinket was passed from hand to hand.
The parasites jump from head to head, and thus transmit disease.
This is a heavy duty I bear.
ice-skate https://en.oxforddictionaries.com/definition/ice_skate
ice-skating
You can use a carborundum stone to sharpen a knife edge.
All right-thinking people want to lead free lives.
mace-bearer https://en.oxforddictionaries.com/definition/mace_bearer
mace-bearers
I saw mom and pop yesterday.
The girl cut her hair to give herself a new look.
We have run out of body lotion.
We must not let this bullying president elect the other members of the ...
rags-to-riches https://en.oxforddictionaries.com/definition/(from)_rags_to_riches
sailing-master https://en.oxforddictionaries.com/definition/sailing_master
sailing-masters
Be careful when you switch blades; they are sharp.
These two timer units are broken.
The bullet ricocheted from wall to wall.
Insert the hard liner into the container.
hard-liners

ghost commented 5 years ago

I guess it is the same for all languages that form compounds; it is for Dutch, at least. It all depends on what the text means, so there will be exceptions (false positives) for every one of the compounds. I suggested to Daniel that using relative frequencies, and maybe ngrams, could help prevent false positives. Like the ngram rule, the compounds rule would benefit from an explanation for both alternatives.
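A minimal sketch of the relative-frequency idea, under stated assumptions: the class, the method, the threshold, and the counts in main are all hypothetical and do not come from LanguageTool or from real ngram data. The spaced form is flagged only when the hyphenated form is at least minRatio times more frequent in the same context.

```java
public class CompoundFrequencySketch {

    /**
     * Decides whether the spaced spelling should be flagged, given how often
     * each spelling occurs in the same context (for example, from ngram counts).
     */
    static boolean shouldFlagSpacedForm(long spacedCount, long hyphenatedCount,
                                        double minRatio) {
        if (hyphenatedCount == 0) {
            return false; // no evidence for the hyphenated spelling
        }
        if (spacedCount == 0) {
            return true; // only the hyphenated spelling has been seen
        }
        return (double) hyphenatedCount / spacedCount >= minRatio;
    }

    public static void main(String[] args) {
        // Invented counts, for illustration only.
        System.out.println(shouldFlagSpacedForm(900, 100, 3.0));
        System.out.println(shouldFlagSpacedForm(10, 500, 3.0));
    }
}
```

A high threshold keeps the rule quiet when both spellings are common, which is the case for most of the removed terms.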

MikeUnwalla commented 5 years ago

@baarsrj, thanks for suggesting to @danielnaber to use frequencies. I had planned to write grammar rules for most of the terms that I removed. (Not for ones such as 'ice-skating' that can have two variants for one lemma, but for the ones that have different lemmas depending on the meaning, as in the examples I gave in my first comment.) @danielnaber, if you decide to use frequencies, please tell me, so that I don't waste my time writing unnecessary rules.

ghost commented 5 years ago

@MikeUnwalla An alternative is to use 'wrong word in context', which can be tuned with regular expressions and has room for an explanation as well. That way, there is no need for separate XML rules.
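The context-based approach could look something like the sketch below. It does not use the real 'wrong word in context' data format; the class name and the list of context nouns are assumptions made up for illustration. A regular expression accepts the hyphen suggestion only when 'new look' is followed by one of the listed nouns.

```java
import java.util.regex.Pattern;

public class ContextRegexSketch {

    // Hypothetical context: nouns that might follow the adjective "new-look".
    private static final Pattern ADJECTIVE_CONTEXT = Pattern.compile(
            "\\bnew look (team|design|website|logo)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if the sentence contains "new look" in an adjective context. */
    static boolean shouldSuggestHyphen(String sentence) {
        return ADJECTIVE_CONTEXT.matcher(sentence).find();
    }

    public static void main(String[] args) {
        System.out.println(shouldSuggestHyphen(
                "I think that girl cut her hair to give herself a new look."));
        System.out.println(shouldSuggestHyphen(
                "The club unveiled its new look team."));
    }
}
```

The trade-off is coverage: a word list catches only the contexts that someone has written down, whereas a POS check generalizes to any following noun.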