languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

'case_sensitive' with 'inflected' gives unexpected results for a word at the start of a sentence #657

Closed MikeUnwalla closed 7 years ago

MikeUnwalla commented 7 years ago

Using LT 3.7 snapshot 2017-01-08.

<rule id="CASE_SENSITIVE_TEST" name="case-sensitive">
  <pattern>
    <token inflected="yes" case_sensitive="yes">test</token>
  </pattern>
  <message>Found: \1</message>
  <example correction="">This is a <marker>test</marker>.</example>
  <example>When you click <marker>Test</marker>, the ...</example>
  <example><marker>Test</marker> the rules carefully.</example>
</rule>

In the GUI, the rule above finds 'Test' if it is at the start of a sentence. Testrules gives this warning (in part):

  Test the rules carefully.
  Analyzed token readings: [/SENT_START*] Test[Test/NNP*,test/JJ*,test/NN*,test/VB*,test/VBP*,B-VP]  [ /null*] the[the/DT,B-NP-plural]  [ /null*] rules[rule/NNS,E-NP-plural]  [ /null*] carefully[carefully/RB,B-ADVP] .[./.*,./SENT_END*,O]
Matching Rule: CASE_SENSITIVE_TEST[1]

If I remove inflected="yes" from the rule, testrules does not give a warning, and the GUI correctly does not find 'Test'.

jaumeortola commented 7 years ago

I have tested the rule in the on-line rule editor and it seems consistent.

The attribute inflected="yes" means that the token is matched against the word's lemma, not its word form. At the start of the sentence, 'Test' has one capitalized lemma (Test/NNP) and several lower-case lemmas (test/JJ, test/NN, test/VB, test/VBP). With case_sensitive="yes", you choose between the capitalized lemma and the lower-case lemmas, so a pattern of 'test' still matches 'Test' through its lower-case lemmas.
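
To make the point concrete, here is a minimal Python sketch of that matching logic. It is only a toy model written for this thread, not LanguageTool's actual code: the names `Reading` and `token_matches` are hypothetical, and the readings are copied from the testrules output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One analysis of a token: a lemma plus a part-of-speech tag."""
    lemma: str
    postag: str


def token_matches(pattern: str, surface: str, readings: list[Reading],
                  inflected: bool, case_sensitive: bool) -> bool:
    """Toy model (an assumption, not LanguageTool's code) of a literal token match."""
    # inflected="yes": compare against every lemma; otherwise against the word form.
    candidates = [r.lemma for r in readings] if inflected else [surface]
    if not case_sensitive:
        pattern = pattern.lower()
        candidates = [c.lower() for c in candidates]
    return pattern in candidates


# Readings of sentence-initial 'Test', taken from the testrules output.
TEST_READINGS = [
    Reading("Test", "NNP"),
    Reading("test", "JJ"),
    Reading("test", "NN"),
    Reading("test", "VB"),
    Reading("test", "VBP"),
]

with_inflected = token_matches("test", "Test", TEST_READINGS,
                               inflected=True, case_sensitive=True)
without_inflected = token_matches("test", "Test", TEST_READINGS,
                                  inflected=False, case_sensitive=True)
print(with_inflected, without_inflected)  # prints: True False
```

In this model the lower-case lemma 'test' is what makes the rule fire on 'Test'. Once inflected="yes" is removed, only the word form 'Test' is compared, and the case-sensitive match fails.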

MikeUnwalla commented 7 years ago

Jaume, thanks for your comment. Although the behaviour is consistent, I still think that it is a bug.

The words 'template' and 'Template' are only NN, and 'templates' and 'Templates' are only NNS, but I get a similar problem.

<rule id="CASE_SENSITIVE_TEST_TEMPLATE" name="case-sensitive: template">
  <pattern>
    <token inflected="yes" case_sensitive="yes">template</token>
  </pattern>
  <message>Found: \1</message>
  <example correction="">If the <marker>template</marker> is not ...</example>
  <example type="triggers_error">Do not find <marker>Template</marker> because it has initial upper case.</example>
  <example type="triggers_error"><marker>Template</marker> should not be found.</example>
</rule>

The Development Overview (http://wiki.languagetool.org/development-overview) tells me:

If "attribute inflected="yes" means that you are looking into the word lemma, not the word form", then the information on Development Overview is not correct. The token 'Bicycle' is both NN and VB. The information about case_sensitive mentions nothing about lemmas or postags.

jaumeortola commented 7 years ago

The information on Development Overview is just a bit incomplete about using inflected & case_sensitive at the same time. It's understandable because this combination is rarely used.

Anyway, I don't know what you are trying to do. What do you need? In your last rule, if you use case_sensitive="no", the results will be the same.

If you need to exclude capitalized forms, you have to write something like this:

<rule id="CASE_SENSITIVE_TEST_TEMPLATE" name="case-sensitive: template">
  <pattern>
    <token inflected="yes">template<exception case_sensitive="yes" regexp="yes">Templates?</exception></token>
  </pattern>
  <message>Found: \1</message>
  <example correction="">If the <marker>template</marker> is not ...</example>
  <example>Do not find <marker>Template</marker> because it has initial upper case.</example>
  <example><marker>Template</marker> should not be found.</example>
</rule>
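
As a rough illustration of why the exception works, the following Python sketch models the rule above. It is a simplification and not LanguageTool's implementation: `rule_matches` is a hypothetical helper, and it assumes that 'template', 'Template', 'templates' and 'Templates' all share the lemma 'template'. The token is compared with the lemmas (ignoring case, the default), while the case-sensitive exception is checked against the word form.

```python
import re


def rule_matches(surface: str, lemmas: list[str]) -> bool:
    """Toy model (an assumption) of the token-plus-exception rule."""
    # <token inflected="yes">template: some lemma equals 'template', ignoring case.
    base = any(lemma.lower() == "template" for lemma in lemmas)
    # <exception case_sensitive="yes" regexp="yes">Templates?</exception>:
    # the whole word form must match the regular expression, case-sensitively.
    excluded = re.fullmatch(r"Templates?", surface) is not None
    return base and not excluded


# Assumed lemma for all four word forms: 'template'.
for word in ["template", "templates", "Template", "Templates"]:
    print(word, rule_matches(word, ["template"]))
```

Under these assumptions the lower-case forms match and the capitalized forms are excluded, which is the behaviour the examples in the rule expect.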

MikeUnwalla commented 7 years ago

Jaume, thanks for your alternative. (Yes, for one rule, I wanted to exclude a capitalized word.)

Given that the problem is caused by incomplete documentation, I will remove the 'bug' label.