languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[en] Different sentence splitting in the GUI and the command line #6318

Open MikeUnwalla opened 2 years ago

MikeUnwalla commented 2 years ago

The different sentence splitting causes inconsistent analysis of text.

Example sentences:

The cat sat on the mat.
The dog ate a bone.
Then another cat sat on the mat.

The GUI shows 3 sentences.

If a sentence contains 2 instances of the word 'cat', this rule finds the first instance:

      <rule id="TEST_SENTENCE_SPLIT" name="test sentence splitting">
        <pattern>
          <marker>
            <token skip="-1">cat</token>
          </marker>
          <token>cat</token>
        </pattern>
        <message>Found the first cat.</message>
        <example type="incorrect">My <marker>cat</marker> and your cat both sat on a mat.</example>
        <example type="correct">Then another cat sat on the mat.</example>
      </rule>

Put the example sentences into a file (test-data.txt) and use the command line to analyse the sentences. The results from the command line give a false warning:

D:\LanguageTool-5.7-SNAPSHOT>java -jar languagetool-commandline.jar -l en-US -eo -e TEST_SENTENCE_SPLIT test-data.txt
Expected text language: English (US) (no spell checking active)
Working on test-data.txt...
1.) Line 1, column 5, Rule ID: TEST_SENTENCE_SPLIT[1]
Message: Found the first cat.
?The cat sat on the mat.  The dog ate a bone.  Then a...
     ^^^
Time: 1638ms for 1 sentences (0.6 sentences/sec)

The sentence splitting is different in the command line analysis. There is only one sentence:

D:\LanguageTool-5.7-SNAPSHOT>java -jar languagetool-commandline.jar -l en-US -t -eo -e TEST_SENTENCE_SPLIT test-data.txt
Expected text language: English (US) (no spell checking active)
Working on test-data.txt...
<S>  The[the/DT] cat[cat/NN,E-NP-singular] sat[sit/VBD,B-VP] on[on/IN,on/JJ,on/RP,B-PP] the[the/DT,B-NP-singular] mat[mat/JJ,mat/NN,E-NP-singular].[./.,./PCT,O]  The[the/DT,B-NP-singular] dog[dog/NN,E-NP-singular] ate[ate/NN,eat/VBD,B-VP] a[a/DT,B-NP-singular] bone[bone/NN:UN,E-NP-singular].[./.,./PCT,O]  My[I/PRP$_A1S,my/PRP$,B-NP-singular] cat[cat/NN,E-NP-singular] likes[like/NNS,like/VBZ,B-VP] your[you/PRP$_A2S,you/PRP$_A2P,your/PRP$,B-NP-singular] dog[dog/NN,E-NP-singular].[./.,</S>./PCT,O]

SpaceIshtar commented 2 years ago

Hi, I've reproduced and investigated this problem. I found that when we use the GUI, the input string uses \n as the indicator of the next line. However, when the command line reads the .txt file, the string uses \r\n as the indicator of the next line, which the current rules do not recognize.

I think the simplest way to solve this problem is to replace "\r" with "" somewhere in the code. So I tried replacing "\r" with "" in the method static List<String> tokenize(String text, SrxDocument srxDocument, String code) in SrxTools.java.

After that, I rebuilt the project and found that it works without introducing any new bugs (all the tests pass with mvn clean test).

May I work on this issue and open a pull request to fix it? If there is a better solution to this issue, please let me know so that I can solve the problem in a better way.
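
The workaround described in this comment amounts to normalizing Windows line endings before the text reaches the sentence tokenizer. A minimal sketch of that idea follows. `LineEndingNormalizer` and `normalize` are hypothetical names chosen for illustration and are not LanguageTool code; where such a step would best live in the project is not settled in this thread. Unlike the experiment above, which replaced every "\r" with "", this sketch converts only the CRLF pair, so a lone carriage return is left untouched.

```java
public class LineEndingNormalizer {

    // Hypothetical helper (not part of LanguageTool): converts each CRLF pair
    // into a single LF so that text read from a Windows file looks like the
    // text that the GUI passes on.
    public static String normalize(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String fromWindowsFile = "The cat sat on the mat.\r\nThe dog ate a bone.\r\n";
        String normalized = normalize(fromWindowsFile);

        // No carriage returns remain after normalization.
        System.out.println(normalized.contains("\r"));                      // false
        // Two characters were dropped, one per CRLF pair.
        System.out.println(fromWindowsFile.length() - normalized.length()); // 2
    }
}
```

One design point to keep in mind: dropping characters shifts every later character offset, so positions reported against the normalized string may need to be mapped back to the original file. The sketch does not handle that.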

danielnaber commented 2 years ago

> I found that when we use GUI, the input string uses \n as the indicator of the next line. However, when command line reads the txt file, it uses \r\n as the indicator of the next line

Does it do that explicitly? I guess it's just the default on Windows?

SpaceIshtar commented 2 years ago

> > I found that when we use GUI, the input string uses \n as the indicator of the next line. However, when command line reads the txt file, it uses \r\n as the indicator of the next line
>
> Does it do that explicitly? I guess it's just the default on Windows?

Yes, you are right. So on Linux, this bug does not occur.
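
To check which line endings a given input actually contains, a small diagnostic along the following lines may help. It is only a sketch: `LineEndingReport`, `countCrlf`, and `countBareLf` are hypothetical names, and the sample strings stand in for the contents of a file such as test-data.txt.

```java
public class LineEndingReport {

    // Hypothetical helper: counts CRLF pairs (Windows-style line endings).
    public static int countCrlf(String text) {
        int count = 0;
        int index = text.indexOf("\r\n");
        while (index >= 0) {
            count++;
            index = text.indexOf("\r\n", index + 2);
        }
        return count;
    }

    // Hypothetical helper: counts LF characters that are not preceded by CR
    // (Unix-style line endings).
    public static int countBareLf(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n' && (i == 0 || text.charAt(i - 1) != '\r')) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String windowsStyle = "The cat sat on the mat.\r\nThe dog ate a bone.\r\n";
        String unixStyle = "The cat sat on the mat.\nThe dog ate a bone.\n";

        // prints "2 CRLF, 0 bare LF"
        System.out.println(countCrlf(windowsStyle) + " CRLF, " + countBareLf(windowsStyle) + " bare LF");
        // prints "0 CRLF, 2 bare LF"
        System.out.println(countCrlf(unixStyle) + " CRLF, " + countBareLf(unixStyle) + " bare LF");
    }
}
```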

MikeUnwalla commented 2 years ago

I tried to make a disambiguation rule that applies a SENT_END and a SENT_START postag to text that contains a line feed (LF). I could not. To try to understand why I could not make a disambiguation rule, I made a grammar rule to find the end of a sentence:

    <rule id="FIND_SENTENCE_END" name="Find the end of a sentence">
      <pattern>
        <marker>
          <token regexp="yes">\u000A</token>
        </marker>
      </pattern>
      <message>Found the end of a sentence.</message>
      <example type="incorrect">The cat is on the mat.
<marker/></example>
    </rule>

LF = U+000A (https://www.compart.com/en/unicode/U+000A).

When I run testrules, I get this message:

Checking regexp syntax of 5540 rules for English...
*** WARNING: The English rule: FIND_SENTENCE_END[1], token [1], contains "\u000A" that is marked as regular expression but probably is not one.

I found this message from 2012 that discusses a similar problem: https://forum.languagetool.org/t/searching-for-specific-unicode-characters/116

@SpaceIshtar wrote, "I'm wondering whether or not I can work on this issue and open a pull request." Yes please. I would be very grateful if you could find a solution.

@languagetool-org/developers. It would be really nice if LT let users find any Unicode characters that they want to find.
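
As a point of comparison for the testrules warning above: the thread does not say why that warning fires, but in plain `java.util.regex` the escape used in the FIND_SENTENCE_END rule does denote a line feed. The sketch below shows only that standard-library behaviour. `LineFeedRegex` and `countLineFeeds` are hypothetical names, and whether a line feed ever reaches a `<token>` pattern depends on how LanguageTool tokenizes whitespace, which this example does not cover.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineFeedRegex {

    // The string literal holds a backslash followed by "u000A"; the regex
    // engine (not the Java compiler) resolves it to the code point U+000A.
    private static final Pattern LINE_FEED = Pattern.compile("\\u000A");

    // Hypothetical helper: counts line feed characters using the regex.
    public static int countLineFeeds(String text) {
        Matcher matcher = LINE_FEED.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The cat is on the mat.\nThe dog ate a bone.\n";
        System.out.println(countLineFeeds(text)); // 2
    }
}
```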