Open MikeUnwalla opened 2 years ago
Hi,
I've reproduced and investigated this problem.
I found that when we use the GUI, the input string uses \n as the line-break indicator. However, when the command line reads the .txt file, the string uses \r\n as the line-break indicator, which the current rules do not recognize.
The following screenshots show the difference:
The text string using GUI:
The text string using command line:
I think the simplest way to solve this problem is to replace "\r" with "" somewhere in the code.
So I tried replacing "\r" with "" in the method static List<String> tokenize(String text, SrxDocument srxDocument, String code) in SrxTools.java.
After that, I rebuilt the project and found that it works without introducing any new bugs (all tests pass with mvn clean test).
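A minimal sketch of the kind of normalization described above, shown in isolation. The class and method names are hypothetical and are not part of LanguageTool; the actual change would be made inside the tokenize method in SrxTools.java, whose body is not shown in this thread.

```java
// Hypothetical helper for illustration only; not part of LanguageTool's API.
final class LineEndingNormalizer {

    private LineEndingNormalizer() {
    }

    /** Removes every carriage return, so CRLF line breaks become plain LF. */
    static String stripCarriageReturns(String text) {
        return text.replace("\r", "");
    }

    public static void main(String[] args) {
        // Text as it might arrive from a file saved with Windows line endings.
        String fromWindowsFile = "First sentence.\r\n\r\nSecond sentence.";
        String normalized = stripCarriageReturns(fromWindowsFile);
        System.out.println(normalized.contains("\r")); // prints "false"
    }
}
```

One thing to keep in mind: removing characters shortens the string, so any positions computed on the normalized text are shifted relative to the original file. Whether that matters depends on where the replacement is applied.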
I'm wondering whether or not I can work on this issue and open a pull request to solve this problem. If there are better solutions to this issue, please let me know so that I can solve the problem in a better way.
@SpaceIshtar wrote, "I found that when we use GUI, the input string uses \n as the indicator of the next line. However, when command line reads the txt file, it uses \r\n as the indicator of the next line."
Does it do that explicitly? I guess it's just the default on Windows?
Yes, you are right. So on Linux, this bug does not occur at all.
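The effect of the Windows line ending can be reproduced in isolation. The sketch below uses a naive split on two consecutive line feeds as a stand-in for a paragraph-break rule; it is an assumption made for illustration, not LanguageTool's actual SRX rules, and the class and method names are hypothetical.

```java
// Hypothetical demo class; the splitter is a naive stand-in, not LanguageTool's SRX logic.
final class LineSeparatorDemo {

    private LineSeparatorDemo() {
    }

    /** Splits text wherever two line feeds appear directly next to each other. */
    static String[] splitOnBlankLine(String text) {
        return text.split("\n\n");
    }

    public static void main(String[] args) {
        String lfText = "One.\n\nTwo.";       // LF line breaks
        String crlfText = "One.\r\n\r\nTwo."; // CRLF line breaks, as in a file saved on Windows

        System.out.println(splitOnBlankLine(lfText).length);   // prints 2
        System.out.println(splitOnBlankLine(crlfText).length); // prints 1
    }
}
```

In the CRLF text, the carriage return sits between the two line feeds, so a rule that expects them to be adjacent never matches.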
I tried to make a disambiguation rule that applies a SENT_END and a SENT_START postag to text that contains a line feed (LF). I could not. To try to understand why I could not make a disambiguation rule, I made a grammar rule to find the end of a sentence:
<rule id="FIND_SENTENCE_END" name="Find the end of a sentence">
<pattern>
<marker>
<token regexp="yes">\u000A</token>
</marker>
</pattern>
<message>Found the end of a sentence.</message>
<example type="incorrect">The cat is on the mat.
<marker/></example></rule>
LF = U+000A (https://www.compart.com/en/unicode/U+000A).
When I run testrules, I get this message:
Checking regexp syntax of 5540 rules for English...
*** WARNING: The English rule: FIND_SENTENCE_END[1], token [1], contains "\u000A" that is marked as regular expression but probably is not one.
I found this message from 2012 that discusses a similar problem: https://forum.languagetool.org/t/searching-for-specific-unicode-characters/116
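For comparison, plain java.util.regex does accept \u000A as an escape for the line feed character, as the sketch below shows. This only demonstrates the behaviour of Java's regex engine; whether LT's tokenizer ever passes a line feed to a token matcher is a separate question that this thread does not answer. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it uses only the standard java.util.regex API.
final class LineFeedRegexDemo {

    // The doubled backslash passes the escape through to the regex engine,
    // which resolves it to the line feed character (U+000A).
    static final Pattern LINE_FEED = Pattern.compile("\\u000A");

    private LineFeedRegexDemo() {
    }

    /** Returns true if the text contains at least one line feed. */
    static boolean containsLineFeed(String text) {
        return LINE_FEED.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLineFeed("The cat is on the mat.\n")); // prints "true"
        System.out.println(containsLineFeed("The cat is on the mat."));   // prints "false"
    }
}
```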
@SpaceIshtar wrote, "I'm wondering whether or not I can work on this issue and open a pull request." Yes please. I would be very grateful if you could find a solution.
@languagetool-org/developers. It would be really nice if LT let users find any Unicode characters that they want to find.
The different sentence splitting causes inconsistent analysis of text.
Example sentences:
The GUI shows 3 sentences:
If a sentence contains 2 instances of the word 'cat', this rule finds the first instance:
Put the example sentences into a file (test-data.txt) and use the command line to analyse the sentences. The results from the command line give a false warning:
The sentence splitting is different in the command line analysis. There is only one sentence: