textlint-rule / sentence-splitter

Split {Japanese, English} text into sentences.
https://sentence-splitter.netlify.app/
MIT License
118 stars 14 forks source link

Sentences with "line indexes" split unexpectedly. #35

Closed gonnavis closed 1 year ago

gonnavis commented 1 year ago

Describe the bug

Sentences with "line indexes" split unexpectedly.

Text

For the fourth choice question on the JavaScript quiz, the choices are:

1. Throws an error
2. Ignores the statements
3. Gives a warning
4. None of the above

Feel free to let me know if you need help with the correct answer or if you have any other question!

Actual Result

For the fourth choice question on the JavaScript quiz, the choices are:\n\n1. ` Throws an error\n2. Ignores the statements\n3. Gives a warning\n4. None of the above\n\nFeel free to let me know if you need help with the correct answer or if you have any other question!`

Expected Result

For the fourth choice question on the JavaScript quiz, the choices are: 1. Throws an error 2. Ignores the statements 3. Gives a warning 4. None of the above Feel free to let me know if you need help with the correct answer or if you have any other question!

gonnavis commented 1 year ago

Found there are two reasons:

azu commented 1 year ago

/^\d+\. / appears to need to be interpreted as a syntax with special meaning. Some words, such as Mr., have already been processed so that they are not sentence-breakers there. I feel that 1. will need similar treatment.

https://github.com/textlint-rule/sentence-splitter/blob/fa8f67138d6702aba7f6e8032ac8fdd542226af1/src/parser/AbbrMarker.ts#L72-L121

📝 In markdown, 1. is list syntax. splitAST creates sentences from ASTs parsed from the Markdown parser, so there should be no problem. split treat plain text and cause this issue.

gonnavis commented 1 year ago

Hello @azu , I have another question that, why ignore \n totally?

https://sentence-splitter.netlify.app/#We%20are%20talking%20about%20pens%0AHe%20said%20%22This%20is%20a%20pen.%20I%20like%20it%22%0AI%20could%20relate%20to%20that%20statement.

Input, three lines of sentences without ending . but with ending line breaker \n:

We are talking about pens
He said "This is a pen. I like it"
I could relate to that statement.

Output, only one sentence:

We are talking about pens\nHe said \"This is a pen. I like it\"\nI could relate to that statement.

Is this repo only focusing on a single paragraph?

gonnavis commented 1 year ago

Oh, already parsed \n, but seems it has lower priority than seperator .

image

gonnavis commented 1 year ago

Achieved by call close with newline:

image

But still can't understand why not split newLine ( \n ) first by default. Is it in order to prevent split texts with wrongly returned sentence, such as

This is an
apple.

?

gonnavis commented 1 year ago

Hello @azu , made PR to not split line indexes https://github.com/textlint-rule/sentence-splitter/pull/36

azu commented 1 year ago

Thanks! I'll look it in weekend.