textlint-rule / sentence-splitter

Split {Japanese, English} text into sentences.
https://sentence-splitter.netlify.app/
MIT License

~~Nesting Sentences Support~~ → Add Context for nested sentence #27

Closed azu closed 1 year ago

azu commented 2 years ago
He said "This is a pen. That is not a pen.".

Currently, sentence-splitter parses the text as follows:

image

We want to support nesting sentences.

image

PairMaker is related: https://github.com/azu/sentence-splitter/blob/master/src/parser/PairMaker.ts

AST Design

[
  {
    "type": "Sentence",
    "raw": "He said 「This is a pen. That is not a pen」.",
    "children": [
      {
        "type": "Str",
        "raw": "He said ",
        "value": "He said "
      },
      {
        "type": "PairMark",
        "pairType": "start",
        "raw": "「",
        "value": "「"
      },
      {
        "type": "Sentence",
        "raw": "This is a pen.",
        "children": [
          {
            "type": "Str",
            "raw": "This is a pen.",
            "value": "This is a pen."
          }
        ]
      },
      {
        "type": "Sentence",
        "raw": "That is not a pen",
        "children": [
          {
            "type": "Str",
            "raw": "That is not a pen",
            "value": "That is not a pen"
          }
        ]
      },
      {
        "type": "PairMark",
        "pairType": "end",
        "raw": "」",
        "value": "」"
      },
      {
        "type": "Punctuation",
        "raw": ".",
        "value": "."
      }
    ]
  }
]
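
To make the proposed shape easier to reason about, here is a minimal TypeScript sketch of the node types and a walker that collects the nested sentences. The type names (`SketchNode`, `SentenceNode`, and so on) and the `collectNestedSentences` helper are hypothetical, written only for this illustration. They are not the types that sentence-splitter exports.

```typescript
// Hypothetical node types mirroring the AST sketch in this issue.
// These are assumptions for illustration, not the library's published types.
type StrNode = { type: "Str"; raw: string; value: string };
type PunctuationNode = { type: "Punctuation"; raw: string; value: string };
type PairMarkNode = {
  type: "PairMark";
  pairType: "start" | "end";
  raw: string;
  value: string;
};
type SentenceNode = { type: "Sentence"; raw: string; children: SketchNode[] };
type SketchNode = StrNode | PunctuationNode | PairMarkNode | SentenceNode;

// Collect the raw text of every Sentence nested inside `node`, depth-first.
function collectNestedSentences(node: SentenceNode): string[] {
  const found: string[] = [];
  for (const child of node.children) {
    if (child.type === "Sentence") {
      found.push(child.raw, ...collectNestedSentences(child));
    }
  }
  return found;
}

// The example from this issue, written out as a value of the sketched type.
const nestedExample: SentenceNode = {
  type: "Sentence",
  raw: "He said 「This is a pen. That is not a pen」.",
  children: [
    { type: "Str", raw: "He said ", value: "He said " },
    { type: "PairMark", pairType: "start", raw: "「", value: "「" },
    {
      type: "Sentence",
      raw: "This is a pen.",
      children: [{ type: "Str", raw: "This is a pen.", value: "This is a pen." }],
    },
    {
      type: "Sentence",
      raw: "That is not a pen",
      children: [{ type: "Str", raw: "That is not a pen", value: "That is not a pen" }],
    },
    { type: "PairMark", pairType: "end", raw: "」", value: "」" },
    { type: "Punctuation", raw: ".", value: "." },
  ],
};

console.log(collectNestedSentences(nestedExample));
// → ["This is a pen.", "That is not a pen"]
```

With a discriminated union on `type`, a rule could tell a nested Sentence apart from plain Str content without inspecting the text itself.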


azu commented 2 years ago

I noticed that some cases are difficult.

彼は「コレ」と読んだ

I think that 「コレ」 is not a sentence here (the text reads roughly: He read it as "コレ"), but sentence-splitter cannot detect that.

azu commented 1 year ago

It should be an opt-in feature.

We cannot detect which interpretation is better.

azu commented 1 year ago

These are probably rule implementation bugs, not bugs in this library.

azu commented 1 year ago
We are talking about pens.
He said "This is a pen. I like it".
I could relate to that statement.

The current parser parses it into the following sentences:

image

The second sentence contains "This is a pen. I like it", but we cannot split it into new sentences.

The conversation text is just a Str node. HTML does not have suitable semantics for conversation either.

As a result, sentence-splitter cannot support nested sentences. The rule implementation should probably handle the quoted text itself, after sentence-splitter has parsed the sentences.
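
A rough sketch of what that rule-side handling might look like, in TypeScript. Everything here is hypothetical: `quotedSentences`, `SplitFn`, and `naiveSplit` are names made up for this illustration, and `naiveSplit` is a deliberately simple stand-in. A real rule would presumably pass in its own splitter (for example a wrapper around this library) as `splitFn`, but that wiring is an assumption and is not shown.

```typescript
// Hypothetical rule-side helper; not part of sentence-splitter.
// `SplitFn` stands in for whatever splitter the rule actually uses.
type SplitFn = (text: string) => string[];

// Naive stand-in splitter for this sketch only:
// cuts after runs of ".", "!" or "?" and trims whitespace.
const naiveSplit: SplitFn = (text) =>
  (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

type QuotedSpan = { quote: string; sentences: string[] };

// Find text inside "..." or 「...」 in an already-split sentence,
// then split each quoted span again with `splitFn`.
function quotedSentences(sentenceRaw: string, splitFn: SplitFn = naiveSplit): QuotedSpan[] {
  const spans: QuotedSpan[] = [];
  const quotePattern = /"([^"]*)"|「([^」]*)」/g;
  let m: RegExpExecArray | null;
  while ((m = quotePattern.exec(sentenceRaw)) !== null) {
    // Group 1 is set for "..." and group 2 for 「...」.
    const quote = m[1] ?? m[2] ?? "";
    spans.push({ quote, sentences: splitFn(quote) });
  }
  return spans;
}

console.log(quotedSentences('He said "This is a pen. I like it".'));
// → [{ quote: "This is a pen. I like it", sentences: ["This is a pen.", "I like it"] }]
```

For 彼は「コレ」と読んだ this returns a single fragment, コレ, and the rule still has to decide whether that counts as a sentence. That decision is the part the parser cannot make.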

We will close this issue by adding the current behavior.