megagonlabs / ginza

A Japanese NLP Library using spaCy as framework based on Universal Dependencies
MIT License

1+ newline characters in input string become 1 white space in doc.text. #113

Closed TomokiMatsuno closed 4 years ago

TomokiMatsuno commented 4 years ago

When performing sequence labeling on a document with multiple paragraphs, I found an inconsistency in how newline characters are handled in doc.text between en_core_web_sm (version: 2.2.5) and ja_ginza (version: 3.1.0).

As the examples below show, one or more consecutive newline characters in the input string become a single whitespace character in doc.text with ja_ginza, while they are preserved unchanged with en_core_web_sm.

This makes it difficult to label a document while preserving its paragraph structure.
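
One possible workaround, which is not part of the original report, is to split the input on newline runs before parsing and to keep each paragraph's character offset, so that labels can be mapped back onto the original string. The sketch below uses only the standard library; `iter_paragraphs` and `to_document_span` are hypothetical helper names, and each paragraph is assumed to be passed to `nlp` separately.

```python
import re
from typing import Iterator, Tuple


def iter_paragraphs(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, paragraph) for each newline-free stretch of text."""
    for match in re.finditer(r"[^\n]+", text):
        yield match.start(), match.group()


def to_document_span(paragraph_start: int, start_char: int, end_char: int) -> Tuple[int, int]:
    """Shift a span measured inside one paragraph to offsets in the full text."""
    return paragraph_start + start_char, paragraph_start + end_char


text = "これが、\n\n東京。"
for offset, paragraph in iter_paragraphs(text):
    # In real use, each paragraph would be parsed here, e.g. doc = nlp(paragraph)
    print(offset, paragraph)
```

Because the newlines never reach the model, the paragraph boundaries are tracked by the offsets alone, regardless of how the model treats whitespace.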

Snippet for loading a model and parsing input text

import spacy

nlp = spacy.load(model_file)  # model_file: "en_core_web_sm" or "ja_ginza"
doc = nlp(input_string)

Input string and doc.text

model: en_core_web_sm (version: 2.2.5)

input string: This is,\nTokyo.
doc.text: This is,\nTokyo.
input string: This is,\n\nTokyo.
doc.text: This is,\n\nTokyo.

model: ja_ginza (version: 3.1.0)

input string: これが、\n東京。
doc.text: これが、\s東京。
input string: これが\n\n東京。
doc.text: これが\s東京。
input string: これが\n\n\n東京。
doc.text: これが\s東京。
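
The comparison above can be automated with a small check like the following. This is a minimal sketch that works on plain strings, so it does not depend on spaCy itself. `report_newline_handling` is a hypothetical helper name, and the example assumes that `\s` in the listing above stands for a single space character.

```python
def report_newline_handling(input_string: str, doc_text: str) -> dict:
    """Summarise whether doc_text kept the newline characters of input_string."""
    return {
        "identical": input_string == doc_text,
        "newlines_in_input": input_string.count("\n"),
        "newlines_in_doc_text": doc_text.count("\n"),
    }


# Values taken from the ja_ginza example; in real use, pass doc.text instead.
result = report_newline_handling("これが\n\n東京。", "これが 東京。")
print(result)
# {'identical': False, 'newlines_in_input': 2, 'newlines_in_doc_text': 0}
```
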

hiroshi-matsuda-rit commented 4 years ago

Thank you for reporting this unexpected behavior, @TomokiMatsuno. I'd like to fix it in the next major version, after spaCy v2.3.1 is released.

hiroshi-matsuda-rit commented 4 years ago

Sorry for the late reply. I tested these input strings with GiNZA v4 and confirmed that the issue is fixed. Thank you again, @TomokiMatsuno!