About Preprocessing in `Juman.apply_to_sentence`

ku-nlp / rhoknp

Yet another Python binding for Juman++/KNP/KWJA

https://rhoknp.readthedocs.io/en/latest/

MIT License


About Preprocessing in `Juman.apply_to_sentence` #121

Status: Closed (closed by tealgreen0503 1 year ago)

tealgreen0503 commented 1 year ago

It appears that some preprocessing takes place when performing morphological analysis with `Jumanpp.apply_to_sentence`. For example, half-width spaces are replaced with full-width spaces, and line breaks are removed.

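```python
import rhoknp

juman = rhoknp.Jumanpp()

# Leading half-width space ("This is a half-width space."):
# it comes back as a full-width space (U+3000).
text = " これは半角スペースです。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['\u3000', 'これ', 'は', '半角', 'スペース', 'です', '。']

# Leading line break ("This is a line break."): it is dropped from the output.
text = "\nこれは改行です。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['これ', 'は', '改行', 'です', '。']
```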

Are there any other preprocessing steps like these?

hkiyomaru commented 1 year ago

The preprocessing steps performed by `Jumanpp.apply_to_sentence` include:

- Replacing half-width spaces with full-width spaces.
- Replacing straight double quotation marks (") with curved double quotation marks (”).
- Removing line breaks.
- Removing carriage returns.
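
For readers who want to predict what the analyzer will receive, the four steps listed above can be approximated in plain Python. This is only a sketch based on that list, not rhoknp's actual implementation: the function name `preprocess_like_jumanpp` is hypothetical, the order of the replacements is an assumption, and the behavior may differ between rhoknp versions.

```python
def preprocess_like_jumanpp(text: str) -> str:
    """Approximate the preprocessing steps listed in this thread.

    Hypothetical helper: not part of rhoknp, and the order of steps is assumed.
    """
    # Half-width space -> full-width space (U+3000).
    text = text.replace(" ", "\u3000")
    # Straight double quote -> curved double quote (U+201D).
    text = text.replace('"', "\u201d")
    # Drop carriage returns and line breaks.
    text = text.replace("\r", "")
    text = text.replace("\n", "")
    return text


if __name__ == "__main__":
    # repr() makes the full-width space visible as '\u3000'.
    print(ascii(preprocess_like_jumanpp(" これは半角スペースです。")))
    print(ascii(preprocess_like_jumanpp("\nこれは改行です。")))
```

Comparing the output of this helper with the concatenated `surf` values of the returned morphemes is one way to check whether your installed version still behaves as described here.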

hkiyomaru commented 1 year ago

Note that sentences beginning with `#` are treated as comments and are not parsed. The Juman++ developer has proposed a workaround, described in this GitHub issue. rhoknp does not apply this workaround as a preprocessing step, so if you need this functionality, you will have to implement it yourself.
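
Since the workaround has to live in your own code, here is one possible shape for it. This is an assumption for illustration and not necessarily the approach proposed in the linked issue: the hypothetical helper `escape_leading_hash` swaps a leading half-width `#` for its full-width counterpart `＃` (U+FF03) so that the sentence no longer starts with the comment marker. Note that this changes the surface form of the first character.

```python
FULL_WIDTH_HASH = "\uff03"  # '＃', the full-width form of '#'


def escape_leading_hash(sentence: str) -> str:
    """Replace a leading '#' so the sentence is not mistaken for a comment.

    Hypothetical helper: not part of rhoknp or Juman++, and not necessarily
    the workaround from the linked issue.
    """
    if sentence.startswith("#"):
        return FULL_WIDTH_HASH + sentence[1:]
    # Sentences that do not start with '#' are returned unchanged.
    return sentence


if __name__ == "__main__":
    print(ascii(escape_leading_hash("#ハッシュタグです。")))
    print(ascii(escape_leading_hash("これは#1です。")))
```

You would call such a helper on each sentence before passing it to `apply_to_sentence`, and map the full-width character back afterwards if the original surface form matters downstream.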

hkiyomaru commented 1 year ago

Let's continue this discussion at https://github.com/ku-nlp/jumanpp/discussions/154.

hkiyomaru commented 1 year ago

https://github.com/ku-nlp/rhoknp/pull/123 will fix the handling of half-width spaces and straight double quotation marks.