Closed by tealgreen0503 1 year ago
The preprocessing steps performed by Jumanpp.apply_to_sentence include:
Note that sentences beginning with # are treated as comments and are not parsed. The Juman++ developer has proposed a workaround for this, which can be found in this GitHub issue. rhoknp does not apply this workaround as a preprocessing step. If you need this functionality, you will have to implement the workaround yourself.
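The workaround from the linked issue is not reproduced here. As a rough sketch of one possible approach (an assumption, not necessarily what the Juman++ developer proposed), a leading half-width # could be swapped for its full-width counterpart before the text is passed to the analyzer. The helper name `escape_leading_hash` is hypothetical and is not part of rhoknp or Juman++.

```python
FULL_WIDTH_HASH = "\uff03"  # full-width number sign "＃"


def escape_leading_hash(sentence: str) -> str:
    """Hypothetical helper: keep a sentence from being read as a comment.

    Assumption: replacing a leading half-width "#" with the full-width
    form stops the line from being treated as a comment. This is an
    illustration only, not the workaround from the Juman++ issue.
    """
    if sentence.startswith("#"):
        return FULL_WIDTH_HASH + sentence[1:]
    return sentence


escaped = escape_leading_hash("#rhoknp is handy")
```

Note that this changes the surface form of the first character, so any downstream code that compares the analyzed text with the original input would need to account for that.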
Let's carry on this discussion on https://github.com/ku-nlp/jumanpp/discussions/154.
https://github.com/ku-nlp/rhoknp/pull/123 will fix the handling of half-width spaces and straight double quotation marks.
It appears that some preprocessing takes place when performing morphological analysis with Jumanpp.apply_to_sentence. For example, half-width spaces are replaced with full-width spaces, and line breaks are removed. Are there any other preprocessing steps like these?
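To make the two observed behaviours concrete, here is a minimal plain-Python sketch that mimics them. It is only an approximation of what was observed, not rhoknp's actual implementation, and the function name `approximate_preprocess` is hypothetical.

```python
FULL_WIDTH_SPACE = "\u3000"  # ideographic (full-width) space


def approximate_preprocess(text: str) -> str:
    """Hypothetical helper mimicking the two observed steps only."""
    # Step 1: remove line breaks (both "\r" and "\n").
    text = text.replace("\r", "").replace("\n", "")
    # Step 2: replace half-width spaces (U+0020) with full-width spaces.
    return text.replace(" ", FULL_WIDTH_SPACE)


result = approximate_preprocess("Hello world\nfoo")
```

Comparing the output of such a function with the text of the analyzed sentence may be a quick way to check whether any further preprocessing is happening.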