reazon-research / ReazonSpeech

Massive open Japanese speech corpus
https://research.reazon.jp/projects/ReazonSpeech/
Apache License 2.0

About HANKAKU and ZENKAKU substitution #3

Status: Closed (sejimak closed this issue 1 year ago)

sejimak commented 1 year ago

Thank you very much for your great work. After reviewing the source code below, I thought there was a concise way to write it in the HANKAKU to ZENKAKU conversion section.

ReazonSpeech/reazonspeech/text.py

You define it as follows:

_HAN2ZEN = str.maketrans(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
    "ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ０１２３４５６７８９")

...
return text.translate(_SPECIALS).translate(_HAN2ZEN)

However, since espnet is required to use this tool, its dependency jaconv should already be installed. I therefore think this code could be written as follows:

return jaconv.h2z(text, kana=True, digit=True, ascii=True)

I hope this is helpful.

fujimotos commented 1 year ago

Thank you for your suggestion!

> After reviewing the source code below, I thought there was a concise way to write it in the HANKAKU to ZENKAKU conversion section.

Yes, normalization is definitely one of the areas where ReazonSpeech can improve.

We hand-crafted a very basic normalization rule for the initial version. However, we are considering revisiting and expanding the normalization rules.

We'll take your suggestion into consideration. Thank you again for your feedback!