rhasspy / gruut

A tokenizer, text cleaner, and phonemizer for many human languages.
MIT License

Assertion errors for persian and swahili languages #6

Closed by mbarnig 2 years ago

mbarnig commented 3 years ago

I installed gruut version 1.2.1 on my desktop PC with pip install gruut[fr,it,de,pt,de,sv,cs,es,nl,ru,fa,sw] to test all supported languages by running the following script:

echo '<northwind and sun fable in specified language>' \
    | python3 -m gruut <language> tokenize \
    | python3 -m gruut <language> phonemize \
    | jq -c .pronunciation_text

It works as expected for en, en-us, de, de-de, sv, sv-se, pt, pt-br, it, it-it, nl, es, es-es, cs, cs-cz, ru, ru-ru.
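
To repeat this check for several languages without retyping the pipeline, the two gruut stages could be driven from Python. This is only a minimal sketch: it assumes nothing beyond the `python3 -m gruut <language> tokenize` and `phonemize` subcommands shown above, the helper names `gruut_commands` and `run_pipeline` are made up for illustration, and the `jq` step is left out.

```python
import subprocess
import sys


def gruut_commands(lang):
    """Return the argv lists for the tokenize and phonemize stages."""
    base = [sys.executable, "-m", "gruut", lang]
    return [base + ["tokenize"], base + ["phonemize"]]


def run_pipeline(lang, text):
    """Feed text through both stages, stopping at the first stage that fails.

    Returns (returncode, stdout, stderr) of the last stage that was run.
    """
    data = text
    result = None
    for argv in gruut_commands(lang):
        # encoding="utf-8" keeps non-Latin input intact regardless of locale.
        result = subprocess.run(
            argv, input=data, capture_output=True, encoding="utf-8"
        )
        if result.returncode != 0:
            break
        data = result.stdout
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Short excerpts of the fable texts used in this report.
    samples = {"fa": "باد شمال و خورشید", "sw": "Kaskazini Upepo na jua"}
    for lang, text in samples.items():
        code, out, err = run_pipeline(lang, text)
        status = "ok" if code == 0 else "FAILED"
        print(f"{lang}: {status}")
        if code != 0 and err.strip():
            # Show only the last line of the traceback, e.g. AssertionError.
            print("   ", err.strip().splitlines()[-1])
```

Running each stage separately makes it easy to see which of the two commands failed for a given language.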

For fa (Persian), assertion errors are raised at lines 118 and 184 of lang.py. Here is the related log:

(rhasspy-gruut) mbarnig@mbarnig-MS-7B22:~/rhasspy-gruut$ echo 'باد شمال و خورشید داشتن سر اینکه کدوم قوی‌تر هستند بحث می‌کردن که یک‌دفعه یه مسافر که خودش رو در بالاپوش گرمی پوشونده بود پیداش شد. قرار گذاشتن که هر کدوم که بتونه اوّل مسافر رو مجبور به در آوردن بالاپوشش بکنه قوی‌تر از اون‌یکیه. بعد باد شمال به شدیدترین صورتی که می‌تونست شروع به وزیدن کرد، ولی هرچقدر سخت‌تر می‌وزید، مسافر بالاپوش رو محکم‌تر به دور خودش می‌پیچید. در آخر، باد شمال پشیمون شد و دست برداشت. بعد، خورشید شروع کرد به گرمی تابیدن، و مسافر بلافاصله بالاپوشش رو در آورد. به همین خاطر، باد شمال مجبور شد اعتراف کنه که بین اونها، خورشید قوی‌تره.' \
>     | python3 -m gruut fa tokenize \
>     | python3 -m gruut fa phonemize \
>     | jq -c .pronunciation_text 
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 261, in <module>
    main()
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 54, in main
    args.func(args)
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 114, in do_phonemize
    phonemizer = get_phonemizer(
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/lang.py", line 184, in get_phonemizer
    assert lang_dir is not None
AssertionError
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 261, in <module>
    main()
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 54, in main
    args.func(args)
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 69, in do_tokenize
    tokenizer = get_tokenizer(
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/lang.py", line 118, in get_tokenizer
    assert lang_dir is not None
AssertionError

For sw (Swahili), assertion errors are raised at lines 146 and 184 of lang.py. Here is the related log:

(rhasspy-gruut) mbarnig@mbarnig-MS-7B22:~/rhasspy-gruut$ echo 'Kaskazini Upepo na jua wali kuwa wana shindana gani iko na nguvuu kushinda mwingine, msafiri aka kuja na alikuwa anavaa koti mzito. Wali kubaliana mtu ya kwanza kutoa koti ya msafiri ndio akona nguvu kushinda ingine. Upepo ya kaskazini ika jaribu kupiga upepo yake yote, lakini akaona vigumu yake inapiga, zaidi msafiri anafunga koti yake karibu naye, mpaka upepo ya kaskazini ikajishinda. Jua ikaanza ku ngua, mpaka msafiri akatoa koti yake mara moja. Sasa Upepo ya Kaskazini ika kubali jua ikona nguvu kuishinda.' \
>     | python3 -m gruut sw tokenize \
>     | python3 -m gruut sw phonemize \
>     | jq -c .pronunciation_text  
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 261, in <module>
    main()
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 54, in main
    args.func(args)
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 69, in do_tokenize
    tokenizer = get_tokenizer(
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/lang.py", line 146, in get_tokenizer
    assert lang_dir is not None
AssertionError
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 261, in <module>
    main()
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 54, in main
    args.func(args)
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/__main__.py", line 114, in do_phonemize
    phonemizer = get_phonemizer(
  File "/home/mbarnig/rhasspy-gruut/lib/python3.8/site-packages/gruut/lang.py", line 184, in get_phonemizer
    assert lang_dir is not None
AssertionError

synesthesiam commented 3 years ago

Thanks! I had forgotten to include "fa" and "sw" in the setup.py language list. Should be fixed in 1.2.2.
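
For readers unfamiliar with how a language list in setup.py relates to `pip install gruut[fa,sw]`: extras are usually declared as a mapping from extra name to requirements. The sketch below is not gruut's actual setup.py. The `example-lang-<code>` package names and the helpers `build_extras` and `missing_extras` are hypothetical, chosen only to show how a code left out of the list ends up without an extra.

```python
# Hypothetical sketch of a per-language extras table (not gruut's real setup.py).

def build_extras(languages, package_prefix="example-lang-"):
    """Map each language code to the requirement list its extra would install."""
    return {lang: [f"{package_prefix}{lang}"] for lang in languages}


def missing_extras(requested, extras):
    """Return the requested extras that the package does not declare."""
    return sorted(set(requested) - set(extras))


# A language list that forgot two codes, as described in the comment above.
declared = build_extras(["cs", "de", "es", "fr", "it", "nl", "pt", "ru", "sv"])
print(missing_extras(["de", "fa", "sw"], declared))

# In a setup.py, the mapping would be passed as:
#   setuptools.setup(..., extras_require=declared)
```

A check like `missing_extras` could be run in a test to catch a language that has data files but no declared extra.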

mbarnig commented 3 years ago

I cloned the latest gruut version with the modified setup.py file from GitHub, installed it with pip install ., and downloaded the Persian and Swahili language files with

wget https://github.com/rhasspy/gruut/releases/download/v0.9.0/fa.tar.gz
wget https://github.com/rhasspy/gruut/releases/download/v1.2.0/sw.tar.gz

into the folders $HOME/.config/gruut/fa and $HOME/.config/gruut/sw.

The Persian language files are extracted as expected:

mbarnig@mbarnig-MS-7B22:~/.config/gruut/fa$ tar -xvf fa.tar.gz
g2p.fst
language.yml
lexicon.db
phonemes.txt
postagger.model

The Swahili language files, however, are packed inside a nested sw folder:

mbarnig@mbarnig-MS-7B22:~/.config/gruut/sw$ tar -xvf sw.tar.gz
sw/
sw/__init__.py
sw/lexicon.db
sw/espeak/
sw/espeak/lexicon.db
sw/espeak/g2p/
sw/espeak/g2p/model.crf
sw/VERSION
sw/g2p/
sw/g2p/model.crf
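
Because this archive wraps everything in a top-level sw/ directory, one option is to drop that leading path component while extracting. The following is a minimal sketch using Python's standard tarfile module. The function name `extract_stripped` is made up for illustration, and the sketch assumes the archive contains only regular files and directories.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath


def extract_stripped(archive, dest, prefix):
    """Extract everything under `prefix/` into dest without the prefix folder."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive) as tar:  # compression is detected automatically
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            # Skip the top-level folder entry itself and anything outside it.
            if len(parts) < 2 or parts[0] != prefix:
                continue
            # Skip path tricks and special entries such as links or devices.
            if ".." in parts or not (member.isfile() or member.isdir()):
                continue
            target = dest.joinpath(*parts[1:])
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted


# Example call (paths taken from this report):
#   extract_stripped("sw.tar.gz", Path.home() / ".config/gruut/sw", "sw")
```

With GNU tar, `tar -xf sw.tar.gz --strip-components=1` run inside the target folder has the same effect.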

I moved up one directory level and extracted the archive there, so that the content lands in the correct folder. Now my test with the Swahili language works:

(rhasspy-gruut) mbarnig@mbarnig-MS-7B22:~/rhasspy-gruut/gruut$ echo 'Kaskazini Upepo na jua wali kuwa wana shindana gani iko na nguvuu kushinda mwingine, msafiri aka kuja na alikuwa anavaa koti mzito. Wali kubaliana mtu ya kwanza kutoa koti ya msafiri ndio akona nguvu kushinda ingine. Upepo ya kaskazini ika jaribu kupiga upepo yake yote, lakini akaona vigumu yake inapiga, zaidi msafiri anafunga koti yake karibu naye, mpaka upepo ya kaskazini ikajishinda. Jua ikaanza ku ngua, mpaka msafiri akatoa koti yake mara moja. Sasa Upepo ya Kaskazini ika kubali jua ikona nguvu kuishinda.' \
> | python3 -m gruut sw tokenize \
> | python3 -m gruut sw phonemize \
> | jq -c .pronunciation_text 
"k ɑ s k ɑ z i n i u p ɛ p ɔ n ɑ ʄ u ɑ w ɑ l i k u w ɑ w ɑ n ɑ ʃ i ⁿɗ ɑ n ɑ ɠ ɑ n i i k ɔ n ɑ ᵑg u v u u k u ʃ i ⁿɗ ɑ m w i ᵑg i n ɛ | m s ɑ f i ɾ i ɑ k ɑ k u ʄ ɑ n ɑ ɑ l i k u w ɑ ɑ n ɑ v ɑ ɑ k ɔ t i m z i t ɔ ‖ w ɑ l i k u ɓ ɑ l i ɑ n ɑ m t u j ɑ k w ɑ ⁿz ɑ k u t ɔ ɑ k ɔ t i j ɑ m s ɑ f i ɾ i ⁿɗ i ɔ ɑ k ɔ n ɑ ᵑg u v u k u ʃ i ⁿɗ ɑ i ᵑg i n ɛ ‖ u p ɛ p ɔ j ɑ k ɑ s k ɑ z i n i i k ɑ ʄ ɑ ɾ i ɓ u k u p i ɠ ɑ u p ɛ p ɔ j ɑ k ɛ j ɔ t ɛ | l ɑ k i n i ɑ k ɑ ɔ n ɑ v i ɠ u m u j ɑ k ɛ i n ɑ p i ɠ ɑ | z ɑ i ɗ i m s ɑ f i ɾ i ɑ n ɑ f u ᵑg ɑ k ɔ t i j ɑ k ɛ k ɑ ɾ i ɓ u n ɑ j ɛ | m p ɑ k ɑ u p ɛ p ɔ j ɑ k ɑ s k ɑ z i n i i k ɑ ʄ i ʃ i ⁿɗ ɑ ‖ ʄ u ɑ i k ɑ ɑ ⁿz ɑ k u ᵑg u ɑ | m p ɑ k ɑ m s ɑ f i ɾ i ɑ k ɑ t ɔ ɑ k ɔ t i j ɑ k ɛ m ɑ ɾ ɑ m ɔ ʄ ɑ ‖ s ɑ s ɑ u p ɛ p ɔ j ɑ k ɑ s k ɑ z i n i i k ɑ k u ɓ ɑ l i ʄ u ɑ i k ɔ n ɑ ᵑg u v u k u i ʃ i ⁿɗ ɑ ‖"

There is, however, still a problem with the Persian language. On the first run I received a warning about installing hazm>=0.7.0 and the following error:

UnboundLocalError: local variable 'hazm' referenced before assignment

I installed hazm, and now the test with the Persian language works:

(rhasspy-gruut) mbarnig@mbarnig-MS-7B22:~/rhasspy-gruut/gruut$ echo 'باد شمال و خورشید داشتن سر اینکه کدوم قوی‌تر هستند بحث می‌کردن که یک‌دفعه یه مسافر که خودش رو در بالاپوش گرمی پوشونده بود پیداش شد. قرار گذاشتن که هر کدوم که بتونه اوّل مسافر رو مجبور به در آوردن بالاپوشش بکنه قوی‌تر از اون‌یکیه. بعد باد شمال به شدیدترین صورتی که می‌تونست شروع به وزیدن کرد، ولی هرچقدر سخت‌تر می‌وزید، مسافر بالاپوش رو محکم‌تر به دور خودش می‌پیچید. در آخر، باد شمال پشیمون شد و دست برداشت. بعد، خورشید شروع کرد به گرمی تابیدن، و مسافر بلافاصله بالاپوشش رو در آورد. به همین خاطر، باد شمال مجبور شد اعتراف کنه که بین اونها، خورشید قوی‌تره.' \
> | python3 -m gruut fa tokenize \
> | python3 -m gruut fa phonemize \
> | jq -c .pronunciation_text 
" ʃ o m ɒː l v æ  d ɒː ʃ t æ n e̞ s æ ɾ e̞ iː n k e̞   h æ s t æ n d b æ h s  k e̞    k e̞ x o d æ ʃ ɾ uː d æ ɾ    b uː d  ʃ o d ‖ ɢ æ ɾ ɒː ɾ  k e̞ h æ ɾ  k e̞    ɾ uː m æ d͡ʒ b uː ɾ b e̞ d æ ɾ ɒː v æ ɾ d æ n e̞    æ z  ‖ b æ ʔ d  ʃ o m ɒː l b e̞  s uː ɾ æ t iː k e̞  ʃ o ɾ uː ʔ b e̞  k o ɾ d | v æ l iː    |   ɾ uː  b e̞ d uː ɾ e̞ x o d æ ʃ  ‖ d æ ɾ ɒː x æ ɾ |  ʃ o m ɒː l  ʃ o d v æ d æ s t b æ ɾ d ɒː ʃ t ‖ b æ ʔ d |  ʃ o ɾ uː ʔ k o ɾ d b e̞   | v æ    ɾ uː d æ ɾ ɒː v æ ɾ æ d ‖ b e̞ h æ m iː n x ɒː t e̞ ɾ |  ʃ o m ɒː l m æ d͡ʒ b uː ɾ ʃ o d   k e̞ b e̞ j n  |   ‖"

As I don't understand either language, I can't check whether the phonemization is correct. :smile: :laughing: :smiley:
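
A note on the UnboundLocalError reported for hazm: in general, Python raises it when a local name is only bound inside a `try` block that failed, and the code then uses the name anyway. The sketch below reproduces that general pattern and shows one way to fail with a clearer message. It is not taken from gruut's source, and the function names `load_buggy` and `load_fixed` are made up for illustration.

```python
import importlib
import warnings


def load_buggy(module_name):
    """Warn on a missing module, then use the name anyway."""
    try:
        hazm = importlib.import_module(module_name)
    except ImportError:
        warnings.warn(f"{module_name} is required for this language")
    # If the import failed, `hazm` was never bound in this function,
    # so the next line raises UnboundLocalError.
    return hazm


def load_fixed(module_name):
    """Stop immediately with an actionable message when the module is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise RuntimeError(
            f"{module_name} is not installed; try: pip install {module_name}"
        ) from err
```

Raising at the import site keeps the hint about the missing package next to the error, rather than in a warning that is followed by an unrelated-looking traceback.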