rhasspy / gruut

A tokenizer, text cleaner, and phonemizer for many human languages.
MIT License

Add Slovak (sk) language #41

Open neurlang opened 10 months ago

neurlang commented 10 months ago

I would like to suggest adding dataset.txt, a list of 24865 hand-reviewed Slovak words with their pronunciations. Which license would be preferable for the gruut project? I am the author and can release it under any license you prefer.

https://github.com/neurlang/toipa/tree/master/sk2ipa

Fixes that would be needed:

  1. remove the ' character
  2. replace θ with c
  3. add spaces between phonemes
  4. remove words that map to the A / F placeholder
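
The four fixes above could be sketched roughly as follows. This is only an illustration, not gruut code: the function names are hypothetical, it assumes that "placeholder" means an uppercase A or F appearing in the pronunciation, and it assumes that combining marks, tie bars, and length marks belong to the preceding phoneme when adding spaces.

```python
import unicodedata

# Assumption: uppercase A / F in a pronunciation marks a placeholder entry.
PLACEHOLDERS = ("A", "F")
# Tie bars join two symbols into one phoneme (affricates such as t͡s).
TIE_BARS = {"\u0361", "\u035c"}
# Length marks stay attached to the preceding vowel or consonant.
LENGTH_MARKS = {"\u02d0", "\u02d1"}


def split_phonemes(ipa):
    """Split an IPA string into phonemes, keeping diacritics attached."""
    phonemes = []
    join_next = False  # True right after a tie bar
    for ch in ipa:
        if ch.isspace():
            join_next = False
            continue
        attach = (
            unicodedata.combining(ch) != 0
            or ch in LENGTH_MARKS
            or join_next
        )
        if attach and phonemes:
            phonemes[-1] += ch
        else:
            phonemes.append(ch)
        join_next = ch in TIE_BARS
    return phonemes


def clean_pronunciation(ipa):
    """Apply fixes 1-4; return None when the entry should be dropped."""
    if any(p in ipa for p in PLACEHOLDERS):  # fix 4
        return None
    ipa = ipa.replace("'", "")               # fix 1
    ipa = ipa.replace("θ", "c")              # fix 2
    return " ".join(split_phonemes(ipa))     # fix 3
```

For example, `clean_pronunciation("ma'θka")` would give `m a c k a`, while an entry whose pronunciation is just `A` would be dropped.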

The cleaned entries would then be loaded into the word_phonemes table of lexicon.db.
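
A minimal sketch of that loading step, using Python's sqlite3 module. The column layout (word, pron_order, phonemes) is my assumption about the word_phonemes schema and should be checked against an existing lexicon.db; `load_lexicon` is a hypothetical helper, not part of gruut.

```python
import sqlite3


def load_lexicon(conn, entries):
    """Insert (word, phonemes) pairs into word_phonemes; return the row count."""
    # Assumed schema; verify against a lexicon.db shipped with gruut.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_phonemes ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "word TEXT, pron_order INTEGER, phonemes TEXT)"
    )
    next_order = {}  # per-word counter for alternative pronunciations
    rows = []
    for word, phonemes in entries:
        order = next_order.get(word, 0)
        rows.append((word, order, phonemes))
        next_order[word] = order + 1
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO word_phonemes (word, pron_order, phonemes) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)
```

It could be called with `sqlite3.connect("lexicon.db")` and an iterable of (word, space-separated phonemes) pairs; words that occur more than once get increasing pron_order values.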

What is the g2p_alignments table for?

I can also generate a larger dictionary (up to 300k words) using the neural network, but its entries could contain mistakes.