surrsurus / text-to-ipa

Text to IPA converter in JavaScript
GNU General Public License v3.0

[Enhancement] Support for other languages #1

Open loretoparisi opened 6 years ago

loretoparisi commented 6 years ago

Hi, what about support for other character sets and languages? Since the current dictionary is the CMU dictionary, I suppose that to get it working for a new language one would need a list of the most common words in that language and the IPA transcription rules for that language, right?

surrsurus commented 6 years ago

Yep, that's correct. Finding the CMU dictionary was the first step, and I haven't yet researched dictionaries for non-English languages. However, it should be just as simple to replace or extend the CMU dictionary with any other dictionary, so long as it follows the same format.

loretoparisi commented 6 years ago

@surrsurus thanks a lot. I will try to find IPA transcriptions, at least for European languages.

loretoparisi commented 6 years ago

@surrsurus another question: what about words not contained in the dictionary? CMU has the LOGIOS Lexicon Tool (written in Perl), which generates pronunciations for words that are not in the current dictionary. Since it was made for speech research, its output is most likely aimed at the CMU Sphinx tool, but the interesting part is the implementation here: http://svn.code.sf.net/p/cmusphinx/code/trunk/logios/scripts/Logios.pm

surrsurus commented 6 years ago

A procedural tool for generating IPA symbols from words would be useful, though it doesn't appear that the LOGIOS tool produces output in the proper format. Maybe it could be changed so that it does.

loretoparisi commented 6 years ago

@surrsurus yes, you are right. In fact, that is where I got stuck: looking at the Perl code, it produces output for the Sphinx text to speech tool rather than standard CMU dictionary output. I will keep looking for something useful and update here!

loretoparisi commented 6 years ago

@surrsurus Hello, I was trying to better understand the IPA format and how it relates to the CMU format. I can see that the CMU dictionary separates the symbols with whitespace, like:

a AH0
a(2) EY1
a's EY1 Z
a. EY1
a.'s EY1 Z
a.d. EY2 D IY1
a.m. EY2 EH1 M
a.s EY1 Z
aaa T R IH2 P AH0 L EY1

while in the IPA dict I have

a ʌ
a(1) ejˈ
a's ejˈz
a. ejˈ
a.'s ejˈz
a.s ejˈz

To map the phonemes, suppose I would like to keep the whitespace between the symbols. How could I do that? It would help when parsing the dictionary symbols with a character-based neural network for training.
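One way to keep the whitespace is to convert each CMU symbol separately and join the results with spaces. The following is a minimal sketch, not the project's actual converter: the names `ARPABET_TO_IPA`, `STRESS_TO_IPA`, and `cmu_to_spaced_ipa` are hypothetical, and the tables cover only the symbols whose IPA form can be read off the entries quoted above (`AH0` as `ʌ`, `EY1` as `ejˈ`, `Z` as `z`). A complete table would have to be filled in from the full dictionaries.

```python
# Partial, illustrative tables inferred only from the entries quoted in this
# thread. Any symbol not listed here raises a KeyError.
ARPABET_TO_IPA = {"AH": "ʌ", "EY": "ej", "Z": "z"}
# Stress digit -> mark appended after the vowel, as in "EY1" -> "ejˈ".
STRESS_TO_IPA = {"0": "", "1": "ˈ"}


def cmu_to_spaced_ipa(line):
    """Convert one CMU-style line to (word, IPA symbols separated by spaces)."""
    word, *symbols = line.split()
    ipa = []
    for sym in symbols:
        # Vowels carry a trailing stress digit; consonants do not.
        if sym[-1].isdigit():
            base, stress = sym[:-1], sym[-1]
        else:
            base, stress = sym, ""
        ipa.append(ARPABET_TO_IPA[base] + STRESS_TO_IPA.get(stress, ""))
    return word, " ".join(ipa)


word, ipa = cmu_to_spaced_ipa("a's EY1 Z")
print(word, ipa)
```

Because each CMU symbol maps to one space-separated IPA token, the phoneme boundaries survive the conversion and can be recovered later with a plain `split()`.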

surrsurus commented 6 years ago

As it stands, whitespace separates the regular text from the IPA text on each line. The line is split on that whitespace into an array, so the first element is the regular text and whatever is left is the IPA text. Maybe a similar approach could work for you.
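The splitting described above could look roughly like this in Python. This is only a sketch of the idea, not the library's JavaScript code, and `parse_line` is a hypothetical name.

```python
def parse_line(line):
    """Split a dictionary line into (word, pronunciation).

    The first whitespace-separated field is the word; everything after it
    is treated as the pronunciation.
    """
    parts = line.split()
    if len(parts) < 2:
        return None  # blank or malformed line
    return parts[0], " ".join(parts[1:])


print(parse_line("a's ejˈz"))
print(parse_line("aaa T R IH2 P AH0 L EY1"))
```

The same function handles both the IPA lines and the space-separated CMU lines, since only the first field is treated specially.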

loretoparisi commented 6 years ago

@surrsurus thank you. In fact, I'm trying that approach, simply doing the following for both graphemes and phonemes:

phonemes.append(list(split_line[1]))

I will update if it works as expected.
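For context, the one-liner above might sit inside a loop like the one below. This is a sketch under assumptions: `load_pairs` is a hypothetical name, and the input is assumed to use the `word ipa` layout shown earlier in the thread. Note that `list()` splits a string into individual code points, so a stress mark such as `ˈ` becomes its own token.

```python
def load_pairs(lines):
    """Build parallel character lists for graphemes and phonemes."""
    graphemes, phonemes = [], []
    for line in lines:
        split_line = line.split(None, 1)  # word, then the rest of the line
        if len(split_line) != 2:
            continue  # skip blank or malformed lines
        graphemes.append(list(split_line[0]))
        phonemes.append(list(split_line[1].strip()))
    return graphemes, phonemes


graphemes, phonemes = load_pairs(["a ʌ", "a's ejˈz"])
print(graphemes)
print(phonemes)
```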

dohliam commented 6 years ago

@surrsurus @loretoparisi You might be interested in the ipa-dict project, which already provides IPA dictionaries in 19 languages. You could simply adapt those dictionaries for use here. The format is essentially the same as ipadict.txt here.

There is also a web interface called ipa-lookup for looking up IPA pronunciation based on the ipa-dict data. Text-to-ipa seems like it's faster (I assume because it doesn't load the whole dictionary into memory), so it would be interesting to try it out with some of the different language files.

surrsurus commented 6 years ago

Looks very interesting. It could definitely work, considering all the files seem to be in exactly the same format as ipadict.txt. Something should also be done about loading the file: the entire thing really shouldn't be loaded at once, so now would be a good time to fix that along with implementing new languages.
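One possible way to avoid loading the whole file is to scan it line by line and stop at the first match. The sketch below illustrates that idea in Python rather than in the library's JavaScript. `lookup` is a hypothetical name, and an in-memory `io.StringIO` stands in for the real dictionary file.

```python
import io


def lookup(word, lines):
    """Return the pronunciation for `word`, reading one line at a time."""
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == word:
            return parts[1].strip()
    return None  # word not found


# A file object yields lines lazily, so only one line is held at a time.
sample = io.StringIO("a ʌ\na's ejˈz\n")
print(lookup("a's", sample))
```

With a real dictionary, the second argument would be an open file handle, for example `open("ipadict.txt", encoding="utf-8")`. The trade-off is that every lookup costs a linear scan, so an index or per-letter split of the file may be preferable if many words are looked up.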