Open loretoparisi opened 6 years ago
It would be great to add Italian -- do you know of any sources for such a dictionary?
If there isn't something already available under an open license, it might be possible to generate a dictionary using a script. In that case, we would need to have both a word list (something like Aspell would probably be fine) and a list of rules for representing Italian orthography in IPA. The script would then apply these rules on the word list to generate the dictionary.
The script option described above is only really practical if there is a reasonably consistent correspondence between the orthography and pronunciation. My impression is that this is the case with Standard Italian, so it might be worth a try if nothing else is available.
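To make the rule-based idea concrete, here is a minimal sketch in Python. The rule table and the names `RULES`, `to_ipa` and `build_dictionary` are illustrative assumptions, not an existing script. The rules cover only a few consonant spellings and ignore stress, open versus closed vowels, the voicing of s and z, and doubled digraphs, so the output is a rough approximation at best.

```python
import re

# Illustrative, incomplete rule table (an assumption, not a vetted rule set).
# Each entry is (regex on the spelling, IPA output); earlier rules win.
RULES = [(re.compile(pattern), ipa) for pattern, ipa in [
    (r"sci(?=[aou])", "ʃ"),    # sciarpa
    (r"sc(?=[ei])", "ʃ"),      # scena
    (r"gli(?=[aeou])", "ʎ"),   # famiglia
    (r"gn", "ɲ"),              # gnocchi
    (r"ch", "k"),              # chi
    (r"ci(?=[aou])", "tʃ"),    # ciao
    (r"gi(?=[aou])", "dʒ"),    # giorno
    (r"c(?=[ei])", "tʃ"),      # cena
    (r"g(?=[ei])", "dʒ"),      # gelo
    (r"qu", "kw"),             # quando
    (r"c", "k"),               # casa
    (r"h", ""),                # h is silent on its own
]]


def to_ipa(word):
    """Scan the word left to right, applying the first rule that matches."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for pattern, ipa in RULES:
            match = pattern.match(word, i)
            if match:
                out.append(ipa)
                i = match.end()
                break
        else:
            # No rule applies: copy the letter through unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


def build_dictionary(words):
    """Map every word in a word list to its approximate transcription."""
    return {word: to_ipa(word) for word in words}


if __name__ == "__main__":
    for word, ipa in build_dictionary(["cena", "gnocchi", "ciao", "quando"]).items():
        print(f"{word}\t/{ipa}/")
```

A single left-to-right pass is used instead of chained search-and-replace so that the output of one rule can never be rewritten by a later rule.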
@dohliam thanks! I will have a look to find a good dictionary for that.
For the spelling part in IT there are the hunspell dictionaries here: https://github.com/loretoparisi/dictionaries adapted to HunSpell from LibreOffice dictionaries: https://github.com/LibreOffice/dictionaries
that have dictionaries for
while CMUSphinx is a good source for the phonetics dictionaries: https://cmusphinx.github.io/ and they have the Italian Phonetics Dictionary (used to build a Grapheme to Phoneme prediction as well) in the downloads: https://cmusphinx.github.io/wiki/download/ and here: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Italian/
as well as other languages. One problem I can see is that the encoding of the IPA symbols is not clear to me:
celebrare tSS e l e b r a1 r e
celebrato tSS e l e b r a1 t o
celebravano tSS e l e b r a1 v a n o
celeste tSS e l EE s t e
celestiale tSS e l e s t j a1 l e
celestiali tSS e l e s t j a1 l i
celesti tSS e l EE s t i
celio tSS EE l j o
celi tSS EE l i
cella tSS EE l l a
cenare tSS e n a1 r e
cenarono tSS e n a1 r o n o
cenato tSS e n a1 t o
@loretoparisi Fantastic! Thanks for finding all this info. I wasn't aware that there were CMU dictionaries for other languages. The transcription format is indeed a little odd, but luckily it's also fairly familiar since I already converted the en_US dictionary from CMU format before.
I've written a quick script (here) to convert the Italian CMU dictionary into IPA. You can see the result of this conversion here.
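For readers without access to the linked script, the following is a minimal Python sketch of this kind of conversion. It is not the script linked above. The symbol table covers only the symbols visible in the sample entries earlier in this thread, and the readings are assumptions: `tSS` as tʃ, `EE` as a stressed open e (sample entries with `EE` carry no other stress mark), and a trailing `1` as primary stress on that vowel. The name `cmu_line_to_ipa` is hypothetical.

```python
STRESS = "ˈ"

# Assumed readings for the symbols seen in the sample; anything else is
# copied through unchanged.
SYMBOLS = {
    "tSS": "tʃ",
    "EE": STRESS + "ɛ",
}


def cmu_line_to_ipa(line):
    """Convert one 'word phone phone ...' line to (word, /ipa/)."""
    word, *phones = line.split()
    ipa = []
    for phone in phones:
        if phone.endswith("1"):
            # Assumption: a trailing 1 marks primary stress on this vowel.
            ipa.append(STRESS + phone[:-1])
        else:
            ipa.append(SYMBOLS.get(phone, phone))
    return word, "/" + "".join(ipa) + "/"


if __name__ == "__main__":
    sample = [
        "celebrare tSS e l e b r a1 r e",
        "cella tSS EE l l a",
    ]
    for entry in sample:
        word, ipa = cmu_line_to_ipa(entry)
        print(f"{word}\t{ipa}")
```

Note that the stress mark ends up directly before the stressed vowel, not at the start of the syllable.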
There are still some remaining issues -- notably the sound they transcribe as nf is highly questionable since it sometimes seems to correspond to ŋf (e.g., trionfo), sometimes to nv (e.g., circonvicini) and sometimes to nf (e.g., conferma). These may need to be manually fixed.
I've managed to extract the primary stress markers out of the data, which is useful, but because they place stress on the vowel and don't indicate syllable boundaries, it's very difficult to position these correctly at the beginning of the syllable in the resulting IPA. So for example, città is converted to /tʃittˈa/ rather than /tʃitˈta/ because we would need some way for the script to know that the syllable should be split between the two consonants. These will have to be adjusted by adding syllable parsing rules to the script (or manually).
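One possible shape for such a syllable rule, sketched in Python under simplifying assumptions: the stress mark sits directly before the stressed vowel, the transcription has no surrounding slashes, and the onset of the stressed syllable is either the last consonant of the preceding cluster, an obstruent plus liquid, or any consonant plus glide. The function name `move_stress` and the consonant classes are hypothetical, and clusters beginning with s are not treated specially.

```python
STRESS = "ˈ"
VOWELS = set("aeiouɛɔ")
AFFRICATES = ("tʃ", "dʒ", "ts", "dz")
GLIDES = {"j", "w"}
LIQUIDS = {"r", "l"}
SONORANTS = GLIDES | LIQUIDS | {"m", "n", "ɲ", "ʎ", "ŋ"}


def tokenize(cluster):
    """Split a consonant cluster into segments, keeping affricates whole."""
    tokens = []
    i = 0
    while i < len(cluster):
        if cluster[i:i + 2] in AFFRICATES:
            tokens.append(cluster[i:i + 2])
            i += 2
        else:
            tokens.append(cluster[i])
            i += 1
    return tokens


def move_stress(ipa):
    """Move the stress mark from before the vowel to the syllable onset."""
    pos = ipa.find(STRESS)
    if pos == -1:
        return ipa
    before, after = ipa[:pos], ipa[pos + 1:]

    # Walk back over the consonants that precede the stressed vowel.
    start = len(before)
    while start > 0 and before[start - 1] not in VOWELS:
        start -= 1
    cluster = tokenize(before[start:])

    if start == 0:
        onset_len = len(cluster)   # word-initial: the whole cluster is onset
    elif not cluster:
        onset_len = 0              # vowel directly before the stressed vowel
    elif len(cluster) >= 2 and (
        cluster[-1] in GLIDES
        or (cluster[-1] in LIQUIDS and cluster[-2] not in SONORANTS)
    ):
        onset_len = 2              # e.g. br, tr, tj
    else:
        onset_len = 1              # geminates and clusters such as nt or rl

    split = len(cluster) - onset_len
    coda = "".join(cluster[:split])
    onset = "".join(cluster[split:])
    return before[:start] + coda + STRESS + onset + after


if __name__ == "__main__":
    for ipa in ["tʃittˈa", "tʃelebrˈare", "tʃˈɛlla"]:
        print(ipa, "->", move_stress(ipa))
```

With these rules, the città example comes out as tʃitˈta because a geminate is split between its two consonants. Real Italian syllabification has more cases than this, so the output would still need checking.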
The provided CMU dictionary is a little small unfortunately -- only 7109 entries. It's a good start, but it would be much better if we could parse the Hunspell / Aspell word list instead. Do you have any experience with using CMU Sphinx to generate phonetic output? If so, we could use my script to convert the result to IPA.
@dohliam You are welcome, as you said it's a good start! I think it's a good idea to use CMU Sphinx directly to generate phonetic output using the model provided for Italian (that is, the file it.fst); this should handle the problem of out-of-vocabulary words. Let me have a look at the model. Of course, since the training was done on a small dictionary (the 7109 entries), we could also have false positives in the output, but this is something we should check later on.
By the way, according to the readme for the it model, we have:
EVALUATION RESULTS
----------------------------------------------------------------------
(T)otal tokens in reference: 2528
(M)atches: 2404 (S)ubstitutions: 122 (I)nsertions: 0 (D)eletions: 2
% Correct (M/T) -- %95.09
% Token ER ((S+I+D)/T) -- %4.91
% Accuracy 1.0-ER -- %95.09
--------------------------------------------------------
(S)equences: 357 (C)orrect sequences: 257 (E)rror sequences: 100
% Sequence ER (E/S) -- %28.01
% Sequence Acc (1.0-E/S) -- %71.99
######################################################################
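As a sanity check on the figures above, the percentages can be recomputed from the raw counts with the formulas given in the report itself. The function name `g2p_error_rates` and the dictionary keys are hypothetical; only the counts come from the readme output.

```python
def g2p_error_rates(total, matches, subs, ins, dels, sequences, error_sequences):
    """Recompute the report's percentages from its raw counts."""
    token_er = (subs + ins + dels) / total        # (S+I+D)/T
    sequence_er = error_sequences / sequences     # E/S
    return {
        "correct_pct": 100 * matches / total,     # M/T
        "token_er_pct": 100 * token_er,
        "accuracy_pct": 100 * (1 - token_er),
        "sequence_er_pct": 100 * sequence_er,
        "sequence_acc_pct": 100 * (1 - sequence_er),
    }


# Counts copied from the evaluation report above.
rates = g2p_error_rates(
    total=2528, matches=2404, subs=122, ins=0, dels=2,
    sequences=357, error_sequences=100,
)

if __name__ == "__main__":
    for name, value in rates.items():
        print(f"{name}: {value:.2f}")
```

The token-level numbers measure individual phonemes, while the sequence-level numbers count a word as wrong if any phoneme in it is wrong, which is why roughly 28% of words contain at least one error even though about 95% of phonemes are correct.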
I will try to run the model over the Hunspell dict and we will see how the accuracy holds up on the test set. I have put the stuff here as well: https://github.com/loretoparisi/ipa-phonetics-dict/blob/master/it/README
Starting from the new work of the CMU guys, I have also worked on a TensorFlow G2P model to take out-of-vocabulary words into account and to have a neural network model for that. This is the Docker setup I'm using for that:
https://github.com/loretoparisi/docker/tree/master/g2p-seq2seq
This is a work in progress, and it should replace the current CMU models in the future, so it will work for Italian too.
@loretoparisi That's amazing! Sounds like it could be a much better approach, and it will be interesting to see how accurate the results are on the Hunspell list. In the meantime I'll see what I can do about the syllabification issue -- hopefully there are enough clear rules about what constitutes a syllable that we can automate the conversion of stress markers in the final result.
@loretoparisi Just checking in... Have you had any progress with this so far? It would be great to add Italian to the database once it's ready! :smile:
@dohliam I have basically used https://github.com/loretoparisi/ipa-phonetics-dict/tree/master/it for this. As for the spelling accuracy, I have to go back to it, since I did that some time ago. I will update.
@loretoparisi Excellent, thanks! :+1: I have this version from before but will wait for the update to convert it and add to the database.
I've just discovered your project. Any further progress on adding Italian to the database?
@doolio The links above are the latest progress I am aware of with the Italian IPA list. In case you would like to try working with something in the meantime, there are two options: this list which is not very large and has been auto-generated based on the Italian CMU dictionary, and this one which attempts to use a G2P approach to handle out-of-vocabulary words. Neither of these has been manually checked for errors, though, which is why there is currently nothing for Italian yet in the main repo here. All contributions welcome! :smile:
@loretoparisi Have you had the chance to take a look at this recently? It would be great to add Italian to the project if possible.
Thanks. Yes, I had a look at the first list already as it was linked earlier in this discussion. You seem to have forgotten the link to the second list. I'm trying to learn Italian, and if I improve these lists in doing so, I will of course contribute them back here.
@doolio Fixed the link, but that page is also linked earlier in this discussion, so you may have seen it already. In both cases, the output needs to be checked by someone to make sure there are no glaring errors in the transcription. My sense is that Italian orthography might be regular enough that there would likely not be more than a few outliers or exceptions for a rule-based transcription, but it would be nice if someone could confirm that and correct the output if needed.
Any plans to add an Italian IPA dict? Thanks.