gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0
444 stars 84 forks

Improve entries for "Fonds" and compounds #45

Closed svenha closed 5 years ago

gooofy commented 5 years ago

the compounds are great, thanks! however, the trouble with "Fonds" is:

(a) our dict lacks the 'ɔ̃' phoneme for proper pronunciation (b) "fonds" is both the singular and the plural form, however the plural form is pronounced with an 's' sound at the end.

svenha commented 5 years ago

(a) our dict lacks the 'ɔ̃' phoneme for proper pronunciation

Oh, yes. I picked another phoneme that was used in the lexicon before, but it should be your phoneme. Are there any plans to extend the German dictionary by frequent phonemes needed for foreign words? I stumbled across this problem because 'Fonds' appeared instead of quite different words ('von' I think).

(b) "fonds" is both singular and plural form, however the plural form is pronounced with an 's' sound at the end.

I was confused by the fact that the pronunciation before the 's' was simplified, as discussed for (a). So, we need both pronunciations, I guess?

gooofy commented 5 years ago

Are there any plans to extend the German dictionary by frequent phonemes needed for foreign words?

while I did not have any immediate plans (yet? ;) ) I am very open to the idea. In fact, we already have a few English phonemes (r, θ, ...). Also note that I do not speak French so I could really use some help here :)

The trouble with adding new phonemes is that we will probably have to support them across the different alphabets and tools we use. A good starting point to add more phonemes is probably nltools/phonetics.py - here you will find tables covering the different phoneme encodings we use:

I was confused by the fact that the pronunciation before the 's' was simplified, as discussed for (a). So, we need both pronunciations, I guess?

it would also require fixing all transcripts that use any of these words - but yes, this would be the right thing to do to fix this issue once and for all.

svenha commented 5 years ago

My French is a little bit rusty, but for the foreign words in German it should help. Fortunately, some non-English phonemes are shared by German (ü, ö) and French, so I agree that adding the four nasal vowels would be a solid step forward. I am not familiar with all 5 phoneme encodings, but I am willing to help with adjusting dict.ipa and/or transcripts.

What happens with the nasal vowels when your script extracts from Wiktionary? Maybe this script can be adjusted to provide a candidate list for dict.ipa?

svenha commented 5 years ago

I accidentally added new changes to this open PR (sorry, I will be more cautious for the next PR :-) ). The original changes for "Fonds" are reverted, so you can safely accept this PR if you like it.

pguyot commented 5 years ago

What is XSAMPA used for?

I am confused by the UCL webpage you link to, as it is inconsistent with other documents on the same website.

From what I understand, SAMPA and IPA encode nasal vowels with modifiers, and logically, as IPA's ɔ is encoded as O in SAMPA, ɔ̃ should be O~ and not o~.

Besides, the following documents write O~: https://www.phon.ucl.ac.uk/home/sampa/ipasam-x.pdf and https://www.phon.ucl.ac.uk/home/sampa/index.html

Likewise, I guess that ɛ̃ should probably be E~ and not e~, and ɑ̃ should probably be A~ and not a~. œ̃ as 9~ is consistent with the IPA encoding.

Also, for the record, to cover the French language, phonetics.py also lacks ɲ, which is J in XSAMPA.

Concerning espeak, the Kirshenbaum transcription scheme also uses a ~ suffix for nasalized vowels.
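
To make the proposed encoding concrete, here is a minimal sketch of the rule described above: take the X-SAMPA symbol of the base vowel and append ~ for nasalization. The table and the function name `ipa_to_xsampa` are hypothetical and are not the actual contents or API of nltools/phonetics.py; the table covers only the symbols mentioned in this comment.

```python
import unicodedata

# Hypothetical table for illustration only; NOT taken from nltools/phonetics.py.
# Base symbols follow X-SAMPA; nasalization is written as the "~" modifier.
IPA_TO_XSAMPA_BASE = {
    "\u0254": "O",  # ɔ
    "\u025b": "E",  # ɛ
    "\u0251": "A",  # ɑ
    "\u0153": "9",  # œ
    "\u0272": "J",  # ɲ
}

COMBINING_TILDE = "\u0303"  # IPA nasalization diacritic


def ipa_to_xsampa(ipa: str) -> str:
    """Convert the IPA symbols covered by the table above to X-SAMPA."""
    out = []
    # NFD keeps base letters and combining diacritics as separate code points.
    for ch in unicodedata.normalize("NFD", ipa):
        if ch == COMBINING_TILDE:
            out.append("~")
        elif ch in IPA_TO_XSAMPA_BASE:
            out.append(IPA_TO_XSAMPA_BASE[ch])
        else:
            raise ValueError(f"unsupported IPA symbol: {ch!r}")
    return "".join(out)


if __name__ == "__main__":
    for symbol in ("\u0254\u0303", "\u025b\u0303", "\u0251\u0303", "\u0153\u0303", "\u0272"):
        print(symbol, ipa_to_xsampa(symbol))
```

Deriving the nasal vowels from the base vowel plus a modifier keeps the nasal entries consistent with the oral ones, so a mismatch such as o~ for ɔ̃ cannot arise.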

svenha commented 5 years ago

@pguyot Good to have a native speaker of French! Thanks for your suggestions. All make sense to me, but the ɲ might be too much because this discussion is about words from French in German. So, isn't nj a good approximation to how Germans speak this sound? Of course, if you want to start a dict-fr.ipa for French, there should be ɲ ...

gooofy commented 5 years ago

@svenha: absolutely agree we should keep the phoneme set small for the German dict - the main issue here is phoneme coverage - if a phoneme is very rare in our recordings for a specific language, we won't have enough training data for it, so it might end up doing more harm than good.
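
One way to check the coverage concern before adding a phoneme is to count how often each phoneme occurs in the transcripts. The following is only a rough sketch under stated assumptions: the toy lexicon, its simplified pronunciations, and the names `phoneme_counts` and `rare_phonemes` are hypothetical and do not reflect dict.ipa or any existing zamia-speech tool.

```python
from collections import Counter

NASAL_O = "\u0254\u0303"  # ɔ̃ (base vowel plus combining tilde)

# Hypothetical toy lexicon: word -> list of phoneme symbols.
# Simplified, illustrative entries; NOT taken from dict.ipa.
LEXICON = {
    "von": ["f", "\u0254", "n"],
    "fonds": ["f", NASAL_O],
}


def phoneme_counts(transcripts, lexicon):
    """Count phoneme occurrences over whitespace-tokenized transcripts."""
    counts = Counter()
    for line in transcripts:
        for word in line.lower().split():
            # Words missing from the lexicon are skipped in this sketch.
            counts.update(lexicon.get(word, []))
    return counts


def rare_phonemes(counts, min_count):
    """Return the phonemes seen fewer than min_count times, sorted."""
    return sorted(p for p, n in counts.items() if n < min_count)


if __name__ == "__main__":
    counts = phoneme_counts(["von von fonds"], LEXICON)
    print(counts)
    print(rare_phonemes(counts, min_count=2))
```

The threshold `min_count` is a free parameter here; what counts as enough training data depends on the acoustic model and is not something this thread specifies.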

@pguyot: having support for French in zamia-speech would be very cool! however, I won't be of much help here as I do not speak French. of course for a French dict you are free to use whatever phonemes you like - it is just the German dict that I would like to keep free of rare French phonemes, at least for now.

about the xsampa encoding: probably a good thing to go with whatever encoding is most common out there. would still be interesting to know what encoding MARY uses?