Improve entries for "Fonds" and compounds

gooofy commented 5 years ago

the compounds are great, thanks! however, the trouble with "Fonds" is:

(a) our dict lacks the 'ɔ̃' phoneme for proper pronunciation
(b) "fonds" is both the singular and the plural form, however the plural form is pronounced with an 's' sound at the end.

svenha commented 5 years ago

(a) our dict lacks the 'ɔ̃' phoneme for proper pronunciation

Oh, yes. I picked another phoneme that was used in the lexicon before, but it should be your phoneme. Are there any plans to extend the German dictionary by frequent phonemes needed for foreign words? I stumbled across this problem because 'Fonds' appeared instead of quite different words ('von' I think).

(b) "fonds" is both singular and plural form, however the plural form is pronounced with an 's' sound at the end.

I was confused by the fact that the pronunciation before the 's' was simplified, as discussed for (a). So, we need both pronunciations, I guess?

gooofy commented 5 years ago

Are there any plans to extend the German dictionary by frequent phonemes needed for foreign words?

while I did not have any immediate plans (yet? ;) ) I am very open to the idea. In fact, we already have a few English phonemes (r, θ, ...). Also note that I do not speak French so I could really use some help here :)

The trouble with adding new phonemes is that we will probably have to support them across the different alphabets and tools we use. A good starting point to add more phonemes is probably nltools/phonetics.py - here you will find tables covering the different phoneme encodings we use:

IPA
XSAMPA: this is extended SAMPA - https://www.phon.ucl.ac.uk/home/sampa/french.htm for example has encodings for nasal vowels (e~ a~ o~ 9~) - not sure how common this encoding is, it was just the first hit Google came up with
MARY: this will take some source code reading and/or experimentation to figure out how MARY encodes French vowels
ESPEAK: just as with MARY TTS we should investigate how eSpeak encodes nasal vowels
XARPABET: this is my own creation, mainly used for CMU sphinx training - here we can use whatever encoding we like as long as it is unique
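
To illustrate what "supporting a phoneme across all encodings" could involve, here is a minimal sketch of a table with one row per phoneme and one column per encoding. The layout, the names `ENCODINGS`, `NASAL_VOWEL_ROWS`, `missing_encodings` and `duplicate_symbols`, and the XARPABET symbols are all assumptions made for illustration; they are not taken from nltools/phonetics.py. The MARY and ESPEAK columns are left as `None` because those encodings still have to be investigated.

```python
# Hypothetical layout: one row per phoneme, one column per encoding.
# This is NOT the actual structure of nltools/phonetics.py.
ENCODINGS = ("ipa", "xsampa", "mary", "espeak", "xarpabet")

# None marks an encoding that still has to be researched.
# The X-SAMPA column uses the uppercase base symbols (O, E, A); the UCL
# French page linked in this thread lists lowercase variants instead.
# The XARPABET names are made-up placeholders.
NASAL_VOWEL_ROWS = [
    ("\u0254\u0303", "O~", None, None, "ON"),   # open o + combining tilde
    ("\u025b\u0303", "E~", None, None, "EN"),   # open e + combining tilde
    ("\u0251\u0303", "A~", None, None, "AN"),   # script a + combining tilde
    ("\u0153\u0303", "9~", None, None, "OEN"),  # oe ligature + combining tilde
]


def missing_encodings(rows):
    """Map each IPA symbol to the encodings that are still undefined."""
    gaps = {}
    for row in rows:
        missing = [name for name, value in zip(ENCODINGS, row) if value is None]
        if missing:
            gaps[row[0]] = missing
    return gaps


def duplicate_symbols(rows, encoding):
    """Return the symbols that occur more than once in one encoding column."""
    col = ENCODINGS.index(encoding)
    seen, dupes = set(), set()
    for row in rows:
        symbol = row[col]
        if symbol is None:
            continue
        if symbol in seen:
            dupes.add(symbol)
        seen.add(symbol)
    return sorted(dupes)
```

Calling `missing_encodings(NASAL_VOWEL_ROWS)` lists the open research items per phoneme, and `duplicate_symbols(NASAL_VOWEL_ROWS, "xarpabet")` checks the uniqueness requirement mentioned for XARPABET.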

I was confused by the fact that the pronunciation before the 's' was simplified, as discussed for (a). So, we need both pronunciations, I guess?

it would also require fixing all transcripts that use any of these words - but yes, this would be the right thing to do to fix this issue once and for all.

svenha commented 5 years ago

My French is a little bit rusty, but for the foreign words in German it should help. Fortunately, some non-English phonemes are shared by German (ü, ö) and French, so I agree that adding the four nasal vowels would be a solid step forward. I am not familiar with all 5 phoneme encodings, but I am willing to help with adjusting dict.ipa and/or transcripts.

What happens with the nasal vowels when your script extracts from Wiktionary? Maybe this script can be adjusted to provide a candidate list for dict.ipa?

svenha commented 5 years ago

I accidentally added new changes to this open PR (sorry, I will be more cautious for the next PR :-) ). The original changes for "Fonds" are reverted, so you can safely accept this PR if you like it.

pguyot commented 5 years ago

What is XSAMPA used for?

I am confused by the UCL webpage you link to, as it is inconsistent with other documents on the same website.

From what I understand, SAMPA and IPA encode nasal vowels with modifiers, and logically, as IPA's ɔ is encoded as O in SAMPA, ɔ̃ should be O~ and not o~.

Besides, the following documents use O~: https://www.phon.ucl.ac.uk/home/sampa/ipasam-x.pdf and https://www.phon.ucl.ac.uk/home/sampa/index.html

Likewise, I guess that:
ɛ̃ probably should be E~ and not e~.
ɑ̃ probably should be A~ and not a~.
œ̃ as 9~ is consistent with the IPA encoding.
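
The modifier argument above can be sketched in a few lines: translate the base vowel with its plain X-SAMPA symbol, then append `~` for the combining tilde. The function name `nasal_ipa_to_xsampa` and the four-entry table are assumptions for illustration only and cover just the vowels discussed here; this is not code from phonetics.py.

```python
import unicodedata

COMBINING_TILDE = "\u0303"  # IPA nasalization diacritic

# Base vowels only; the nasal forms are derived, not listed.
BASE_IPA_TO_XSAMPA = {
    "\u0254": "O",  # open o
    "\u025b": "E",  # open e
    "\u0251": "A",  # script a
    "\u0153": "9",  # oe ligature
}


def nasal_ipa_to_xsampa(ipa):
    """Convert a string of the vowels above, plain or nasalized, to X-SAMPA."""
    # NFD makes sure the tilde is a separate code point after its base vowel.
    decomposed = unicodedata.normalize("NFD", ipa)
    out = []
    for ch in decomposed:
        if ch == COMBINING_TILDE:
            if not out:
                raise ValueError("tilde without a preceding base vowel")
            out.append("~")
        elif ch in BASE_IPA_TO_XSAMPA:
            out.append(BASE_IPA_TO_XSAMPA[ch])
        else:
            raise ValueError("unsupported IPA symbol: %r" % ch)
    return "".join(out)
```

With this rule, `nasal_ipa_to_xsampa("\u0254\u0303")` yields `O~`, and the lowercase forms (o~, e~, a~) never arise because the base symbols are already uppercase.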

Also, for the record, to cover the French language, phonetics.py also lacks ɲ, which is J in XSAMPA.

Concerning eSpeak, the Kirshenbaum transcription scheme also uses a ~ suffix for nasalized vowels.

svenha commented 5 years ago

@pguyot Good to have a native speaker of French! Thanks for your suggestions. All make sense to me, but the ɲ might be too much because this discussion is about words from French in German. So, isn't nj a good approximation to how Germans speak this sound? Of course, if you want to start a dict-fr.ipa for French, there should be ɲ ...
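
As a rough illustration of the approximation suggested here, the sketch below rewrites ɲ as the sequence nj before an entry goes into the German dictionary. The function name `approximate_for_german` and the sample string are hypothetical; the sample is not a verified transcription of any word.

```python
PALATAL_NASAL = "\u0272"  # IPA palatal nasal


def approximate_for_german(ipa):
    """Replace the palatal nasal with the two-phoneme sequence 'nj'."""
    return ipa.replace(PALATAL_NASAL, "nj")


# Made-up input string, used only to show the substitution.
print(approximate_for_german("a" + PALATAL_NASAL + "a"))
```

This keeps the German phoneme set unchanged, at the cost of a less faithful rendering of the French original.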

gooofy commented 5 years ago

@svenha: absolutely agree we should keep the phoneme set small for the German dict - main issue is phoneme coverage here - if a phoneme is very rare in our recordings for a specific language we won't have enough training data on it so it might end up doing more harm than good.
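
One way to make the coverage concern measurable is to weight each phoneme by how often the words containing it occur in the transcripts. The sketch below assumes a lexicon that maps words to pre-split phoneme lists and a word-frequency table; both data structures, the function names, and the toy entries are assumptions for illustration, not real zamia-speech data or transcriptions.

```python
from collections import Counter


def phoneme_counts(lexicon, word_freq):
    """Count phoneme occurrences, weighted by word frequency in transcripts."""
    counts = Counter()
    for word, phonemes in lexicon.items():
        n = word_freq.get(word, 0)
        for phoneme in phonemes:
            counts[phoneme] += n
    return counts


def rare_phonemes(counts, min_count):
    """Return phonemes seen fewer than min_count times, sorted."""
    return sorted(p for p, c in counts.items() if c < min_count)


# Toy data only: made-up frequencies and simplified phoneme lists.
lexicon = {
    "von": ["f", "\u0254", "n"],
    "fonds": ["f", "\u0254\u0303"],
}
word_freq = {"von": 500, "fonds": 3}

counts = phoneme_counts(lexicon, word_freq)
print(rare_phonemes(counts, min_count=50))
```

Phonemes that fall below the chosen threshold are candidates for being approximated with existing phonemes rather than added to the set; the threshold itself is a judgment call that depends on the training setup.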

@pguyot: having support for French in zamia-speech would be very cool! however, I won't be of much help here as I do not speak French. of course for a French dict you are free to use whatever phonemes you like - it is just the German dict that I would like to keep from using rare French phonemes, at least for now.

about the xsampa encoding: probably a good thing to go with whatever encoding is most common out there. would still be interesting to know what encoding MARY uses?

gooofy / zamia-speech

Improve entries for "Fonds" and compounds #45