IPA (International Phonetic Alphabet) voice recognition and text to speech

Kreijstal commented 5 years ago

Project description

This is essentially the same as voice recognition except easier because programs will only listen for pronunciation without attempting to map it to a word.

Relevant Technology

Probably the same techniques used in traditional voice recognition

Complexity and required time

[Please only tick off one box in each category by changing [ ] to [x]. The labels on the project will then be updated by the maintainers as soon as possible.]

Complexity

[ ] Beginner - This project requires no or little prior knowledge of the technolog(y|ies) specified to contribute to the project
[ ] Intermediate - The user should have some prior knowledge of the technolog(y|ies) to the point where they know how to use it, but not necessarily all the nooks and crannies of the technology
[x] Advanced - The project requires the user to have a good understanding of all components of the project to contribute

Required time (ETA)

[ ] Little work - A couple of days
[ ] Medium work - A week or two
[x] Much work - The project will take more than a couple of weeks and serious planning is required

straight-shoota commented 5 years ago

> easier because programs will only listen for pronunciation without attempting to map it to a word.

That's hardly true. Speech recognition is easier when mapped to a specific language because knowing the linguistic features of a language plus the vocabulary of that language helps a lot with recognizing otherwise ambiguous phrases.

There are 34 different vowel sounds alone that would need to be differentiated. Getting voice recognition to capture the fine nuances of the phonetic alphabet without the guidance of a language framework is most likely not solvable in the near future.

Kreijstal commented 3 years ago

> > easier because programs will only listen for pronunciation without attempting to map it to a word.
>
> That's hardly true. Speech recognition is easier when mapped to a specific language because knowing the linguistic features of a language plus the vocabulary of that language helps a lot with recognizing otherwise ambiguous phrases.
>
> There are 34 different vowel sounds alone that would need to be differentiated. Getting voice recognition to capture the fine nuances of the phonetic alphabet without the guidance of a language framework is most likely not solvable in the near future.

I guess the idea is to make a 1 to 1 map of audio sounds you can make to symbols.

bastienboutonnet commented 3 years ago

Oooeh! Might be a fun thing to try during Christmas holiday! Do you have a use case in mind @Kreijstal ?

dshiryu commented 3 years ago

> Oooeh! Might be a fun thing to try during Christmas holiday! Do you have a use case in mind @Kreijstal ?

Considering what he mentioned about the 1 to 1 map, and since I have an (unused) Linguistics degree, I'd guess one possibility is taking a given audio recording and giving back the transcription. For instance: "The runner crossed the finishing line." returns /ðə ˈrʌnər krɔst ðə ˈfɪnɪʃɪŋ ˈlaɪn/ (probably without spaces, unless a dictionary is implemented).

I'm guessing that the "easiest" way to tackle this is with AI, right? Basically take the sounds from https://www.ipachart.com/ as a basis for knowing which symbol each sound uses, then add the training corpus and let it work its magic (however that works, I'm still working on my CS degree lol).
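
As a rough illustration of mapping sounds to symbols, here is a minimal sketch that picks the closest IPA vowel for a pair of formant frequencies (F1, F2). The reference values are approximate textbook-style averages for adult male speakers of American English and are only assumptions for the example. `REFERENCE_VOWELS` and `nearest_vowel` are hypothetical names, and a real system would presumably learn this mapping from a training corpus rather than hard-code it.

```python
import math

# Approximate (F1, F2) formant averages in Hz for a few vowels.
# Illustrative assumptions only, not values to rely on.
REFERENCE_VOWELS = {
    "i": (270.0, 2290.0),
    "ɪ": (390.0, 1990.0),
    "ɛ": (530.0, 1840.0),
    "æ": (660.0, 1720.0),
    "ɑ": (730.0, 1090.0),
    "ɔ": (570.0, 840.0),
    "ʊ": (440.0, 1020.0),
    "u": (300.0, 870.0),
    "ʌ": (640.0, 1190.0),
}


def nearest_vowel(f1: float, f2: float) -> str:
    """Return the IPA symbol whose reference formants are closest to (f1, f2)."""
    return min(
        REFERENCE_VOWELS,
        key=lambda symbol: math.dist((f1, f2), REFERENCE_VOWELS[symbol]),
    )


if __name__ == "__main__":
    print(nearest_vowel(280.0, 2250.0))  # closest reference is "i"
    print(nearest_vowel(720.0, 1100.0))  # closest reference is "ɑ"
```

Plain Euclidean distance in Hz is the simplest possible choice here. Extracting the formants from audio in the first place is the hard part and is not covered by this sketch.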

Kreijstal commented 3 years ago

> Oooeh! Might be a fun thing to try during Christmas holiday! Do you have a use case in mind @Kreijstal ?

Yes, basically: a language-agnostic way of transcribing human-made sounds. Regardless of the language, you could then use the written symbols to map what was said (in symbolic form) to how it is written. Or you could do the inverse and have a TTS that can pronounce everything out of the box, in any language, because it doesn't speak a language, it just speaks sounds. Even weird ones, or sounds that don't belong to any language. It'd be useful for singers trying to sing in other languages, and for language learners working on foreign pronunciation. In that case the transcription could be extremely verbose, covering pitch (because there are tonal languages) and vowel length, but once that's done it won't have to guess which word was spoken. It just symbolizes what it hears, regardless of whether it's understandable or not.
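
To make the "verbose transcription" idea a bit more concrete, here is a minimal sketch of a token that carries a symbol plus duration and pitch, rendered with the IPA length mark and tone letters. The `Phone` class, the `render` function, the 150 ms length threshold, and the 80 to 300 Hz pitch range are all assumptions made up for the example. The linear pitch-to-tone mapping is also a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LONG_MARK = "\u02d0"  # IPA length mark (ː)
# IPA tone letters from extra-low (˩) to extra-high (˥)
TONE_LETTERS = "\u02e9\u02e8\u02e7\u02e6\u02e5"


@dataclass
class Phone:
    symbol: str                       # IPA symbol for the sound
    duration_ms: float                # how long the sound lasted
    pitch_hz: Optional[float] = None  # None for sounds without a pitch


def render(
    phones: List[Phone],
    long_ms: float = 150.0,
    pitch_range: Tuple[float, float] = (80.0, 300.0),
) -> str:
    """Render phones as IPA text with length marks and tone letters."""
    low, high = pitch_range
    parts = []
    for phone in phones:
        text = phone.symbol
        if phone.duration_ms >= long_ms:
            text += LONG_MARK
        if phone.pitch_hz is not None:
            # Clamp the pitch into the range, then split it into five levels.
            fraction = (phone.pitch_hz - low) / (high - low)
            fraction = min(max(fraction, 0.0), 1.0)
            level = min(int(fraction * 5), 4)
            text += TONE_LETTERS[level]
        parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    print(render([Phone("m", 60.0), Phone("a", 200.0, 290.0)]))
```

Keeping duration and pitch as numbers on the token, and only choosing symbols at render time, means the same data could feed a TTS in the inverse direction without losing detail.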

Also, yes, it's quite likely that IPA doesn't really encode all possible sounds. We could either try to fit different pronunciations to IPA or find another way of tokenizing them. Imo you would just do some sort of unsupervised learning where the machine learns to distinguish the different ways of saying something.
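
One simple form of unsupervised learning that fits this description is clustering. Below is a minimal k-means sketch over small feature vectors, where each cluster would stand for one "way of saying something". The two-number features and the `kmeans` helper are assumptions for illustration only. Real acoustic features would have many more dimensions, and this is not meant as the method the project would actually use.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iterations: int = 20):
    """Plain k-means. Uses the first k points as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


if __name__ == "__main__":
    # Made-up feature vectors forming two loose groups.
    samples = [
        (300.0, 2300.0), (280.0, 2250.0), (700.0, 1100.0),
        (720.0, 1150.0), (310.0, 2200.0), (690.0, 1050.0),
    ]
    labels, centroids = kmeans(samples, k=2)
    print(labels)
```

The cluster indices are arbitrary labels, so they would still have to be mapped onto IPA symbols (or some other token set) afterwards.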

To be clear, I'm not talking about tokenizing everything about a voice sample, since that would literally be some sort of audio compression. I'm thinking about tokenizing everything a voice could do. So you don't have to worry about whether it's female or male, or who is speaking, only about how it is pronounced. Like most of the things that you could pronounce.
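
One known way to reduce the influence of who is speaking is per-speaker normalization of the features. The sketch below z-scores each feature within one speaker's samples, in the style of Lobanov normalization from phonetics. The `normalize_speaker` name and the sample values are assumptions for the example, and using the population standard deviation is a choice made here for simplicity.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Features = Tuple[float, ...]


def normalize_speaker(samples: List[Features]) -> List[Features]:
    """Z-score each feature dimension across one speaker's samples."""
    columns = list(zip(*samples))
    stats = [(mean(col), pstdev(col)) for col in columns]
    for _, spread in stats:
        if spread == 0:
            raise ValueError("a feature has no variation for this speaker")
    return [
        tuple((value - centre) / spread
              for value, (centre, spread) in zip(row, stats))
        for row in samples
    ]


if __name__ == "__main__":
    # Made-up feature values: speaker B is speaker A scaled up by 20%.
    speaker_a = [(300.0, 2300.0), (700.0, 1100.0), (500.0, 1500.0)]
    speaker_b = [(f1 * 1.2, f2 * 1.2) for f1, f2 in speaker_a]
    print(normalize_speaker(speaker_a))
    print(normalize_speaker(speaker_b))
```

After normalization the two made-up speakers give (up to floating-point rounding) the same values, so a later step only sees how a sound sits relative to that speaker's own range.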

KOLANICH commented 3 years ago

Text-to-Speech Synthesis by Paul Taylor (ISBN 978-0-521-89927-7) is highly relevant.