jtrmal / kaldi2020


wish list #10

Open petercwallis opened 3 years ago

petercwallis commented 3 years ago

Have almost used Kaldi a few times for dialog systems. Have used VoiceXML and MRCP, the Alexa Skills Kit, and Google Speech. Currently using hardware from Fortebit (https://fortebit.tech/). Have (twice!) tried using LVCSR systems for dialog, and it does not work, for two reasons.

First, people using a spoken-language interface are not news readers, and they usually have a far-field mic. The problem is to enable the dialog manager to influence recognition - people can't distinguish "recognise speech" from "wreck a nice beach" without context, and the DM can help. MRCP is too heavy-handed; the Echo makes the dialog designer specify full, zero-crossing-delineated utterances; the Fortebit device lets the DM dynamically specify the predefined word groups (the 'intent') to look for.
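
To make "the DM can help" concrete, here is a toy sketch of DM-side n-best rescoring. Everything in it is made up for illustration - `rescore_nbest`, the scores, the phrase set - it is not any real Kaldi (or other) API; it only assumes the recognizer can hand back an n-best list of (text, score) pairs, higher being better:

```python
def rescore_nbest(nbest, active_intents, bias=2.0):
    """Boost hypotheses containing a phrase the dialog manager currently
    expects, so dialog context decides between acoustically similar strings."""
    def biased(item):
        text, score = item
        hits = sum(phrase in text for phrase in active_intents)
        return score + bias * hits
    return max(nbest, key=biased)

# The DM narrows the choice depending on dialog state:
nbest = [("wreck a nice beach", -11.2), ("recognise speech", -11.5)]
print(rescore_nbest(nbest, {"recognise speech"}))  # ('recognise speech', -11.5)
```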

Second, using silence to recognise utterance boundaries is limiting. I am using dynamic "word spotting" on hardware so that my (semantic) grammar can identify TCUs (turn-critical units) as they happen, and hence do incremental dialog. MRCP could fill the same role, but the ASR of the time worked at the utterance level - utterances determined by silences.
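
A toy sketch of what I mean by incremental TCU spotting, assuming only that the recognizer streams growing partial hypotheses; the phrase table and the `on_tcu` callback are invented for illustration:

```python
# Turn-critical units mapped to dialog acts (illustrative values only).
TCUS = {"stop": "barge-in", "hang on": "hold", "that one": "select"}

def spot_tcus(partial_hyps, on_tcu):
    """Fire a callback the moment a TCU shows up in a partial hypothesis,
    rather than waiting for end-of-utterance silence."""
    fired = set()
    for hyp in partial_hyps:              # stream of growing partial strings
        for phrase, act in TCUS.items():
            if phrase in hyp and phrase not in fired:
                fired.add(phrase)
                on_tcu(act, phrase, hyp)

# Usage: the act fires mid-utterance, enabling incremental dialog.
spot_tcus(iter(["tha", "that o", "that one", "that one please"]),
          lambda act, ph, hyp: print(act, "on", repr(ph)))
```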

Could we have please: 1) recognition (not training) running on the GPU of Raspberry Pi computers, installable with 'apt-get install'. I've seen this done for OpenCV's object spotting - an amazing demo, on a Pi Zero!

2) A well-thought-out API, beautifully documented. Perhaps like the one to pigpio (http://abyz.me.uk/rpi/pigpio/). C/C++ is fine, as others will write the wrappers. Note also the way pigpio can run as a server on the local machine - very useful (a sketch of that idea follows this list).

3) I would like to be able to say "Alexa, play <song>" where <song> is drawn from a list of names that have not had ASR training. For instance I might add "ning tong pickleye po" to the song list (as text). Can you make Kaldi recognise a delineated sound form as being a best match for text? Might it be possible to implement something like the Soundex algorithm, perhaps (see the sketch after this list)? Might be silly; don't know. See https://en.wikipedia.org/wiki/Soundex
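
For point 2, a toy sketch of the pigpio-style local daemon idea - a line-oriented TCP server on localhost, with `recognize` as a stand-in for whatever decoder call the toolkit eventually exposes (nothing here is a real Kaldi interface, and the port number is arbitrary):

```python
import socketserver

def recognize(wav_path: str) -> str:
    # Placeholder: a real build would call into the decoder here.
    return "<transcript of %s>" % wav_path

class ASRHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One request per line: a path to a local wav file; reply with text.
        for line in self.rfile:
            text = recognize(line.decode().strip())
            self.wfile.write((text + "\n").encode())

if __name__ == "__main__":
    with socketserver.TCPServer(("127.0.0.1", 8300), ASRHandler) as srv:
        srv.serve_forever()
```

For point 3, American Soundex itself is small enough to write out in full; matching a heard form against the song list is then just comparing codes, word by word for multi-word names. A sketch (the matching heuristic at the end is my own invention, not a tested design):

```python
def soundex(word: str) -> str:
    """American Soundex: first letter plus up to three digits, zero-padded."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        codes.update(dict.fromkeys(letters, digit))
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":                 # h and w are transparent
            continue
        code = codes.get(ch, "")       # vowels reset the previous code
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def best_match(heard: str, song_list):
    """Pick the list entry whose word-wise Soundex codes agree most."""
    def sig(s):
        return [soundex(w) for w in s.split()]
    h = sig(heard)
    return max(song_list, key=lambda s: sum(a == b for a, b in zip(h, sig(s))))

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(best_match("ning tong pickle eye po",
                 ["ning tong pickleye po", "yellow submarine"]))
```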

Looking forward to the meeting. P

jtrmal commented 3 years ago

Thanks for taking the time to write this. y.


petercwallis commented 3 years ago

Thanks for the invite to see the discussion. Two final comments. First, the transition from Kaldi to k2 (?) is a software (re)engineering problem, and some professional help might be useful. I am no professional, but may I suggest you start by assembling a collection of use cases the stakeholders have, and a collection of use cases that k2 is good for, and then think about the overlap. Second, one use case is "I want speech recognition on my robot; how/what can I do with Kaldi?" 25 million Raspberry Pi computers have been sold, btw.

jtrmal commented 3 years ago

Thanks. If you don't mind, I'd prefer to leave this open so that other people will see it.

petercwallis commented 3 years ago

Sure. If you think there might be comments, I will keep an eye on notifications.

Do you know if I can edit the original post? I made the mistake of putting '<' and '>' around 'song', and the underlying HTML swallowed the tag. Not sure it makes sense as it is.

Back to playing with the Sensory card from Fortebit :-/


kkm000 commented 3 years ago

It makes sense, and is in fact used in IVR systems. "One of my order items was missing" - you should recognize which one: "the leapfrog", "that frog thing", etc. A generic LV ASR does not cut it; "the leapfrog" is not a common enough bigram in a large vocabulary to come up as the best hypothesis out of context.

But I think this likely sits above the toolkit level itself; it is rather something you build from the toolkit.

General word spotting involves garbage modeling, which is not a simple thing to do and is also likely task-dependent. "Silence" is also a relative concept: what if there is a TV playing in the background? This is better thought of as speech vs. non-speech discrimination, and may involve literally one-shot adaptation. Doable, but not simple, and, again, it depends on what your typical "non-speech" is. I do not believe any ASR toolkit provides all the "royal shortcuts" out of the box. If there were one, we would not even need to start this new project! :)
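
To illustrate the one-shot flavour, a bare-bones sketch: estimate the ambient floor from the first few frames (assumed non-speech) and flag frames sufficiently above it. With a TV in the background the floor rises, which is exactly why a fixed silence threshold fails. Purely illustrative numpy with made-up defaults; a real system would swap the energy feature for a learned speech/non-speech model, but the adaptation step has the same shape:

```python
import numpy as np

def frame_energy_db(x, fs, frame_ms=25, hop_ms=10):
    """Log energy per frame of a mono signal x sampled at fs."""
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n = 1 + (len(x) - flen) // hop
    frames = np.stack([x[i * hop : i * hop + flen] for i in range(n)])
    return 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)

def speech_mask(x, fs, margin_db=6.0, noise_frames=30):
    """One-shot adaptation: take the first ~300 ms as the ambient floor."""
    e = frame_energy_db(x, fs)
    floor = np.median(e[:noise_frames])
    return e > floor + margin_db       # True where (probably) speech
```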

The current Kaldi API is well documented (see http://kaldi-asr.org/doc/). We are certainly going to keep it that way.

kkm000 commented 3 years ago

"recognise speech" from "wreck a nice beach" [...]
(the 'intent') to look for...
[...] identify TCUs (turn critical units) as they happen [...]

Yes, these are NLU problems, and, IMO, they must be solved with some kind of feedback loop from the NLU model back into the ASR. Certainly a pain point, and a very much unsolved problem. There are symbolic approaches, harking back to the Winograd times, and more recently fully connectionist, attention-based work.

But no, sorry, I do not know how to fully solve them. You may find some solace in the fact that nobody does. :) Nice wishlist!

petercwallis commented 3 years ago

So this is the standard developed in the good old days of IVR systems that did address this problem from the ASR perspective: https://voicexml.org/static/Review/Oct2006/features/MRCP.html The outcome of that body of work was that we dialog people couldn't get telephone banking to work even when the speech recognition was actually good. Luckily for us, the machine learning people think they can solve it, and they have stirred up a shitload of funding :-))) Now I just need to figure out how to get the ML people to share it with those of us who know what the issues are :-((((

As I say in the GitHub issue, the Echo effectively addresses the problem by getting the "dialog designer" to list all the things a person might say (with a little help from ML and lots of data) and then treating these entire utterances as single, silence-delineated "commands". Clever. The silence delineation is zero-crossing something-or-other; other voices are an issue, but I believe the human voice is modellable, and hence background noise is less of an issue than you may think. A pet project I'd like to see is auditory scene analysis, so that the source of a voice can be localised - probably with hardware based on something like the ReSpeaker 6-mic array https://respeaker.io/6_mic_array/ (the demo video is impressive, but ...)
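
For the localisation part, the classic building block is GCC-PHAT between mic pairs; a rough numpy sketch, assuming just a two-mic pair with spacing d in metres (nothing tied to any particular array's SDK):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gcc_phat(sig, ref, fs, max_tau):
    """Time delay of arrival between two channels via GCC-PHAT."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                 # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds

def doa_degrees(sig, ref, fs, d):
    """Bearing of the source relative to broadside of a 2-mic pair."""
    tau = gcc_phat(sig, ref, fs, max_tau=d / SPEED_OF_SOUND)
    return np.degrees(np.arcsin(np.clip(tau * SPEED_OF_SOUND / d, -1, 1)))
```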

P


petercwallis commented 3 years ago

I couldn't see how to put the angle brackets into my original post, so I have put a (heavily edited) version on my blog here: https://tremarden.site/index.php/2020/10/08/asr-for-conversational-ai/