jtrmal / kaldi2020


5 top-level use-cases for k2 #37

Open petercwallis opened 3 years ago

petercwallis commented 3 years ago

Based on last week's meeting, I'd suggest there are two sets of stakeholders: 1) ASR researchers and 2) kaldi users. From a very peripheral view, I think the ASR researchers have two use-cases:

  1. A researcher wants to reproduce results from someone else's paper and then explore variants. A good reason for doing this is when a published result was produced with massive compute power but the success was credited to a novel technique with a cool name.
  2. A researcher wants to develop a novel component for an ASR system and uses kaldi to provide everything else: infrastructure, peripherals for experimentation, and so on.

For the users: a developer wants ASR as part of another project and, for whatever reason, doesn't want to use cloud services.

  3. The developer wants a speech interface based on LVCS recognition and the classic pipeline model (sound to transcript to meaning), which works fine for speech "command" systems but not so well for unconstrained input. For unconstrained input the cloud services work better (big computers and far more training data), and the developer should perhaps not be using kaldi.
  4. The developer has a new language/vocabulary and wants to train kaldi models for use in a speech interface. In this case kaldi is [a good / the only] option.
  5. The developer wants a speech interface for a dialog system where the vocabulary is limited (like command systems) but the input is not (like LVCS). The naive approach is to use "wild cards" in speech grammars or "word spotting". The point is that the developer's envisaged system can provide information that is useful to the ASR, possibly giving better performance on the task (though not as measured by WER) than commercial LVCS systems.

Of these, 1 is good science but not that interesting, and 3 is misguided. From what I saw, it looks like 4 is a popular usage that makes sense. I suspect that 2 and 5 are closely related, but that requires far more conversation. That is the conversation I would like to contribute to.

Can I also point out, Daniel, that kaldi is famous because people use it. It could be more famous if more people used it successfully. Having 'apt-get install' on a Raspberry Pi would guarantee lots of downloads, and although you may not want to do it yourself, it would be good to have done. Could you find the money to pay someone, perhaps, or someone might do it for you if you can put their name on a paper or two.

nshmyrev commented 3 years ago

the developer wants a speech interface for a dialog system where the vocabulary is limited (like command systems) but the input is not (like LVCS).

Hi Peter. Very few people want to recognize commands these days. With a few commands you get a toy system anyway, because it is easier to press a big red button than to shout "left" and "right" and wait half a second for the system to respond. People have gotten used to assistants and want large-vocabulary recognition, sometimes a crazily large vocabulary of a million words, on a Raspberry Pi. Here we have a problem. For example, you can install Vosk with pip on an RPi3, but it is far from accurate due to CPU restrictions (the RPi is much slower than a phone). There is some work to do on quantization and multithreading here.

You are welcome to try it yourself.
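For reference, the usual Vosk Python decoding loop is roughly the following (the model path and audio file here are placeholders; the input is expected to be 16 kHz mono PCM):

```python
# Minimal Vosk decoding sketch; "model" and "test.wav" are placeholders.
import json
import wave
from vosk import Model, KaldiRecognizer

wf = wave.open("test.wav", "rb")              # 16 kHz mono PCM expected
model = Model("model")                        # unpacked Vosk model directory
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):              # end of an utterance
        print(json.loads(rec.Result())["text"])
print(json.loads(rec.FinalResult())["text"])  # flush the last hypothesis
```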

petercwallis commented 3 years ago

Hi, and thanks for the feedback. I was throwing this out there to get exactly this kind of discussion going. I am in over my head with speech research, but that is how cross-disciplinary research has to work.

I will agree that the "big success" of recent years is Large Vocabulary Continuous Speech (LVCS) recognition: my wife is currently correcting automatic transcriptions of her lectures, and the Word Error Rate (WER) is very low; she is impressed. However, just because that is impressive does not mean people don't want to do other things. The Echo is also impressive, and it is not doing LVCS but rather what I call "command" recognition above, just with a stupidly large number of commands (which get mapped onto a much smaller number of "intents"). More explicit examples of command recognition are in-car speech interfaces, and yes, you are right: people would prefer a button (with several caveats).

What one cannot do effectively with LVCS, nor with command recognition, is implement the patterns in the ELIZA chatbot. This mechanism is how most (indeed, all successful?) conversational AI systems work, and, I claim, it is something lots of people want to do. The problem is that these patterns contain "wild cards", that is, regex ".*"-style expressions. I once sat in a talk by serious AI people (an EU 2020 project) in which, they said, the language understanding was achieved using a "magic regular expression", which got a guilty laugh from everyone. I want to do something very similar to these regular expressions, and I am currently using the hardware built for Wake Word Detection to do "word spotting" in continuous speech, but in a rather convoluted way. I believe there might be an opportunity for the kaldi community here.
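To make the pattern idea concrete: an ELIZA-style matcher is just an ordered list of regexes with capture groups for the wild cards. A toy sketch (the rule set here is illustrative, not from any real system):

```python
# Toy ELIZA-style matcher: wild cards are regex capture groups.
import re

RULES = [
    (re.compile(r".*\bi need (.*)", re.I), "Why do you need {0}?"),
    (re.compile(r".*\bi am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r".*\bbook .* to (.*)", re.I), "When do you want to travel to {0}?"),
]

def respond(utterance):
    """Return the response template of the first rule that matches."""
    for pattern, template in RULES:
        m = pattern.match(utterance)
        if m:
            return template.format(*m.groups())
    return "Tell me more."

print(respond("I think I need a holiday"))  # -> Why do you need a holiday?
```

The point is that the ".*" spans are exactly what neither a fixed speech grammar nor plain LVCS-then-NLU handles gracefully.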

Agree about the RPi CPU being tiny (although the Pi4 now has a grown-up CPU with power comparable to a modern mobile phone), but the GPU on a Pi has always been reasonable. I have seen OpenCV doing object detection blindingly fast on the GPU of an RPi Zero. I think Kaldi recognition running on a Linux computer that costs under 10 dollars would make a great demo and a very useful "peripheral".

nshmyrev commented 3 years ago

Thank you Peter, I didn't realize before that the RPi has a GPU accessible through OpenCL. I'll take a closer look at it.

petercwallis commented 3 years ago

Glad to help. Keep in mind that the latest Raspberry Pi is the Pi4, which has a significantly bigger CPU, but the ASR problem looks, to me, like the other things (e.g. ML vision) people have been doing on the GPU of an RPi for a while now. I take it the thread you opened, "Use VideoCore through OpenCL on RPi3", is there to run with this? Great!

This is exciting, but I want to claim that there is another opportunity - a bigger opportunity :-)

That opportunity is to revisit the way downstream Natural Language Understanding (NLU) can help the ASR process by providing expectations. This is not, historically, a new idea, but it does seem to have been lost to history, and it might be an interesting thing to do with kaldi.
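To make "expectations" concrete: Vosk already exposes a small version of this, in that a recognizer can be constructed with a phrase list, so a dialog manager can narrow the search to what it expects next. A minimal sketch (the phrases are illustrative, and this needs a model that supports runtime graph compilation, i.e. the small models):

```python
# Sketch: let the current dialog state constrain recognition via a
# Vosk phrase-list grammar, instead of a full LVCS vocabulary.
import json
from vosk import Model, KaldiRecognizer

model = Model("model")  # placeholder path to a small Vosk model

# The dialog manager's current expectation: a handful of phrases,
# plus [unk] to absorb everything out-of-grammar.
expected = json.dumps(["turn left", "turn right", "stop", "[unk]"])
rec = KaldiRecognizer(model, 16000.0, expected)

# Feed audio exactly as in the normal decoding loop; out-of-grammar
# speech decodes to [unk] rather than to a spurious in-vocabulary word.
```

The interesting research question is what a richer, bidirectional version of this interface would look like, beyond a flat phrase list.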

kkm000 commented 3 years ago

Peter, people indeed use Kaldi on devices like smartphones and tablets, which are pretty much comparable in CPU power to the Pi3 and Pi4. You reminded me I have a pending ticket for fixing the configure script for one of those cases. Most ARM v7 and v8 cores have NEON extensions (128-bit SIMD) that are taken advantage of by math libraries (ACL is more vision-oriented but certainly an option, and, IIRC, OpenBLAS has kernels for NEON too). For the ?gemm-heavy decode part (the AM), halving the precision to float16 more than doubles ?gemm performance (the Pi4's A72 does indeed support float16_t) and cuts the model size in half, without sacrificing much accuracy given good engineering and/or good luck (it's data witchcraft, not data science, anyway).
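A back-of-the-envelope illustration of the storage side of that claim (NumPy won't show the ?gemm speedup, since that needs the NEON float16 kernels; this only demonstrates the size halving and the order of the accuracy loss):

```python
# Sketch: halving weight precision halves storage; the rounding error
# on a matrix-vector product is small (order 1e-3 relative).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048)).astype(np.float32)  # a layer's weights
x = rng.standard_normal(2048).astype(np.float32)

W16 = W.astype(np.float16)
print(W.nbytes // 2**20, "MiB ->", W16.nbytes // 2**20, "MiB")  # 16 MiB -> 8 MiB

y32 = W @ x
y16 = W16.astype(np.float32) @ x  # fp16 storage, fp32 accumulation
rel_err = np.linalg.norm(y32 - y16) / np.linalg.norm(y32)
print(f"relative error: {rel_err:.2e}")
```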

I do not think decoding on the GPU is common, if used at all, as the current Kaldi is CUDA-only. That is likely to change, as the frameworks we are planning to support have device-optimized versions. TensorFlow would certainly make access to the GPU easier.

As for the proposed taxonomy of uses, I can think of areas that would be hard to fit. Education is just one example.

The dichotomy between "researchers" and "users" is also far from hard and fast, as device limitations necessarily require architecting a suitable model. Not necessarily novel on the scale of the invention of LF-MMI or CTC, but certainly a lot of model-building skill and literature research is required. Engineering constraints are much tighter, and tradeoffs are more significant, so it's possible that there is not even a single "apt install", one-size-fits-all model even for a single language. If you are reducing precision, you likely have to account for that in training too. Pruning the network (and Sze et al., 2017 has over 1K citations!) may provide significant benefits as well. As with nearly everything, pulling something out of the box and making it "just work" is only the starting point, especially on resource-constrained hardware.

I do not really think the world has abandoned the idea of marrying ASR and NLU. Recruiting attention for this very task was mentioned at the 3rd session (I forget who the panelist was; my memory for names is nearly non-existent).

petercwallis commented 3 years ago

Sure, I agree it is one big bowl of spaghetti, but as someone said in the first session, presenting it to the world that way really limits one's market segment. To which Daniel said "it's free - no need to think about market segments", and I guess it is that comment that made me want to distinguish between researchers and users. Researchers (well, people wearing their researcher hat) are doing it because it is interesting, and successful researchers are doing it for the publications. Users (people wearing their user hat) want to solve a problem they have with minimal fuss. In this second case the APIs really matter. I am in this second camp and, I hope, bring that perspective to the process. My problem-to-solve is speech for dialog, which is only one problem and quite specific, but I think you'll agree it is a significant "market segment" in the current climate.

For the ASR-NLU interface, I think the world has abandoned the idea of a literature search. My hero, Yorick Wilks, once said in a review of a book by Jackendoff that "the greatest asset in AI research is a memory that does not go back more than 5 years". Ouch! And that was in the 1980s! The four ASR-NLU interfaces I would like to highlight are:

  1. LVCS | NLU (the pipeline model), which is poor for dialog systems for several reasons (timing is a big one); what is more, we can get better WER by making the interface bidirectional.
  2. "Command" systems, in which sound is divided into segments (usually based on silence) and the entire segment is converted to text. This, many do not realise, is how the Echo works.
  3. MRCP (https://voicexml.org/static/Review/Oct2006/features/MRCP.html), a standard argued out in the heyday of IVR systems.
  4. "Word spotting", which is hard and unreliable but, in the context of a dialog system, is my approach of choice (see the sketch below). A commercial API for this is described at https://fortebit.tech/speech-recognition/

None of these are perfect for dialog systems; I think there is an opportunity to do better. I do not know what that better would look like, but I think I have stuff to contribute.
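As a footnote on option 4: word spotting can be faked, crudely, on top of a streaming recognizer by running the dialog patterns over its partial results. A rough sketch (the patterns are toy examples, and 1-best partials are a poor substitute for spotting over lattices or n-best lists):

```python
# Rough sketch of word spotting over streaming 1-best partial hypotheses.
# A real system would score lattices to get usable confidence.
import re

SPOT = [
    re.compile(r"\bbook\b.*\bflight\b", re.I),
    re.compile(r"\bcancel\b", re.I),
]

def spotted(partial_text):
    """Return the first dialog pattern that fires on a partial hypothesis."""
    for pattern in SPOT:
        if pattern.search(partial_text):
            return pattern.pattern
    return None

# Inside the usual decoding loop, when rec.AcceptWaveform(data) returns False:
#     partial = json.loads(rec.PartialResult())["partial"]
#     if spotted(partial):
#         ...  # hand control to the dialog manager early

print(spotted("i want to book a cheap flight"))  # fires the first pattern
```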

Thanks for the conversation btw :-) P
