nyumaya / nyumaya_audio_recognition

Classify audio with neural nets on embedded systems like the Raspberry Pi
https://nyumaya.com
Apache License 2.0

Record my own command and hotword models? #3

Closed torntrousers closed 5 years ago

torntrousers commented 5 years ago

Can I record my own command and hotword models somehow?

yodakohl commented 5 years ago

Creating hotwords is difficult because I need to gather a lot of examples for each one. I will give any suggestions high priority and gather the data. I will try to give people the ability to create their own models, but this is not certain and still some way off. What kind of command did you have in mind? If it's a fairly common word I may get the job done in a few hours.

torntrousers commented 5 years ago

I'm looking for two hotwords - "lamp on" and "lights off".

yodakohl commented 5 years ago

This will be possible very soon. The multi_streaming_example is intended to show how to chain different models together. The Speech_commands_subset model is already released and contains 'yes, no, up, down, left, right, on, off, stop, follow, play'.

I'm trying to release the objects model today, which contains 'music, radio, television, door, water, computer, temperature, light, house'.

There are certain limitations to this concept: each model takes a certain amount of CPU. I'm working on that. But phrases like <marvin, temperature up>, <marvin, music play>, <marvin, light on> or even <marvin, light two on> should work without additional CPU.

Using only "light on" as a command will be possible, but might be prone to false detections.

torntrousers commented 5 years ago

I don't really want to have to say hotword + command; I'd much prefer just "lamp on" and "lights off". I appreciate that this could give false triggers, but I hope that setting the mic level low, so you have to be near, would help. I've experimented a bit with things like Porcupine and Snowboy: using the different words "lamp" and "light" makes the two hotwords more distinct, which seemed to help a lot with detecting the right one. "Lights on" and "lights off" were often confused, so they didn't work well.

yodakohl commented 5 years ago

I would suggest just trying it. I just released the objects_small model. The multi_streaming_example has no command-line arguments yet, so you would have to modify the file, but it's not much code.

You need to change the libpath for the pi zero.  Then you would need to set the subset_small as input for the first detector. Then you check if the word "lamp" is recognized. If something else is recognized it's ignored. Then for a certain timeout the objects_model will run the detection and listen for "on" and "off".  The model should recognize the difference between on and off pretty well, but I will try to add the word lamp to the objects model in the next release.
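
The chaining described above can be sketched as a small state machine. This is a minimal illustration only, not the code from the multi_streaming_example: the class name, the `feed`/`active_stage` methods and the 2 second timeout are assumptions made for the sketch, and the labels are assumed to arrive from whichever model the caller runs for the active stage.

```python
from typing import Optional, Tuple


class TwoStageDetector:
    """Hypothetical helper: wait for a trigger word, then for a command word."""

    def __init__(self, trigger: str = "lamp",
                 commands: Tuple[str, ...] = ("on", "off"),
                 timeout_s: float = 2.0) -> None:
        self.trigger = trigger
        self.commands = commands
        self.timeout_s = timeout_s      # assumed length of the command window
        self._armed_until = -1.0        # time at which the command window closes

    def active_stage(self, now: float) -> str:
        """Tell the caller which model's output should be fed next."""
        return "command" if now < self._armed_until else "trigger"

    def feed(self, label: Optional[str], now: float) -> Optional[str]:
        """Consume one recognized label; return a full phrase when complete."""
        if self.active_stage(now) == "trigger":
            # Anything other than the trigger word is ignored.
            if label == self.trigger:
                self._armed_until = now + self.timeout_s
            return None
        if label in self.commands:
            self._armed_until = -1.0    # go back to waiting for the trigger
            return f"{self.trigger} {label}"
        return None
```

In use, the caller would check `active_stage(now)` on every audio frame, run only the matching model, and pass its label to `feed`. Running one model at a time is one way to keep the CPU cost close to that of a single model.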

yodakohl commented 5 years ago

I just coded the light_switch example; however, the accuracy for the word "light" is still low. I will gather more examples of it and train the model further.

torntrousers commented 5 years ago

I've just tried it too, using the multi streaming example with Command/objects_small.tflite for the hotword and Command/subset_small.tflite for the action. Not great: it's hard to get "light" detected, and once it has been, it's hard to get "on" or "off" detected. I appreciate it's hard to train the models, but I do think it's likely to work much better if there were just the two hotwords "lamp on" and "lights off".

torntrousers commented 5 years ago

The "marvin" and "sheila" hotwords seem to be detected really well. Is there a way to have a model with just those two words? Then I'd have "marvin" mean switch the light on, and "sheila" switch the light off.

yodakohl commented 5 years ago

Marvin and Sheila have been trained on a lot more examples. It should be possible to train light and lamp to a similar level, but the current release is based on a limited dataset. I will schedule a run for a combined marvin and sheila model, which should be finished by tomorrow. I have also started to gather more light and lamp examples, and I'm confident that a huge improvement is still possible. Since your hotwords are a very common use case, I'm looking into making a "lights on", "lamps off" model, but getting the data will be tricky.

torntrousers commented 5 years ago

Cool, thank you so much for the combined marvin and sheila model.

(Being pedantic, my preference would be "lamp on" (singular lamp) and "lights off".) I could try to get people to record themselves saying those, if that's the sort of thing that would help?

yodakohl commented 5 years ago

Marvin was trained on 2000 examples, so having a few examples won't help too much in general. Having a few examples of you saying a word will definitely help, but only for your voice. I will be looking at things like combining the audio files of "lamp" and "on" and some other tricks.
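
As a rough illustration of the idea of combining the audio files of "lamp" and "on", the sketch below joins two clips with a short silence between them and builds every pairing for a few gap lengths. It is a naive sketch, not the method actually used for training: the function names, the 16 kHz sample rate and the gap lengths are assumptions, and clips are represented as plain lists of PCM samples.

```python
from itertools import product
from typing import Iterable, List, Sequence


def join_clips(first: Sequence[int], second: Sequence[int],
               sample_rate: int = 16000, gap_ms: int = 100) -> List[int]:
    """Concatenate two clips with gap_ms of silence (zeros) between them."""
    gap = [0] * (sample_rate * gap_ms // 1000)
    return list(first) + gap + list(second)


def synthesize_phrases(first_clips: Iterable[Sequence[int]],
                       second_clips: Iterable[Sequence[int]],
                       gaps_ms: Sequence[int] = (50, 100, 200),
                       sample_rate: int = 16000) -> List[List[int]]:
    """Build one synthetic phrase per (first clip, second clip, gap) combination."""
    return [join_clips(a, b, sample_rate, g)
            for a, b, g in product(first_clips, second_clips, gaps_ms)]
```

With N recordings of the first word, M of the second and K gap lengths this yields N x M x K synthetic phrases. Speakers pronounce a phrase differently from two isolated words, so such data would at best supplement real recordings.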

torntrousers commented 5 years ago

Hello, me again. As much as the kids and I quite like saying Marvin and Sheila to turn the lamps in the living room on and off, I'd still like to be able to train my own models with more sensible words. You mentioned earlier "I will try to give people the ability to create their own models, but this is not certain and a bit in the future." So a big vote from me for that.

I've done the TensorFlow Simple Audio Recognition example. If I hacked around with that, would the models it generates work with your nyumaya_audio_recognition, or does the model need to be specific to it?

yodakohl commented 5 years ago

To be able to train custom models for people I will need enough audio samples. I don't think the casual user is happy with recording a few hundred samples of each keyword (maybe this is a misconception?). I'm currently investigating methods to reduce the number of required samples, but the outcome is a bit uncertain.

You can run an experiment: collect your samples and run the Simple Audio Recognition example to get an idea of how well it will perform.

Currently, the usage of this repo is low enough that I can train you a model if you send me the data. If more users request this I will hit a limit very soon and need to optimize and automate tasks.

The model from the Simple Audio Recognition example won't run straight away and needs some changes. It needs to be converted to a TensorFlow Lite model, ideally quantized. It also has to use the same feature extraction, architecture and input dimensions, among many other changes. But it's conceptually the same.

torntrousers commented 5 years ago

"I don't think the casual user is happy with recording a few hundred samples of each keyword (maybe this is a misconception?)" I might not be your average casual user, but I am happy to do that sort of thing.

I guess what I'm asking is whether you would open source your model generation code too. Your code is great: compared to other projects it is clear and easy to understand and use, and it seems really accurate and usable. All your runtime code is open source, the cleaned speech data you've produced is open, and over here you note that a problem with things like Porcupine and Snowboy is that they're not open. It would be so good if there was a complete end-to-end solution for this. Someone's going to do that sooner or later, so why not Nyumaya!?

yodakohl commented 5 years ago

I'm always trying to open-source as much code as possible and will continue to do so. But in order to be able to continue to work on this, it has to generate revenue at some point. Training custom and premium models currently seems like the only way to do this. For the time being, I can't open source the training code.

What I can do is speed up the process of training custom models on user data, by doing it manually at first and automating steps along the way. I have already started running tests to determine how much data is required and to find a simple way to capture it.