watsonbox / pocketsphinx-ruby

Ruby speech recognition with Pocketsphinx
MIT License

Hypothesis changes for the same audio file. #10

Open ojak opened 9 years ago

ojak commented 9 years ago

I'm seeing unexpected behavior while processing a fixed audio file: the hypothesis occasionally changes between decodes of the same file. I'm not sure whether this is intended behavior, a byproduct of how the decoder works, or a configurable option (some sort of random/pseudorandom generator, noise reduction, phonetic hash sorting issue, warm-up, etc.).

Here's an example of what I'm seeing using a 16-bit, 16000Hz PCM Wave file containing the spoken word _"hello"_:

$ pry
> require 'pocketsphinx-ruby'
> Pocketsphinx.disable_logging
> decoder = Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default)
> 5.times { |n| decoder.decode('hello.wav'); puts decoder.hypothesis }
oh
hello
hello
hello
hello
=> 5

Anybody have any insight as to what might be happening?

nshmyrev commented 9 years ago

The decoder requires CMN estimates (those numbers printed in the log) at the beginning. We might implement proper CMN one day, we're just not there yet.

You can set an initial CMN estimate for your device with the -cmninit option, or edit model/feat.params.

ojak commented 9 years ago

@nshmyrev I see, thanks for pointing that out. However, I'm unclear what the initial value would be, since it's a moving target based on the input device. My understanding is that any time a different microphone is used, or a microphone channel input level is changed on a device, the CMN needs to be adjusted (CMN = Cepstral Mean Normalization, for anyone new to this who is reading).

I found the CMUSphinx ticket from 2010 where you discuss this issue in more detail. In it, you also mentioned a possible workaround:

To calibrate CMN you need speech unfortunately. It will not give you proper estimate on silence. Algorithm I propose is the following: 1) no initial estimate -> record full utterance -> normalize only last CMN (current mode) -> decode 2) few decoding cycles are done -> have reliable CMN estimate -> normalize CMN (live_mode)

From that ticket, if I understand it correctly, you are saying that you will always need to decode the first utterance (probably poorly) to determine a CMN for the device, and then re-decode the same utterance with a properly set CMN. For example:

If that's the case, does that mean every new device basically needs to be decoded in two passes for each user session?
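A minimal sketch of the two-pass idea described above, for illustration only. `two_pass_decode` is a hypothetical helper, not part of pocketsphinx-ruby; it assumes only that the decoder responds to `decode(path)` and `hypothesis`, as in the session at the top of this issue, and that the first pass leaves the decoder with a better CMN estimate.

```ruby
# Hypothetical helper: decode the same file twice and keep only the second
# hypothesis, on the assumption that the first pass warms up the decoder's
# CMN estimate. `decoder` is any object responding to #decode(path) and
# #hypothesis.
def two_pass_decode(decoder, path)
  decoder.decode(path) # pass 1: result discarded, CMN gets estimated
  decoder.decode(path) # pass 2: same audio, warmed-up CMN
  decoder.hypothesis
end
```

With the decoder from the original report this would be called as `two_pass_decode(decoder, 'hello.wav')`. The cost is roughly double the decode time for that utterance.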

watsonbox commented 9 years ago

Very interesting, I hadn't realized this. It might be nice to have some facility in pocketsphinx-ruby which can figure out some values for a given device and serialize them for later re-use. Would these values be worth re-using for a given device with the same sensitivity/noise level? Are they independent of speaker/accent? Perhaps an implementation could even detect when the pre-supplied values were out by a certain tolerance?

ojak commented 9 years ago

Well, this is what I saw for the "hello" example above:

> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "40,3,-1"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil

Since the cmninit value is remembered for the next decode (ie. it's stateful), this is potentially problematic for any scenario where:

  1. This is the first decode for a newly booted application
  2. This is the first decode from a new device
  3. This is the first decode at a new input level
  4. This is an API that is handling decodes from multiple devices (some new, some existing)

The main problem I see here from a production application standpoint is that each device (ie. each session) will have a different value that needs to be warmed up (ie. Pass 1 from above), and persisted (ie. Pass 2 from above, for example, via a unique device ID, or session cookie, etc). In any real-world application, this sort of thing is the job of the application controller to ingest the session identifier and the application model to persist, and not so much that of the pocketsphinx wrapper. That said, setting a default CMN is an unreliable approach and a shifting hypothesis will probably turn most developers away unless they understand this issue (ie. using the gem gives the initial impression that the voice recognition simply doesn't work).

@watsonbox This is a critically important issue to get right; otherwise the gem will perform very poorly (and be unusable in most cases), even though it's not a gem limitation per se. I'm trying to think of the most sensible way to handle it: a detailed README overview, an explicit method for per-speaker initialization, or something else?
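To make the per-device persistence idea concrete, here is a rough sketch under stated assumptions. `remember_cmn` and `cmninit_for` are hypothetical helpers, the store is a plain Hash standing in for whatever the application model would really use, and the device ID is whatever session identifier the application already has.

```ruby
# Hypothetical helpers, not part of pocketsphinx-ruby. `store` is any
# Hash-like object keyed by a device or session ID.

# Remember the CMN vector (array of numbers) learned for a device.
def remember_cmn(store, device_id, cmn)
  store[device_id] = cmn.map(&:to_f)
end

# Return a comma-separated string in the form the cmninit option takes,
# or nil when the device has not been seen yet (the cold-start case).
def cmninit_for(store, device_id)
  cmn = store[device_id]
  cmn && cmn.join(',')
end
```

When `cmninit_for` returns a string, it could be assigned with `decoder.configuration['cmninit'] = ...` before the first decode. When it returns nil, the application still has to warm up the decoder some other way.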

ojak commented 9 years ago

@watsonbox Another possibility is to enable a brute-force approach by default. Something like:

  1. If the input is an audio file, the default is a two-pass decode (that can optionally be disabled). It's a pretty large performance hit (at least 2x slower), but should result in much more reliable decodes.
  2. If the input is live, the first utterance is automatically two-pass decoded (optionally disabled), and every subsequent utterance is one-pass decoded using the CMN from the first utterance. Optionally, every N seconds an automatic two-pass decode could be triggered to re-calibrate?

nshmyrev commented 9 years ago

decoder.configuration.details('cmninit')[:value] => "8.0"

This looks like a software bug, it should be 40,3,-1 from the beginning. Let me check this issue.

ojak commented 9 years ago

@nshmyrev I'm having trouble locating any documentation regarding how cmninit works and what the comma-delimited values are. Could you point me toward any detailed docs or code? Thx.

ojak commented 9 years ago

@nshmyrev @watsonbox Actually, it looks as though there's a bunch of parameters that are not being set during the initialization of Configuration that are somehow being set magically after the first decode:

> decoder = Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default)
> decoder.configuration.changes
=> []
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.changes
=> [{:name=>"cmninit", :type=>:string, :default=>"8.0", :required=>false, :value=>"40,3,-1", :info=>"Initial values (comma-separated) for cepstral mean when 'prior' is used"},
 {:name=>"fdict", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/noisedict", :info=>"Noise word pronunciation dictionary input file"},
 {:name=>"featparams", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/feat.params", :info=>"File containing feature extraction parameters."},
 {:name=>"lifter", :type=>:integer, :default=>0, :required=>false, :value=>22, :info=>"Length of sin-curve for liftering, or 0 for no liftering."},
 {:name=>"lowerf", :type=>:float, :default=>133.33334, :required=>false, :value=>130.0, :info=>"Lower edge of filters"},
 {:name=>"mdef", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/mdef", :info=>"Model definition input file"},
 {:name=>"mean", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/means", :info=>"Mixture gaussian means input file"},
 {:name=>"nfilt", :type=>:integer, :default=>40, :required=>false, :value=>25, :info=>"Number of filter banks"},
 {:name=>"sendump", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/sendump", :info=>"Senone dump (compressed mixture weights) input file"},
 {:name=>"svspec", :type=>:string, :default=>nil, :required=>false, :value=>"0-12/13-25/26-38", :info=>"Subvector specification (e.g., 24,0-11/25,12-23/26-38 or 0-12/13-25/26-38)"},
 {:name=>"tmat", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/transition_matrices", :info=>"HMM state transition matrix input file"},
 {:name=>"transform", :type=>:string, :default=>"legacy", :required=>false, :value=>"dct", :info=>"Which type of transform to use to calculate cepstra (legacy, dct, or htk)"},
 {:name=>"upperf", :type=>:float, :default=>6855.4976, :required=>false, :value=>6800.0, :info=>"Upper edge of filters"},
 {:name=>"var", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/variances", :info=>"Mixture gaussian variances input file"}]
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil

ojak commented 9 years ago

Ok, so it appears that something is going a bit haywire in the Configuration class:

At this point, I'm pretty confused as to what's actually going on with cmninit and what effect, if any, it is having on the resulting decoder. Below are two examples that reproduce the issue consistently:

Example when using default model files (ie. not setting hmm):

> decoder = Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default)
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "40,3,-1"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil

Example when using a custom adapted model (ie. setting hmm):

> configuration = Pocketsphinx::Configuration.default
> configuration['hmm'] = '/tmp/custom_sphinxtrain_acoustic_model_folder'
> decoder = Pocketsphinx::Decoder.new(configuration)
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
cloaked
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil

Here's a link to the hello.wav file I'm using.

Any thoughts or help with fixing this would be great, as I'm pretty sure it makes the gem unusable without hacking around and decoding with multiple passes. Thx!

ojak commented 9 years ago

@nshmyrev Also, I tried your suggestion regarding setting the -cmninit 40,3,-1 in the feat.params file, and that resulted in an empty decode on the first decode (the second decode works as in all the other attempts above). This seems to support the notion that something is incomplete or broken during the default configuration step.

watsonbox commented 9 years ago

You should be aware that the -cmninit value is just that - an initialization. It doesn't change after each decoding. However, things do get a little confusing because immediately before the first decode these values are read from feat.params and override the configuration. The values of 40,3,-1 are indeed coming from the default feat.params, but are being set before, not after, the first decoding. You'll note from the logs something like:

INFO: cmn_prior.c(131): cmn_prior_update: from < 40.00  3.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
INFO: cmn_prior.c(149): cmn_prior_update: to   < 48.76  9.26 -4.89 16.89 -25.00  8.89 16.14 -2.51 -4.90 -9.73  4.75 -4.32 -1.49 >

These are the updated CMN values after the first decoding. What you can do is remove the -cmninit value from feat.params and then use those more appropriate values by setting them in the config:

decoder = Decoder.new(Configuration.default)
decoder.configuration['cmninit'] = %w{48.76  9.26 -4.89 16.89 -25.00  8.89 16.14 -2.51 -4.90 -9.73  4.75 -4.32 -1.49}.join(',')

Then you'll get correct recognition the first time. Doing this automatically would require a way of getting these dynamic values out of Pocketsphinx and then perhaps comparing them to the previous values using some tolerance to decide whether audio needed re-decoding. I'm not against putting this kind of thing into pocketsphinx-ruby. It could be an alternative Decoder implementation. Some info in the README would probably also be useful.
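As a rough illustration of the tolerance comparison, the sketch below reads the CMN vector out of a `cmn_prior_update` log line like the ones quoted above and compares two vectors. Both helpers are hypothetical and not part of pocketsphinx-ruby; scraping the log is only a stand-in for a proper API, and any tolerance value is an arbitrary choice.

```ruby
# Hypothetical helpers, not part of pocketsphinx-ruby.

# Extract the numbers between "<" and ">" from a cmn_prior_update log line.
# Returns an empty array when the line contains no such vector.
def parse_cmn_line(line)
  inner = line[/<([^>]*)>/, 1]
  inner ? inner.split.map(&:to_f) : []
end

# True when both vectors have the same length and every coefficient
# differs by at most `tolerance`.
def cmn_within_tolerance?(previous, current, tolerance)
  return false unless previous.length == current.length
  previous.zip(current).all? { |a, b| (a - b).abs <= tolerance }
end
```

Applied to the "from" and "to" lines above, the first coefficient alone moves from 40.00 to 48.76, so a tight tolerance would flag the first decode as one worth repeating.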

ojak commented 9 years ago

@watsonbox Regarding -cmninit, if it is an initialization only, then I think that most would expect that calling Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default) would complete such initialization in a single step. If I'm understanding you correctly, the current implementation has two separate "initialization" processes, which is extremely confusing.

This is verified in the example above where decoder.configuration.changes returns different results after running a decode (even if the second initialization is actually happening just before the decode).

ojak commented 9 years ago

@watsonbox Thanks for the help. I removed -cmninit from feat.params in my local Pocketsphinx installation (located at /usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/feat.params), then ran a decode and copied the values from the console dump that are appropriate for my machine, the way you suggested above. I did manage to get a successful decode on the first pass by executing the following before the first decode:

# Apply initialization values from my machine
decoder.configuration['cmninit'] = %w{60.34  9.26  8.85 -8.44 -20.02  3.01 -6.74 -2.74 -9.94 -1.46  1.94  0.64  9.29}.join(',')

Based on my experience with this process, I'm more convinced that the default behavior for the gem would be greatly improved by a brute-force configuration as suggested above. This is mainly because although there's so much great work in this gem, the current out-of-the-box configuration doesn't really work as expected and gives the wrong impression of the otherwise powerful tools. It would suck if other developers were to pass it by because they were unaware of all the configuration nuances contained in this issue ticket.

I'd personally rather see an implementation that favors decoder accuracy and configuration clarity over speed by default, and then allows for developers to improve execution speed via optimizations (configuration files or monkeying around with cutting and pasting cmninit values from a console dump). To recap, I'd vote for the following default gem behavior:

What do you think?

nshmyrev commented 9 years ago

@watsonbox

However, things do get a little confusing because immediately before the first decode these values are read from feat.params and override the configuration.

Why do you lazy init here? There are many points of failure during initialization, and I think they must be reported in the constructor.

watsonbox commented 9 years ago

Okay so I think there are two issues here:

  1. The two-phase decoder initialization. I hadn't actually realized that the act of initializing the decoder was altering configuration until investigating this issue. I usually prefer to lazy initialize things as a matter of course since this allows for more flexible dependency injection and resources are not allocated until required. In this case I agree that it would be more logical if the ps_init actually happened on a call to Decoder.new. Note that there would still be a 'two-phase' initialization - one when creating a Configuration and one when creating a Decoder.
  2. A way around the CMN 'cold start' problem. I like your suggestions but here I'd need to take a good look into what API methods are available for getting this dynamic CMN data out after decoding. I'll do this as soon as I have time. Any PRs more than welcome!

I've created #12 to track point 1, so this issue is only concerned with point 2.

nshmyrev commented 9 years ago

I agree that this is an important issue and it was mentioned by our users frequently. I believe we can fix this in pocketsphinx itself, it just needs some work.

I'd suggest moving this issue to pocketsphinx.

watsonbox commented 9 years ago

@nshmyrev Yes, I agree that the best solution would be to resolve this in Pocketsphinx itself.

In the meantime I've had a play with a CMNDecoder implementation which will repeat the decoding if the CMN values are not within a certain tolerance of the previous set. However, this is really just an experiment since any likely solution would need to address the same issues with #process_raw as used by the higher level SpeechRecognizers.

That would require pocketsphinx-ruby to cache each utterance for possible replay, which is not currently the case and leads me to think that this would be better done in the C library.
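For readers following along, the repeat-until-stable idea could look roughly like the sketch below. This is not the CMNDecoder experiment itself. `decode_with_replay` is a hypothetical name, and `read_cmn` is an assumed callable returning the decoder's current CMN vector; no such accessor is confirmed to exist in pocketsphinx-ruby. Because it works on a file path, it sidesteps the utterance caching that live input would need.

```ruby
# Hypothetical sketch. `decoder` responds to #decode(path) and #hypothesis;
# `read_cmn` is an assumed callable returning the current CMN vector as an
# array of floats.
def decode_with_replay(decoder, path, read_cmn, tolerance: 1.0, max_passes: 3)
  previous = read_cmn.call
  max_passes.times do
    decoder.decode(path)
    current = read_cmn.call
    stable = previous.length == current.length &&
             previous.zip(current).all? { |a, b| (a - b).abs <= tolerance }
    return decoder.hypothesis if stable
    previous = current # CMN moved too far: replay the same audio
  end
  decoder.hypothesis # give up after max_passes and return the last result
end
```

The `max_passes` cap keeps a noisy input from looping forever; both it and the default tolerance are arbitrary placeholders.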

hiyassat commented 7 years ago

I used a workaround: I added a pre-recorded audio clip (2 sec) to the beginning of the audio file I intend to decode. It works perfectly for me, with no need to calibrate CMN or set cmninit.
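A minimal sketch of that workaround, assuming both clips are headerless raw PCM in the same format (16-bit at 16000 Hz as in the original report, and assumed mono), so that plain concatenation is valid. `prepend_warmup` is a hypothetical helper; WAV files carry a header that would need stripping or rewriting first.

```ruby
# Hypothetical helper. Takes two raw PCM byte strings in the same format
# and returns the first `seconds` of the warm-up clip followed by the
# utterance. Defaults assume 16-bit mono audio at 16000 Hz.
def prepend_warmup(warmup_pcm, utterance_pcm, seconds: 2, sample_rate: 16_000, bytes_per_sample: 2)
  wanted = seconds * sample_rate * bytes_per_sample
  warmup_pcm.b.byteslice(0, wanted) + utterance_pcm.b
end
```

The hypothesis for the combined audio will presumably also contain whatever is recognized in the warm-up clip, so the caller would need to discard that part.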