Open ojak opened 9 years ago
Decoder requires CMN estimations (those numbers printed in log) in the beginning. We might implement proper CMN one day, just not there yet.
You can set initial CMN estimation for your device with -cmninit option or edit model/feat.params.
@nshmyrev I see, thanks for pointing that out. However, I'm unclear what the initial value would be, since it's a moving target based on the input device. My understanding is that any time a different microphone is used, or a microphone channel input level is changed on a device, the CMN needs to be adjusted (CMN = Cepstral Mean Normalization, for anyone new to this who is reading).
I found the CMUSphinx ticket from 2010 where you discuss this issue in more detail. In it, you also mentioned a possible workaround:
To calibrate CMN you need speech unfortunately. It will not give you proper estimate on silence. Algorithm I propose is the following: 1) no initial estimate -> record full utterance -> normalize only last CMN (current mode) -> decode 2) few decoding cycles are done -> have reliable CMN estimate -> normalize CMN (live_mode)
From that ticket, if I understand it correctly, you are saying that you will always need to decode the first utterance (probably poorly) to determine a CMN for the device, and then re-decode the same utterance with a properly set CMN. For example:
If that's the case, does that mean every new device basically needs to be decoded in two passes for each user session?
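A minimal sketch of that two-pass idea, for anyone who wants to try it. It assumes only the two decoder calls used elsewhere in this thread (decode and hypothesis); the helper name two_pass_decode is hypothetical and not part of the gem.

```ruby
# Hypothetical helper (not part of pocketsphinx-ruby): decode the same file
# twice and report both results. It relies only on #decode(path) and
# #hypothesis, as used in the console sessions in this thread.
def two_pass_decode(decoder, path)
  decoder.decode(path)                # pass 1: lets the decoder update its CMN estimate
  first = decoder.hypothesis.to_s
  decoder.decode(path)                # pass 2: same audio, warmed-up CMN estimate
  second = decoder.hypothesis.to_s
  { first: first, second: second, changed: first != second }
end
```

If :changed comes back true, the first pass was probably decoded with a poor CMN estimate and the second result is the one to keep.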
Very interesting, I hadn't realized this. It might be nice to have some facility in pocketsphinx-ruby which can figure out some values for a given device and serialize them for later re-use. Would these values be worth re-using for a given device with the same sensitivity/noise level? Are they independent of speaker/accent? Perhaps an implementation could even detect when the pre-supplied values were out by a certain tolerance?
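As a rough illustration of the serialization part only, here is a sketch that keeps one cmninit string per device in a JSON file. The CmnStore class, its method names, and the device identifier are all hypothetical; nothing like this exists in the gem today, and whether stored values stay valid across speakers is still an open question.

```ruby
require 'json'

# Hypothetical store (not part of pocketsphinx-ruby): maps a device
# identifier to a comma-separated cmninit string, persisted as JSON.
class CmnStore
  def initialize(path)
    @path = path
  end

  # Returns the stored cmninit string for the device, or nil if unknown.
  def fetch(device_id)
    all[device_id]
  end

  # Saves (or overwrites) the cmninit string for the device.
  def store(device_id, cmninit)
    data = all
    data[device_id] = cmninit
    File.write(@path, JSON.pretty_generate(data))
  end

  private

  # Reads the whole file; an absent file is treated as an empty store.
  def all
    File.exist?(@path) ? JSON.parse(File.read(@path)) : {}
  end
end
```

An application could look up the value before creating the decoder and assign it to configuration['cmninit'] when one is found.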
Well, this is what I saw for the "hello" example above:
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "40,3,-1"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil
Since the cmninit value is remembered for the next decode (ie. it's stateful), this is potentially problematic for any scenario where:
The main problem I see here from a production application standpoint is that each device (ie. each session) will have a different value that needs to be warmed up (ie. Pass 1 from above), and persisted (ie. Pass 2 from above, for example, via a unique device ID, or session cookie, etc). In any real-world application, this sort of thing is the job of the application controller to ingest the session identifier and the application model to persist, and not so much that of the pocketsphinx wrapper. That said, setting a default CMN is an unreliable approach and a shifting hypothesis will probably turn most developers away unless they understand this issue (ie. using the gem gives the initial impression that the voice recognition simply doesn't work).
@watsonbox This is such a critically important issue to get right, otherwise the gem will perform very poorly (and be unusable in most cases), even though it's not a gem limitation, per se. I'm trying to think of the most sensible way to handle this sort of issue (ie. detailed README overview, explicit method for per speaker initialization, or something else)?
@watsonbox Another possibility is to enable a brute-force approach by default. Something like:
decoder.configuration.details('cmninit')[:value] => "8.0"
This looks like a software bug, it should be 40,3,-1 from the beginning. Let me check this issue.
@nshmyrev I'm having trouble locating any documentation regarding how cmninit works and what the comma-delimited values are. Could you point me toward any detailed docs or code? Thx.
@nshmyrev @watsonbox Actually, it looks as though there's a bunch of parameters that are not being set during the initialization of Configuration that are somehow being set magically after the first decode:
> decoder = Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default)
> decoder.configuration.changes
=> []
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.changes
=> [{:name=>"cmninit", :type=>:string, :default=>"8.0", :required=>false, :value=>"40,3,-1", :info=>"Initial values (comma-separated) for cepstral mean when 'prior' is used"},
{:name=>"fdict", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/noisedict", :info=>"Noise word pronunciation dictionary input file"},
{:name=>"featparams", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/feat.params", :info=>"File containing feature extraction parameters."},
{:name=>"lifter", :type=>:integer, :default=>0, :required=>false, :value=>22, :info=>"Length of sin-curve for liftering, or 0 for no liftering."},
{:name=>"lowerf", :type=>:float, :default=>133.33334, :required=>false, :value=>130.0, :info=>"Lower edge of filters"},
{:name=>"mdef", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/mdef", :info=>"Model definition input file"},
{:name=>"mean", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/means", :info=>"Mixture gaussian means input file"},
{:name=>"nfilt", :type=>:integer, :default=>40, :required=>false, :value=>25, :info=>"Number of filter banks"},
{:name=>"sendump", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/sendump", :info=>"Senone dump (compressed mixture weights) input file"},
{:name=>"svspec", :type=>:string, :default=>nil, :required=>false, :value=>"0-12/13-25/26-38", :info=>"Subvector specification (e.g., 24,0-11/25,12-23/26-38 or 0-12/13-25/26-38)"},
{:name=>"tmat", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/transition_matrices", :info=>"HMM state transition matrix input file"},
{:name=>"transform", :type=>:string, :default=>"legacy", :required=>false, :value=>"dct", :info=>"Which type of transform to use to calculate cepstra (legacy, dct, or htk)"},
{:name=>"upperf", :type=>:float, :default=>6855.4976, :required=>false, :value=>6800.0, :info=>"Upper edge of filters"},
{:name=>"var", :type=>:string, :default=>nil, :required=>false, :value=>"/usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/variances", :info=>"Mixture gaussian variances input file"}]
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil
Ok, so it appears that something is going a bit haywire in the Configuration class:
- When hmm is left as default before the decode, cmninit changes from 8 to 40,3,-1 after the decode
- When hmm is set to a custom model (generated with sphinxtrain-ruby), cmninit remains at the default value of 8 after the decode
At this point, I'm pretty confused as to what's actually going on with cmninit and what effect, if any, it is having on the resulting decoder. Below are two examples that reproduce the issue consistently:
Example when using default model files (ie. not setting hmm):
> decoder = Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default)
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
oh
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "40,3,-1"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil
Example when using a custom adapted model (ie. setting hmm):
> configuration = Pocketsphinx::Configuration.default
> configuration['hmm'] = '/tmp/custom_sphinxtrain_acoustic_model_folder'
> decoder = Pocketsphinx::Decoder.new(configuration)
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
cloaked
=> nil
> decoder.configuration.details('cmninit')[:value]
=> "8.0"
> decoder.decode('hello.wav'); puts decoder.hypothesis
hello
=> nil
Here's a link to the hello.wav file I'm using.
Any thoughts or help with fixing this would be great, as I'm pretty sure it makes the gem unusable without hacking around and decoding with multiple passes. Thx!
@nshmyrev Also, I tried your suggestion regarding setting the -cmninit 40,3,-1 in the feat.params file, and that resulted in an empty decode on the first decode (the second decode works as in all the other attempts above). This seems to support the notion that something is incomplete or broken during the default configuration step.
You should be aware that the -cmninit value is just that - an initialization. It doesn't change after each decoding. However, things do get a little confusing because immediately before the first decode these values are read from feat.params and override the configuration. The values of 40,3,-1 are indeed coming from the default feat.params, but are being set before, not after, the first decoding. You'll note from the logs something like:
INFO: cmn_prior.c(131): cmn_prior_update: from < 40.00 3.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 48.76 9.26 -4.89 16.89 -25.00 8.89 16.14 -2.51 -4.90 -9.73 4.75 -4.32 -1.49 >
These are the updated CMN values after the first decoding. What you can do is remove the -cmninit value from feat.params and then use those more appropriate values by setting them in the config:
decoder = Decoder.new(Configuration.default)
decoder.configuration['cmninit'] = %w{48.76 9.26 -4.89 16.89 -25.00 8.89 16.14 -2.51 -4.90 -9.73 4.75 -4.32 -1.49}.join(',')
Then you'll get correct recognition the first time. Doing this automatically would require a way of getting these dynamic values out of Pocketsphinx and then perhaps comparing them to the previous values using some tolerance to decide whether audio needed re-decoding. I'm not against putting this kind of thing into pocketsphinx-ruby. It could be an alternative Decoder implementation. Some info in the README would probably also be useful.
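The tolerance comparison itself could be as simple as the sketch below. The helper name cmn_drifted? and the default tolerance of 5.0 are assumptions picked for illustration, not values taken from Pocketsphinx; the two vectors are the "from" and "to" lines of the log excerpt above.

```ruby
# Hypothetical helper: true when any cepstral coefficient moved by more than
# `tolerance` between two CMN vectors (arrays of floats). The default
# tolerance is an arbitrary placeholder and would need tuning.
def cmn_drifted?(previous, current, tolerance: 5.0)
  return true unless previous.length == current.length

  previous.zip(current).any? { |a, b| (a - b).abs > tolerance }
end

# Vectors taken from the "from" and "to" log lines quoted above.
before = [40.0, 3.0, -1.0] + [0.0] * 10
after  = %w[48.76 9.26 -4.89 16.89 -25.00 8.89 16.14 -2.51 -4.90 -9.73 4.75 -4.32 -1.49].map(&:to_f)

puts cmn_drifted?(before, after)
```

A decoder wrapper could re-decode the audio whenever this returns true and accept the hypothesis otherwise.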
@watsonbox Regarding -cmninit, if it is an initialization only, then I think that most would expect that calling Pocketsphinx::Decoder.new(Pocketsphinx::Configuration.default) would complete such initialization in a single step. If I'm understanding you correctly, the current implementation has two separate "initialization" processes, which is extremely confusing.
This is verified in the example above where decoder.configuration.changes returns different results after running a decode (even if the second initialization is actually happening just before the decode).
@watsonbox Thanks for the help. I removed -cmninit from the local gem's installation directory (located at /usr/local/Cellar/cmu-pocketsphinx/HEAD/share/pocketsphinx/model/en-us/en-us/feat.params) and then ran a decode to copy the console dump values that are appropriate for my machine the way you suggested above and did manage to get a successful decode on the first pass by executing the following before the first decode:
# Apply initialization values from my machine
decoder.configuration['cmninit'] = %w{60.34 9.26 8.85 -8.44 -20.02 3.01 -6.74 -2.74 -9.94 -1.46 1.94 0.64 9.29}.join(',')
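Copying those numbers by hand could be scripted. The sketch below assumes the decoder log has already been captured as a string (how to capture it is left open) and that it contains lines in the cmn_prior_update format quoted earlier in this thread; the helper name cmninit_from_log is hypothetical.

```ruby
# Hypothetical helper: pull the most recent "cmn_prior_update: to < ... >"
# vector out of captured log text and format it as a cmninit value.
# Returns nil when the log contains no such line.
def cmninit_from_log(log_text)
  vectors = log_text.scan(/cmn_prior_update: to\s*<([^>]*)>/)
  return nil if vectors.empty?

  # Each match is a one-element array holding the space-separated numbers.
  vectors.last.first.split.join(',')
end
```

The returned string can then be assigned to decoder.configuration['cmninit'] in the same way as the hand-copied values above.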
Based on my experience with this process, I'm more convinced that the default behavior for the gem would be greatly improved by a brute-force configuration as suggested above. This is mainly because although there's so much great work in this gem, the current out-of-the-box configuration doesn't really work as expected and gives the wrong impression of the otherwise powerful tools. It would suck if other developers were to pass it by because they were unaware of all the configuration nuances contained in this issue ticket.
I'd personally rather see an implementation that favors decoder accuracy and configuration clarity over speed by default, and then allows for developers to improve execution speed via optimizations (configuration files or monkeying around with cutting and pasting cmninit values from a console dump). To recap, I'd vote for the following default gem behavior:
What do you think?
@watsonbox
However, things do get a little confusing because immediately before the first decode these values are read from feat.params and override the configuration.
Why do you lazy init here? There are many points of failure during initialization and they must be reported in constructor I think.
Okay so I think there are two issues here:
ps_init actually happened on a call to Decoder.new. Note that there would still be a 'two-phase' initialization - one when creating a Configuration and one when creating a Decoder.
I've created #12 to track point 1, so this issue is only concerned with point 2.
I agree that this is an important issue and it was mentioned by our users frequently. I believe we can fix this in pocketsphinx itself, it just needs some work.
I'd suggest moving this issue to pocketsphinx.
@nshmyrev Yes I agree that the best solution would be to resolve this in Pocketsphinx itself.
In the meantime I've had a play with a CMNDecoder implementation which will repeat the decoding if the CMN values are not within a certain tolerance of the previous set. However, this is really just an experiment since any likely solution would need to address the same issues with #process_raw as used by the higher level SpeechRecognizers.
That would require pocketsphinx-ruby to cache each utterance for possible replay, which is not currently the case and leads me to think that this would be better done in the C library.
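To give an idea of what such caching would involve on the Ruby side, here is a small sketch. The UtteranceCache class is hypothetical and deliberately does not call into the decoder; the caller decides what to do with each replayed chunk.

```ruby
# Hypothetical buffer (not part of pocketsphinx-ruby): remembers the raw
# audio chunks of the current utterance so they can be fed through again.
class UtteranceCache
  def initialize
    @chunks = []
  end

  # Stores a copy of the chunk so later mutation of the caller's buffer
  # does not affect the cached audio.
  def <<(chunk)
    @chunks << chunk.dup
    self
  end

  # Yields every cached chunk in the order it was recorded.
  def replay
    @chunks.each { |chunk| yield chunk }
  end

  # Total cached size in bytes, useful for capping memory use.
  def bytesize
    @chunks.sum(&:bytesize)
  end

  # Drops the cached audio, e.g. once an utterance has been accepted.
  def clear
    @chunks.clear
  end
end
```

Memory grows with utterance length, which is one more reason this may be better handled in the C library.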
As a workaround, I prepended a pre-recorded audio clip (2 sec) to the beginning of the audio file I intend to decode. It works perfectly for me, with no need to calibrate CMN or set cmninit.
I'm seeing an unexpected behavior while processing a fixed audio file. The hypothesis will occasionally change each time I decode the same file. I'm not sure if this is intended behavior or byproduct of how the decoder works, or a configurable option (some sort of random/pseudorandom generator, noise-reduction, phonetic hash sorting issue, warm-up, etc).
Here's an example of what I'm seeing using a 16-bit, 16000Hz PCM Wave file containing the spoken word _"hello"_:
Anybody have any insight as to what might be happening?