Model test results - model 20240117

jwebmeister commented 10 months ago

Post test results + useful remarks here, ideally of both:

the new model (20240117), and
the base model (kaldi_model_daanzu_20211030-mediumlm),

using the same test data and the default Ready or Not grammar module.

Useful remarks include:

specific words or phrases that were consistently misrecognised
rate of false positives / false negatives
subjective opinion, which model works better or worse, and in which areas

Important instructions:

It's important to manually review, clean and update retain.tsv with the correct rules + text, see example workflow near the end of these instructions
See this YouTube video
Please include only "normal commands" in the test data; exclude "Freeze", etc.
- There's a script ./scripts/copy_retain_item_cmds_only.ps1 that can be used in PowerShell to copy only "normal commands" out of ./retain/ and into ./cleanaudio_cmds/
Please, if possible, use only the default _readyornot.py grammar module, or one with very minor modifications, i.e. no new words.
Example command to run the test: ./tacspeak.exe --test_model './cleanaudio_cmds/retain.tsv' './kaldi_model/' './kaldi_model/lexicon.txt' 4
There are a number of useful PowerShell scripts in the ./scripts/ folder related to cleaning up the retain.tsv and related .wav files.
A workflow I use for cleaning up the data after a play session:
- Open retain.tsv and go through each line, reviewing the rule and text
- At the same time, load into a playlist every .wav file in the ./retain/ folder in VLC media player on single file loop, pressing 'N' to move to next .wav as I read through each line of retain.tsv
- When there's a mismatch between the text vs the audio, but the rule is correct, I correct the text in retain.tsv to align with the audio.
- When there's a mismatch between the recognised rule (and/or option) and the audio, I either A) update both the rule + text manually, or B) delete the line in retain.tsv. For B, once I'm done reviewing, I first run list_wav_missing_from_retain_tsv.ps1 to make sure I'm deleting the right files, then run delete_wav_missing_from_retain_tsv.ps1. Option A is preferred, but hey, we're all busy and life is too short to spend cleaning all the data.
- If the audio is so stupidly vague or garbled that I can't understand with my own ears and brain what I'm saying, I delete the line in retain.tsv. Once I'm done reviewing, I first run list_wav_missing_from_retain_tsv.ps1 to make sure I'm deleting the right files, then run delete_wav_missing_from_retain_tsv.ps1.
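
The "list before delete" check in the workflow above can also be sketched in Python. This is a minimal sketch, not the repo's PowerShell script: it assumes each line of retain.tsv is tab-separated and that one of its fields is the .wav file name or path. `list_unreferenced_wavs` is a hypothetical helper name.

```python
from pathlib import Path, PureWindowsPath


def list_unreferenced_wavs(retain_dir, tsv_name="retain.tsv"):
    """Return sorted names of .wav files in retain_dir that no TSV line mentions.

    Assumption: each TSV line is tab-separated and one of its fields is the
    .wav file name or path. Nothing is deleted; this only lists candidates.
    """
    retain_dir = Path(retain_dir)
    referenced = set()
    with open(retain_dir / tsv_name, encoding="utf-8") as tsv:
        for line in tsv:
            for field in line.rstrip("\n").split("\t"):
                if field.lower().endswith(".wav"):
                    # Keep only the file name so that absolute and relative
                    # paths (with / or \ separators) both match.
                    referenced.add(PureWindowsPath(field).name.lower())
    wavs = (p.name for p in retain_dir.iterdir() if p.suffix.lower() == ".wav")
    return sorted(name for name in wavs if name.lower() not in referenced)
```

Calling `list_unreferenced_wavs("./retain")` returns the file names to double-check by ear before deleting anything.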

Example report:

0 incorrect commands out of 4 cmds (1 missions played), same result both models
5% WER, same result both models
new model more often picks up baby crying as "freeze", using "listen_key_toggle":-1, using USE_NOISE_SINK = True; also picked up in base model but not as often.
New model tended to pick up "red" as "gold" when wife was speaking
using default _readyornot.py without any modifications
'./kaldi_model/' is new model
'./kaldi_model_base/' is base model

('./kaldi_model/', './retain/retain.tsv', 'Command', 'WER', 'Overall -> 5.00 %+/- 9.55 %N=20 C=19 S=1 D=0 I=0')
('./kaldi_model/', './retain/retain.tsv', 'Command', 'CMDERR', {'cmd_not_correct_output': 0, 'cmd_not_correct_rule': 0, 'cmd_not_correct_options': 0, 'cmd_not_recog_output': 0, 'cmd_not_recog_input': 0, 'cmds': 4})
('./kaldi_model_base/', './retain/retain.tsv', 'Command', 'WER', 'Overall -> 5.00 %+/- 9.55 %N=20 C=19 S=0 D=1 I=0')
('./kaldi_model_base/', './retain/retain.tsv', 'Command', 'CMDERR', {'cmd_not_correct_output': 0, 'cmd_not_correct_rule': 0, 'cmd_not_correct_options': 0, 'cmd_not_recog_output': 0, 'cmd_not_recog_input': 0, 'cmds': 4})
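
For reading the WER lines: the figures above are consistent with WER = (S + D + I) / N, where N is the number of reference words and S, D, I are substitutions, deletions and insertions. The "+/-" value also matches a 95% normal-approximation interval on that rate. That second point is inferred from the numbers only and is not confirmed from the tacspeak source. `wer_from_counts` is a hypothetical helper name.

```python
import math
import re


def wer_from_counts(summary):
    """Compute WER and an approximate 95% half-width from an 'N=.. S=.. D=.. I=..' summary."""
    counts = {key: int(val) for key, val in re.findall(r"\b([NCSDI])=(\d+)", summary)}
    n = counts["N"]
    wer = (counts["S"] + counts["D"] + counts["I"]) / n
    # Assumption: the reported "+/-" is 1.96 * sqrt(p * (1 - p) / N).
    half_width = 1.96 * math.sqrt(wer * (1 - wer) / n)
    return wer, half_width


wer, half_width = wer_from_counts("Overall -> 5.00 %+/- 9.55 %N=20 C=19 S=1 D=0 I=0")
print(f"{wer:.2%} +/- {half_width:.2%}")  # prints "5.00% +/- 9.55%"
```

With only 20 words in the test set the interval is wider than the WER itself, so a same-WER result between two models on a set this small says very little either way.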

jwebmeister commented 9 months ago

Nvm, I was thinking maybe different types of doors were named/coded as a particular door type.

You can specify “wedge the door” or “wedge the trapped door”, just as one example. It’s all in the grammar module. I haven’t noticed it causing any issues in my testing though.

madmaximus101 commented 9 months ago

I will look at the grammar module more deeply for the proper words/phrases.

jwebmeister commented 9 months ago

I will test the blue/red cut-off audio thing as well.

@madmaximus101 don’t worry, I figured it out. It was my audio settings. I had a gate set up that was just slightly too slow and/or too high.

jwebmeister commented 9 months ago

The things I've gathered so far from reviewing your test data + videos @madmaximus101 :

"gold" and "hold" get misrecognized
- grammar module issue, new issue raised
model recognises some noise as commands, e.g. silence or random noise = "blue" or "freeze".
- Might be too small a vocabulary in the dataset, or excessive tuning. It might be an issue with the fine-tuning process for Kaldi models in general (as the SME advised), or with the specific fine-tuning process for Kaldi Active Grammar.
colours get misrecognised as another colour, e.g. "red" = "blue", "blue" = "red".
- Might be pronunciation within the model, or it might be the same as the issue above, i.e. recognising silence or cut-off audio (my stupid audio gate settings) as another colour. Needs more testing.
"mirror the door" misrecognised as "wedge the door"
"on me" misrecognised as "remove the wedge"
"on me" misrecognised as "pie room"
"gold on me" misrecognised as "gold halt"

@madmaximus101 can you please review and let me know what's missing?

madmaximus101 commented 9 months ago

I think that if you speak a command of any kind while looking at a door/entryway, or a suspect/teammate, then regardless of what you say it will execute whatever it thinks you said out of the commands available in that command menu at the time. "On me" being recognised as "pie room" might be one of those, unless "fall in" is available as a command in the command menu when looking at a door. I will check this to make sure.

I've had consistent misrecognitions with "on me", though not as much with my refined mic settings. "Fall in" pretty much works all the time. I can't remember it failing, apart from the random red/blue designation. Again, that doesn't happen as often now that I've refined my mic settings.

Testing E-LM on the postal map, I had quite a few misrecognitions on one door at the office where you often come across the corrupt "FBI officer". The door to that room was giving me all kinds of misrecognitions, even though the same commands had worked well beforehand. Odd. There was a dead suspect right near the door; unsure if that's another potential quirk.
https://www.youtube.com/watch?v=xKwEUjsPFo8

I have another video showing the same settings and the same mic settings, with more recognition failures, because I had been speaking/testing so much that I couldn't speak properly by the point I recorded the video lol.

I have quite a few vids now showing a few quirks. Unsure if you've seen them all. https://www.youtube.com/@Madmaximus101/videos

Idea: for further context and understanding, it might be good to send me a shared link with a timestamp to the video you watched whenever you see an issue. There might be some context I didn't explain properly.

Thought I'd point out something. The word "mirror": how does the model expect to hear it? Does the model expect a more American-sounding "Mirreerrr" or an Aussie "Mirraa"? The American "mirror", if spoken quickly, literally just sounds like "Mirrrrrrrer" with a buttload of R's lol.
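
One way to check what the model expects for "mirror" is to look the word up in the model's lexicon. This is a minimal sketch. It assumes ./kaldi_model/lexicon.txt follows the usual Kaldi layout of a word followed by its whitespace-separated phones, one pronunciation per line, and that has not been checked against the actual model files. `pronunciations` is a hypothetical helper name.

```python
def pronunciations(word, lexicon_lines):
    """Return every phone sequence listed for word in Kaldi-style lexicon lines."""
    found = []
    for line in lexicon_lines:
        parts = line.split()
        # Assumption: the first token is the word and the rest are its phones.
        if parts and parts[0].lower() == word.lower():
            found.append(parts[1:])
    return found


# Usage, assuming the path from the example test command:
# with open("./kaldi_model/lexicon.txt", encoding="utf-8") as f:
#     print(pronunciations("mirror", f))
```

If only one pronunciation comes back, that would suggest the model is biased towards that variant, although the acoustic model may still tolerate other accents.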

jwebmeister / tacspeak

Model test results - model 20240117 #23