Open jwebmeister opened 8 months ago
Edited with instructions/guidance on an example workflow for cleaning up the data after a play session, and a pointer to the helpful ./scripts/ folder, to make data cleaning slightly easier.
Currently testing the Experimental version. When speaking commands for fall in & arrest 'em/them/him/her, blue team will often be told to arrest, sometimes red, but mostly blue. If I speak extremely clearly and with proper wording this doesn't seem to happen, but I do have to be "like a TV presenter", enunciating words very properly. The Restrain command has less of this weirdness but it is still present. Again, like before, blue team will often be given the command unprompted, with red sometimes being told to do it. The restrain command is less prone to this though, which is interesting.
I have noticed saying c4 in this model doesn't really work anymore. If I say c2 it pretty much works as expected. In general I noticed a habit of blue being told to do commands when I just said the command without red or blue at the start of speaking.
I am attempting to compile a test run with PowerShell but I am getting an error: argument --test_model: expected 4 arguments. Edit: nvm... forgot the "4" lol
Awesome! Thanks for the feedback @madmaximus101. A few questions and comments below.
Currently testing the Experimental version. When speaking commands for fall in & arrest 'em/them/him/her, blue team will often be told to arrest, sometimes red, but mostly blue. If I speak extremely clearly and with proper wording this doesn't seem to happen, but I do have to be "like a TV presenter", enunciating words very properly. The Restrain command has less of this weirdness but it is still present. Again, like before, blue team will often be given the command unprompted, with red sometimes being told to do it. The restrain command is less prone to this though, which is interesting.
Do you have listen_key set to a loud or clicky key? I have a theory that it’s picking up some noise as “Blue” before you start speaking. It might also be retaining some audio before you press the listen_key, which would be a problem I’d need to fix in code if it’s the case (though I thought I already fixed it!). It could also be the model but I want to narrow down possibilities.
Can you try testing with listen_key_toggle=2, and try to speak the same commands that were having issues, speaking in the same manner, and see if the problem persists?
You shouldn't need to speak as a TV presenter for it to work accurately; if you do, there's something wrong.
I have noticed saying c4 in this model doesn't really work anymore. If I say c2 it pretty much works as expected.
Is C4 active in your grammar module? It isn’t by default. If it is a valid command in your grammar module, but it’s not being recognised, please confirm / let me know.
My listen on/off key is set to my mouse thumb button; it is not really noisy. I have noticed however that if I breathe or sigh, or if I'm typing away, it will recognise noises and attempt to decode them. If I don't want any listen padding at start and end, or any automatic voice on/off feature, which setting do I change?
Edit: will try listen_key_toggle 2.
Edit: I am using the experimental model as provided.
If I don't want any listen padding at start and end, or any automatic voice on/off feature, which setting do I change?
@madmaximus101 listen_key_padding_end_ms_max and _min are options you can change to set the amount of audio captured after releasing the listen_key.
There shouldn’t be any audio prepended before you press the listen_key (for listen_key_toggle 0 and -1), but if you’re sure it is prepending audio, let me know.
If listen_key_toggle is set to -1, it will always be listening for either YellFreeze or NoiseSink, so it’s fine for it to decode noises, as long as it doesn’t Yell without you saying “freeze” (or similar)… unfortunately it will likely yell at noises at least sometimes unless you have a very quiet environment and good mic. Just let me know if it’s truly unplayable and if it’s worse than the base model.
Listen key toggle 2 seems to be better; hot mic always on seems to be a much better experience overall: no random ghost or added-on commands, and a much higher success rate in general. I did have some misheard commands, either due to not being clear enough or, I assume, too-quick speaking. With some of the retained audio I've noticed I seem to have a tendency to breathe in or make an initial "opening mouth sound" as I click the hot mic button or just after. May have to learn to not do that lol. I also noticed my mic volume was way up. That might be a contributing factor also; potential for minor distortion of sound to ruin things etc. Will lower mic volume lol.
Edit: for reference my headset is the Sennheiser GSP 670. I'd say it's better than average quality for sure.
Here is a link to a gameplay session using listen key toggle 2, hot mic always on. Experimental Kaldi model as provided. Edit: as of right now the HD quality is still being uploaded. Text on screen will be hella blurry until it finishes (30-45 mins). https://www.youtube.com/watch?v=Mxzgd5aaR4Y
I am attempting to run the test in PowerShell. I think I'm running the command correctly & nothing is happening? The command runs, but I get no results, output, or files generated?
I am attempting to run the test in PowerShell. I think I'm running the command correctly & nothing is happening? The command runs, but I get no results, output, or files generated?
@madmaximus101 check in the retain.tsv, are the referenced file paths to the .wav files correct? e.g. ./cleanaudio_cmds/retain-123.wav
I downloaded the Tacspeak app & Kaldi model as-is. Haven't changed anything. If I missed an instruction regarding these needing file paths modified, I apologise.
Pic of my retain.tsv
I noticed in your PowerShell window the '. after test_model was grey. In my PowerShell window the '. after test_model is blue. Thought I'd point that out just in case that means anything.
Pic of my user settings.
@madmaximus101 You’re running the command with ./cleanaudio_cmds/retain.tsv whereas it should probably be ./retain/retain.tsv
It doesn't matter what the path is as long as:
- ./somedir/retain.tsv is a valid file and path
- the .wav file paths in the retain.tsv are valid files and paths.
Ahh, a simple filepath error. I copy-pasted the example command given without a second thought 😅.
Appreciate the patience & help mate.
On Tue, 23 Jan 2024, 10:05 pm Joshua Webb, @.***> wrote:
@madmaximus101 https://github.com/madmaximus101 You’re running the command with ./cleanaudio_cmds/retain.tsv whereas it should probably be ./retain/retain.tsv
It doesn’t matter what the path is as long as:
- ./somedir/retain.tsv is a valid file and path
- the .wav file paths in the retain.tsv are valid files and paths.
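(As an aside, those two conditions are easy to sanity-check with a small script. This is a sketch only: it assumes retain.tsv is tab-separated with the .wav path in the first column, which matches the example paths in this thread but should be verified against your own file.)

```python
# Sanity-check that retain.tsv exists and that every .wav it references exists.
# Assumption: tab-separated file, .wav path in the first column.
import csv
from pathlib import Path

def check_retain(tsv_path="./retain/retain.tsv"):
    tsv = Path(tsv_path)
    if not tsv.is_file():
        print(f"missing: {tsv}")
        return False
    ok = True
    with tsv.open(newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            wav = Path(row[0])
            if not wav.is_file():
                print(f"missing wav: {wav}")
                ok = False
    return ok
```

Running it prints each missing file, which would have pointed straight at the ./cleanaudio_cmds/ vs ./retain/ mix-up above.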
OK, I am at a point where I feel I am ready to start doing the initial collecting of data in a somewhat organised manner; will start collating data and using the scripts etc. Is there a map in particular you would like me to play, to reference for your own comparisons and make things easier? As few variables as possible etc. Any particular commands or ways of speaking you would like me to try?
@jwebmeister
Some initial notes & basic observations so far with some newly found quirks, yay! 😛
Lowering mic gain seems to have helped a lot with accuracy of words & with randomly detected noise attempting to be decoded (I had my mic gain set to silly levels in-game for some reason; I don't remember doing it or why I would lol? It is now at 100%). I no longer get the tapping of my keyboard, loud sighs, or mouth noises being picked up & Tacspeak trying to decode them. I apologise for missing the stupidly high mic gain levels in my earlier testing & wasting your time with that; my bad.
I am the only person in my house, along with 2 cats. I was using listen key toggle -1 when it was their dinner time; they were meowing right underneath my chair & the NoiseSink feature didn't detect it at all; there were no noises attempted to be decoded or false commands given.
Something of note: a weird quirk, I'm assuming with how Ready or Not is designed, with any Kaldi model I've used. Some of the speech commands were completely different from what I said. I should have realised this instantly as I've pointed this exact issue out before lol. Took me a bit to figure out what the F**K was happening; I was seriously HUH!?? I figured out these random commands, executed differently from what was spoken, were happening when I was accidentally looking through multiple doorways. The commands given were, I guess, Tacspeak's interpretation of what command I was attempting to give, from what was available to be given.
This quirk also happened when interacting with an ajar door. I first noticed it when attempting to give a command to mirror the door, not realising it was ajar. I then tested further & found that any door command requiring physical interaction or placement of a device (wedge, c2, mirror) on an ajar door would result in blue or red being given a command to stack up, or another basic command such as fall in or cover me, sometimes breach & clear, being executed.
Is there a way to make a command so that when a command for wedge, mirror, or c2 is given on an ajar door, the operator closes the door & if possible continues the intended command(s) given?
I have also noticed another door quirk, again I think due to Ready or Not's design. If I command blue team or red team to stack up on a door & then command the other team to stack up on the same door, it defaults the command to the team already stacked up on the door, so nothing happens. It also defaults any door-related command given whilst looking at the door to the team that is currently stacked up on said door. Example: blue team stack up; I then speak "red team breach and clear"; blue team breaches and clears.
Edit: in hindsight I realise I should have tested this with other commands such as fall in, on me, and cover me, commanding one team to do these whilst looking at the door the second team was stacked up on. Will update this comment with the result if you would like that.
Testing the listen key toggle 2 setting with the experimental Kaldi model, I've noticed that if I stop briefly then continue, it will give an unintentional command mid-sentence (blue team "slight pause" breach and clear). I am attributing this to my own cautious bias & my own learnt speech habits interacting with Tacspeak. When I speak in one fluid continuous sentence it does seem to work, although the listen key toggle 2 setting does sometimes pick up my random mouth noises when I sigh louder than normal, or if I make a "tutting noise" haha. This issue is very, very much reduced with mic gain now at 100%; basically almost a non-issue at this point. If I mucked up speaking a command (brain fart) with listen key toggle 2 & a command was given that I didn't intend, I would speak "fall in". Depending on my level of panic or quick speaking this sometimes worked & sometimes didn't. Listen key toggle 2, I found, requires you to keep your speech in check 😅
The listen key toggle -1 setting with the experimental Kaldi model also seems to have fewer unintentional noises detected with my mic gain now at 100%. If I mucked up a command (brain fart) I would let go of the mouse thumb button, which would lessen the impact of the error. I would then press the thumb button again & issue the fall in command to stop the command currently happening, which would come out correctly. I know there is a halt/cancel/stop command but my brain just thinks of fall in in the moment lol.
@jwebmeister NoiseSink seems to work as intended, with high accuracy in the few times it's activated since I corrected my mic gain to 100%. If I do make a noise detected by NoiseSink, such as a burp, a cough, or hitting the desk, it activates.
I will test NoiseSink with words & phrases a person might say in surprise, fright, disappointment, or anger.
@jwebmeister Newly found quirks aside, I am of the opinion the listen key toggle -1 setting is pretty good and pretty much working as intended... now that my mic gain is at appropriate levels; again, I apologise.
I will test this setting further whilst watching where I'm looking when giving said commands.
The "F word & F you" are often picked up as:
on_recognition (INFO): KaldiRule(16, ReadyOrNot_priority::YellFreeze) | drop Freeze!
I have some results of the testing here.
There was only one mistake out of the short run of commands I did here in-game, as a little test run to make sure things were running as they should.
I wasn't sure how to change this in the text files to reflect the result, so I will explain.
The command recorded "red team secure area"; I actually said "red team kick and clear". I said this a second time more clearly and it gave the correct command.
Thanks @madmaximus101
I wasn't sure how to change this in the text files to reflect the result, so I will explain. The command recorded "red team secure area"; I actually said "red team kick and clear". I said this a second time more clearly and it gave the correct command.
The "F word & F you" are often picked up as: on_recognition (INFO): KaldiRule(16, ReadyOrNot_priority::YellFreeze) | drop Freeze!
I don't think there's an easy fix, other than re-training the model, and my previous attempts to do just that didn't result in any improvements. However, current options or work-arounds are: setting listen_key_toggle to 0 or 2 in ./tacspeak/user_settings.py (1 also, but I personally don't recommend it).
Is there a way to make a command so that when a command for wedge, mirror, or c2 is given on an ajar door, the operator closes the door & if possible continues the intended command(s) given?
Not without first specifying via speech that the door is ajar, e.g. "wedge the ajar door" instead of "wedge the door". Similar to the multiple "door", "doorway", "hallway" issue, it's a problem more effectively solved from the game devs (Void) side of things, as implementing a workaround from tacspeak will reduce speech recognition accuracy. I'll consider revisiting this if there's no updates from Void that address some of the command menu quirks.
I have also noticed another door quirk, again I think due to Ready or Not's design. If I command blue team or red team to stack up on a door & then command the other team to stack up on the same door, it defaults the command to the team already stacked up on the door, so nothing happens. It also defaults any door-related command given whilst looking at the door to the team that is currently stacked up on said door. Example: blue team stack up; I then speak "red team breach and clear"; blue team breaches and clears. Edit: in hindsight I realise I should have tested this with other commands such as fall in, on me, and cover me, commanding one team to do these whilst looking at the door the second team was stacked up on. Will update this comment with the result if you would like that.
I thought I was going crazy; thank you, this explains quite a lot. I only ran into this issue when playing Ides of March (so far). If you've tested it, or are willing to test it, can you confirm the extent of it changing the team selection, and what commands it affects outside of just breach and clear?
Testing the listen key toggle 2 setting with the experimental Kaldi model, I've noticed that if I stop briefly then continue, it will give an unintentional command mid-sentence (blue team "slight pause" breach and clear). I am attributing this to my own cautious bias & my own learnt speech habits interacting with Tacspeak.
The same thing happens with my speech. If you change listen_key_toggle to 2, I suggest also changing vad_padding_end_ms to 250. Otherwise experiment with values for vad_padding_end_ms; this setting helps determine when enough silence has been detected to end the utterance, attempt recognition, and execute commands.
Here is a link to a gameplay session using listen key toggle 2, hot mic always on. Experimental Kaldi model as provided. Edit: as of right now the HD quality is still being uploaded. Text on screen will be hella blurry until it finishes (30-45 mins). https://www.youtube.com/watch?v=Mxzgd5aaR4Y
@madmaximus101 cheers for the video, it's extremely helpful.
Based on the video, Tacspeak and/or the experimental model isn't performing "good enough" imho (though I also need more test data). As you said, you're having to speak as a newscaster for it to be reliably accurate, and there were some commands spoken that were misrecognised for no good reason that I could determine, e.g. "on me" was recognised as "team remove wedge"!?
It failing to recognise C4, versus C2 (or written out as "c two"), is reasonable to me as C4 is not a valid command. Unless the grammar module has been explicitly changed to recognise "c four" as an option... then I definitely want to know about it.
In hindsight I realise there's probably too much manual effort required from testers to get good test data (as opposed to being an automatic process). For example, the test data will only show misrecognitions if the user manually cleans and updates the data, and the test data won't show failed recognitions (i.e. not recognised commands) unless the user mentally notes it or records the full play session. I haven't got any good ideas on how to fix this however.
Have you tested the base model? Does it do better / worse than the experimental model?
Long weekend coming up. I have fixed up my mic gain issue & will do more testing of both models to have a proper comparison. My posts were a tad jumbled & not really consistent haha.
Based on the video, Tacspeak and/or the experimental model isn't performing "good enough" imho (though I also need more test data). As you said, you're having to speak as a newscaster for it to be reliably accurate, and there were some commands spoken that were misrecognised for no good reason that I could determine, e.g. "on me" was recognised as "team remove wedge"!?
I might have given some unfair results, not knowing about the stupid mic gain levels & not really being aware of speech issues & quirks in my earlier posts/results. I will re-do my testing in a more thorough manner now that quirks & specific issues have been identified.
I thought I was going crazy; thank you, this explains quite a lot. I only ran into this issue when playing Ides of March (so far). If you've tested it, or are willing to test it, can you confirm the extent of it changing the team selection, and what commands it affects outside of just breach and clear?
My current idea for being most helpful to you atm, with the things needing further clarity or discovery, is recording video deliberately testing these issues/quirks to see what is possible/not possible/a quirk/an error etc, then giving a link to the video along with a description of how things went, as well as the results from testing the retain.tsv.
What sort of things are you looking for, or want cleaned, in regards to audio? I have a pretty quiet house as it's just me, so there are not often any random noises generated apart from maybe my own speech quirks and mouth sounds.
I am also thinking maybe I can put together an edited video of sorts comparing commands with different models: "same scenario, same commands, same doors; different model", switching between models as the video progresses. I can acquire some editing software easily.
Have you tested the base model? Does it do better / worse than the experimental model?
In general I do have a sense that the medium model has fewer errors & I feel I am able to talk normally without feeling the need to be cautious with my speech. The large model is even more so like that. I haven't used the bare-bones base model suggested on the main Tacspeak page in a while.
It failing to recognise C4, versus C2 (or written out as "c two"), is reasonable to me as C4 is not a valid command. Unless the grammar module has been explicitly changed to recognise "c four" as an option... then I definitely want to know about it.
When I breach & clear with the command "c4", with the medium model or the large model, it works pretty reliably. Unsure if this is because of pure luck & it consistently recognising "c4" as "c2", or whether the language model has some sort of deliberate word detection for that specific thing. I can't remember it not working, which is probably why I seem to have a habit of saying c4 instead of c2 lol.
In hindsight I realise there's probably too much manual effort required from testers to get good test data (as opposed to being an automatic process). For example, the test data will only show misrecognitions if the user manually cleans and updates the data, and the test data won't show failed recognitions (i.e. not recognised commands) unless the user mentally notes it or records the full play session. I haven't got any good ideas on how to fix this however.
I would be willing to learn things; I have always wanted to learn Python, just never had a reason to, and this piques my interest very much. I would also be willing to do some speech training; is this something I can help with? I've also noticed there is a training folder in the experimental Kaldi. Does that have something to do with the data collection & modifying/cleaning up data?
- Open retain.tsv
- Change "GroundOptions" to "BreachAndClear", for the highlighted error
- Change "red team secure area" to "red team kick and clear", for the highlighted error
- Save retain.tsv, overwriting the existing file
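(Those manual steps could also be scripted. A minimal sketch, assuming retain.tsv is tab-separated and that the wrong rule name and transcript each occupy a whole field; verify against your own file, and keep a backup, before overwriting anything.)

```python
# Script the manual retain.tsv corrections: replace whole-field values,
# e.g. a wrong rule name or a wrong transcript.
# Assumption: tab-separated file; rule name and spoken text are each one field.
import csv
from pathlib import Path

def fix_labels(tsv_path, replacements):
    """replacements: dict mapping wrong field value -> corrected value."""
    tsv = Path(tsv_path)
    with tsv.open(newline="", encoding="utf-8") as f:
        rows = [[replacements.get(field, field) for field in row]
                for row in csv.reader(f, delimiter="\t")]
    with tsv.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

# e.g. the two corrections listed above:
# fix_labels("./retain/retain.tsv", {
#     "GroundOptions": "BreachAndClear",
#     "red team secure area": "red team kick and clear",
# })
```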
Thanks for the info, this will def help with processing data on the next set of retained audio I gather, thank you!
Not without first specifying via speech that the door is ajar, e.g. "wedge the ajar door" instead of "wedge the door". Similar to the multiple "door", "doorway", "hallway" issue, it's a problem more effectively solved from the game devs (Void) side of things, as implementing a workaround from tacspeak will reduce speech recognition accuracy. I'll consider revisiting this if there's no updates from Void that address some of the command menu quirks.
Is there a way for a spoken command to be deliberately denied or stopped if what was spoken is very different from the executed command, say when someone might be accidentally looking through multiple open doorways? Possibly a system where Tacspeak automatically implements a stop command in a situation where there is a massive difference between the spoken & executed commands.
I have figured out how to change which model Tacspeak is actively using in the user_settings file. I was manually changing out the folders to do this lol. Does changing the user_settings file in this manner affect results or skew things?
My current method atm is to have entirely separate folders for each iteration/test/result/attempt using Tacspeak, to completely separate & have a visual indication of literal separation of datasets.
What sort of things are you looking for or want cleaned in regards to audio?
@madmaximus101 A direct comparison between the base model (I mean the medium lm model when I say base) and the experimental model. What works well in one but not the other is what I'm most concerned with. In regards to actual commands or gameplay, no idea, just everything, as much regular play as possible.
In general I do have a sense that the medium model has fewer errors & I feel I am able to talk normally without feeling the need to be cautious with my speech. The large model is even more so like that.
I need to quantify it, and I need to test it using other people's speech other than my own. Please if you can, run the tests on the same retained data using:
When I breach & clear with the command "c4", with the medium model or the large model, it works pretty reliably. Unsure if this is because of pure luck & it consistently recognising "c4" as "c2", or whether the language model has some sort of deliberate word detection for that specific thing.
The finetuning in the experimental model seems to have grossly skewed the word probabilities. This means that there's a larger difference between "c two" and other words, including "c four", in the experimental model than in the base model. This should make it both more accurate and more precise, but also less lenient.
I would also be willing to do some Speech training, is this something i can help with?
Not yet, otherwise we'd both be wasting our time. It's unfortunately not as easy as just tweaking the training values to try to balance it, so I need hard test data to focus in on where specifically the model is falling down, as an indicator of where part of the training process is falling down (this is my focus, much more so than just fixing the model).
At the end of this experiment, a very possible conclusion is that there's no practical benefit to finetuning the model (in fact I have SME advice saying exactly that), and that you'd need to train the model from scratch to see any real benefit. If this is the conclusion, hard test data would be of even greater benefit, as a model from scratch should be even more sensitive to the training process and data put into it.
I need to quantify it, and I need to test it using other people's speech other than my own. Please if you can, run the tests on the same retained data using:
- the experimental model, and
- the base (medium lm) model, and
- (optional, for extra credit) the large lm model.
OK, got it. It just clicked (lightbulb moment): the retained audio files don't change, the AI does. Makes sense.
Is there a way for a spoken command to be deliberately denied or stopped if what was spoken is very different from the executed command, say when someone might be accidentally looking through multiple open doorways? Possibly a system where Tacspeak automatically implements a stop command in a situation where there is a massive difference between the spoken & executed commands.
Issue #14; requires support/integration from Void.
Alternatively, for a flub while speaking, there could be a key phrase to just change the command action to noop (do nothing), e.g. "\<dictation> (s- | f-) I messed up". I deliberately haven't tried it or put it in because I believe it's very likely to negatively affect speech recognition accuracy, e.g. a valid command + some noise at the end = noop instead of a valid command. Having said that, it might be worth experimenting, I just have had other priorities.
I have figured out how to change which model Tacspeak is actively using in the user_settings file. I was manually changing out the folders to do this lol. Does changing the user_settings file in this manner affect results or skew things? My current method atm is to have entirely separate folders for each iteration/test/result/attempt using Tacspeak, to completely separate & have a visual indication of literal separation of datasets.
That's more effort than I put in! I've just been renaming the model folders, for no good reason, but the user_settings should work if you're running tacspeak.exe without additional arguments.
You don't need to change user_settings or folder names if you're running --test_model as you're already specifying which model directory to use in the arguments,
e.g. ./tacspeak.exe --test_model './retain/retain.tsv' './kaldi_model/' './kaldi_model/lexicon.txt' 4
OK, got it. It just clicked (lightbulb moment): the retained audio files don't change, the AI does. Makes sense.
Yep. Ideally playtest with each model for at least a few missions. Then run --test_model using each model on all of the data retained from the playtests (including the playtests where the same model wasn't used). Hopefully that makes sense.
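(A sketch of running --test_model for each model over the same retained data, following the example command above. The model folder names below are assumptions; adjust them to your local layout.)

```python
# Build (and optionally run) the --test_model command for each model
# over the same retained data. Folder names are assumptions.
import subprocess

MODELS = ["./kaldi_model_experimental/", "./kaldi_model_medium/", "./kaldi_model_large/"]

def build_cmd(model_dir, retain_tsv="./retain/retain.tsv", num_threads=4):
    # mirrors: ./tacspeak.exe --test_model <tsv> <model_dir> <lexicon> <num_threads>
    return ["./tacspeak.exe", "--test_model", retain_tsv,
            model_dir, model_dir + "lexicon.txt", str(num_threads)]

def run_all(models=MODELS):
    for model_dir in models:
        subprocess.run(build_cmd(model_dir), check=True)  # runs each test in turn
```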
That's more effort than I put in! I've just been renaming the model folders, for no good reason, but the user_settings should work if you're running tacspeak.exe without additional arguments.
It's more that I got annoyed with having to copy/paste/delete/change folder names to use Tacspeak with the model I wanted. This way I don't have to chop & change folder names or move folders around to use Tacspeak in-game with a different model lol.
Yep. Ideally playtest with each model for at least a few missions. Then run --test_model using each model on all of the data retained from the playtests (including the playtests where the same model wasn't used). Hopefully that makes sense.
All voice data collected during gameplay. Delete audio files containing mistakes or misspoken words/obvious errors, as well as the corresponding entry in the associated files. The earlier post where you gave further instructions on the retain.tsv will help with this. Will comment for further assistance if I get stuck on this again.
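(The "delete the audio file, then delete the corresponding entry" step could be semi-automated: after deleting the bad .wav files, drop any retain.tsv row whose file no longer exists. Same assumption as earlier: tab-separated, .wav path in the first column; verify against your own file, and keep a backup, before overwriting.)

```python
# After deleting bad .wav files, remove the retain.tsv rows that referenced them.
# Assumption: tab-separated file with the .wav path in the first column.
import csv
from pathlib import Path

def prune_deleted(tsv_path):
    tsv = Path(tsv_path)
    with tsv.open(newline="", encoding="utf-8") as f:
        rows = [row for row in csv.reader(f, delimiter="\t") if row]
    kept = [row for row in rows if Path(row[0]).is_file()]
    with tsv.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(kept)
    return len(rows) - len(kept)  # number of rows removed
```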
Am currently playing Ides of March; each video I record for basic at-face-value assessment will be using a different model.
The restrain command sometimes does not work correctly; I've had this issue with varying levels of error regardless of model. When it doesn't work correctly it's often followed by a move here command, or fall in. I believe this is caused by the actual restrain command only being issuable by mousing over a very particular spot on the NPC in question, as well as what I think is a distance-activated thing as well.
If you tell a team member to mirror the door, wedge the door, c2, gas etc., sometimes this command will designate red or blue to initiate the command instead of gold. At first I was like huh... this command would often be repeatable with the same result... then it hit me... the team that gets designated to fulfil the command are the only ones with said device. So of course it will either default the command to the team with the device, or it will just be designated as gold team, i.e. not a problem; will need to investigate this further to confirm.
The same can be said for removing devices. If gold team is the currently selected team & I issue a command to remove a device from a door & red or blue ends up being designated for the command instead, I figured out this is because red or blue most likely has the maximum amount of said devices in the tactical pouch/slot. So of course the "gold team" command I just issued will sometimes get issued as red or blue, i.e. not a problem! I will test this further to confirm.
If looking at a door & issuing "on me" or "fall in", the command breach & clear will be executed.
Have made 3 videos depicting E-LM, M-LM & B-LM. I almost went for editing the vids into one homogeneous vid, but my brain didn't like the idea after all lol. Will be uploading shortly with descriptions & general info on each vid's happenings & quirks.
All on the Ides of March map.
The erroneous red/blue designation of tasks seems to be limited to the E-LM model. Overall, my finding is that the M-LM & B-LM are much more stable speech-recognition-wise. Across the board there are missteps & wrong commands given; even with the M-LM & B-LM this can imo be attributed to not looking at the exact spot intended in the exact moment the command was given, i.e. not looking exactly at the spot to arrest a suspect, not looking exactly at a door, or accidentally looking through multiple doorways.
The E-LM model does seem to have a few errors - there is no denying that. What was/is the goal for the E-LM model? To have a custom bespoke speech recognition exactly/specifically designed for Ready or Not & Tacspeak? Smaller file size overall?
If there is some sort of specific design choice/pathway/idea for the E-LM I would be willing to brainstorm or help with further refining the idea. I'm definitely nowhere near your level of knowledge with coding though so I wouldn't be able to help with that aspect.
I do have a good problem-solving brain lol - fixing up cars, electrics, IT, networking (Unraid mostly), all self-taught etc. Just giving you context 😀
From what I understand of your previous comment, there's no point bothering with further speech training on the E-LM if it turns out it's a bust, i.e. if the backbone of the E-LM's AI speech recognition is too strict or not "flexible" enough in the first place?
I am now testing while taking into account, correcting, and/or fiddling with settings to adjust for my own bad habits in relation to Tacspeak - currently testing the experimental model.
Edit: I have a theory about an issue I'm currently testing. I think one of my bad habits is speaking either just as I press the speech button or just before it, potentially causing recognition issues, which I think could be the cause of the erroneous blue/red command designation. This doesn't seem to be as big or as noticeable an issue with the M-LM or B-LM.
Edit: I have changed the highlighted setting to see if it has a positive effect - shortening the amount of time before speech can be detected. Is this the correct setting for what I think it is?
Edit: I have noticed the E-LM has trouble with "mirror the door" in general, as well as "on me" and sometimes "fall in".
Potential solution to users not being proficient in correctly sorting/refining/cleaning & getting good data: upload the entire tacspeak folder to Google Drive with all data intact?
> Potential solution to users not being proficient in correctly sorting/refining/cleaning & getting good data: upload the entire tacspeak folder to Google Drive with all data intact?
No. I don’t need or want anyone to upload their speech data anywhere. I only need the overall test results and any specific findings on what words the experimental model gets wrong that the base model gets right.
> I have changed this setting highlighted to see if this has a positive effect - shortening the amount of time before speech is recognised. Is this the correct setting?
It shouldn’t have an effect, or at least not a positive one; that setting is related to the voice activity detector. There isn’t really a direct setting to intentionally capture audio before you press the listen_key.
> What was/is the goal for the E-LM model?
It’s a test of the model finetuning / training process, to figure out what part of the process needs to be adjusted and/or if it’s (or which areas are) worth further investment of time and effort.
There are a number of things I can try to address some of the issues already identified, but I need hard data to narrow it down to specifics, so that I’m not wasting my time. All of the potential fixes will take a great deal of time and effort, beyond what I’ve already put in.
There aren’t any design decisions to be made until the finetuning and training process + code is 100% “working”. The most helpful thing that can be provided right now is test data. After that I can prioritise tasks and put together a plan of attack; doing so before gathering and reviewing test data is a waste of time.
I am attempting to run the scripts to tidy up things - keep getting this error?
> I am attempting to run the scripts to tidy up things - keep getting this error?
The easiest workaround is to run PowerShell as administrator. Otherwise check out this article.
Make sure to run the relevant “list” script first before running any “delete” scripts, to make sure only the correct items will be deleted. There’s no undo with PowerShell.
Edit: also run the scripts from the same directory as tacspeak.exe and where the “retain” folder lives, e.g. ./scripts/some_script.ps1.
https://www.youtube.com/watch?v=3qDAMdt_v_k This is where I originally identified a consistent issue with "mirror the door", "fall in" and "on me" sometimes telling blue or red to do it, or sometimes executing a wedge command with red or blue. I've also noticed some quirks with trap commands & wedge commands with this model, especially if there is already a trap or a wedge on the door.
https://www.youtube.com/watch?v=1fxtZCWRs3w&t=635s This was a much smoother speaking experience, though again with some quirks of different commands being issued. Example: looking at a surrendered NPC & saying "move here", with the restrain command being given instead. This was one instance of me experimenting with what would happen if I said a different command from what was available in the command list. I also experienced some quirks where commands were seemingly correctly heard & executed but nothing happened, usually followed by me issuing a fall in command to "reset" the team so the command would be executed. I think this is due to issuing a command while they are in the middle of something, or while they are temporarily physically blocked from following the command.
https://www.youtube.com/watch?v=o4niN0lOiVg&t=211s Overall, I'd say this model is the most error-free & quirk-free experience. In this video you can see a clear example of telling the team to arrest the NPC but the move command being given. Also, a very clear example of giving a command through a doorway & the stackup command being given; sometimes this results in breach & clear.
I had a suspicion the arrest/restrain command wasn't being issued because I wasn't moused over the exact point where the restrain command can be given, & indeed it is a "mousing over the exact point required for restrain" issue. When moused over the correct point, restrain becomes the top of the menu instead of the door commands. There is currently no sub-menu to navigate to the restrain command if the door command menu is at the top. I feel this is a Ready or Not issue overall; I imagine people who play Ready or Not without a speech mod experience the same frustrations.
I thought I was going crazy, thank you, this explains quite a lot. I only ran into this issue when playing Ides of March (so far). If you've tested it, or are willing to test it, can you confirm the extent of it changing the team selection, and which commands it affects outside of just breach and clear?
I will get onto this and provide video & screenshots if possible. I will find a spot on Ides of March that can reproduce the error & then attempt it on other maps. I will test this with the other models also.
@jwebmeister
test_model_output_overall.txt, test_model_output_tokens.txt, test_model_output_utterances.txt (three sets attached, one per model)
If you would like, for further context I can edit the names of each audio file & take a screenshot so you have context for what the commands were in order. This way I can communicate what the audio files said vs what the test spits out. Are there any other results or data I can give that I'm unaware of?
It took me a while to get to this point. I decided it was easier to make clean audio from the start: no muck-ups, no mistakes, no verbal garbage, attempting to have no noise picked up and no accidental freeze or yell. This is harder than I thought haha, but I got there.
speaking "gold" sometimes will result in the command halt being given. Even when using the B-LM.
I have discovered the crux of the quirk relating to commanding one team to do something but the other team doing it instead. This was tested with the B-LM to make sure it was indeed a RoN issue.
https://www.youtube.com/watch?v=Yxb3NznJFi4
https://www.youtube.com/watch?v=WNpaVtaM72M
It seems that when red or blue is stacked up, that team "takes ownership" of the door, if that makes sense? So when looking at a door "claimed" by red or blue, the currently stacked team will be the team that follows the command, even if you specified the other team.
If gold team is stacked up, red or blue can be told to breach & the other team will back off.
I believe this quirk is limited to doors/doorways & hallways where a "hidden door" exists in the middle of a hallway, like on Ides of March.
https://www.youtube.com/watch?v=yvEQ_PVDoP0 I have found that if you tell red or blue team to breach & clear, that team will take ownership of the door/doorway/hallway until it has finished clearing the room, with some areas being quite large. To someone who isn't aware of this it can come across as WTF!?, which may contribute to the unexplained red/blue designation of tasks when looking in the direction of a door/doorway/hallway with a "hidden door". At this point I'd say there is a need for a mod to eliminate this "feature" entirely.
Maybe a mod that makes all commands available regardless of where you're looking or what team is doing what, in one big "command tree" that always stays the same? Hypothetically I could see this making Tacspeak usage & commands potentially quirk-free.
I've looked into how to set up the speech training stuff to add to my own Tacspeak. Wow... it's a lot.
@jwebmeister
My posts & findings/results have been rather sporadic & all over the place. Apologies for that; I know it probably wasn't too helpful for proper data.
You could consider my posts here a gradual journey of myself discovering & learning as I go.
@madmaximus101 thanks for testing bud. You’ve done infinitely more than anyone else!
I wrote a longer comment but lost it due to router / ISP shenanigans so I’m just going to dot point it below.
Were the txt files I provided of the 3 models helpful in some way? Or did I provide the data in the wrong manner?
> Were the txt files I provided of the 3 models helpful in some way? Or did I provide the data in the wrong manner?
@madmaximus101 They absolutely were, thank you.
The thing I noted was that there were only 33 commands (I average ~30-40 per mission), and that the medium and large LM models had 0% command errors, likely indicating only a single mission was run using just one of the models. I realise now I really need at least one mission run using the experimental model, and one mission run using the base model, with the tests run on the combined retained data. If that’s what you already did, my apologies; I just want to confirm that’s the overall result.
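For context on those result files: the WER figure is the standard word error rate, WER = (S + D + I) / N over substitutions, deletions, and insertions. A minimal sketch of the computation via edit distance follows; this is the textbook definition, not necessarily tacspeak's exact test implementation.

```python
# Standard word-error-rate sketch via Levenshtein edit distance over words.
# WER = (substitutions + deletions + insertions) / reference word count.
def wer(ref_words, hyp_words):
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(m + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[n][m] / max(n, 1)

# e.g. wer("blue team mirror the door".split(),
#          "blue team mirror door".split())  # one deletion out of five words
```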
Other than that, there are also “things of note” that aren’t covered by the automated tests and that you only get from play-testing. I want to make sure I’ve captured everything you’ve noted, make the job easier for myself while I review and re-review everything, and give you an opportunity to add anything else you might recall or want to highlight.
You're correct, I did run one mission with one model and ran the data through the test. One mission run with each model - got it.
From what I remember of earlier posts: go through and delete any data referencing NoiseSink, as well as any commands for yell or freeze, along with their associated audio files? I very rarely use the yell or freeze commands anyway. If there are any misrecognised commands with audio, correct them in the txt files.
I have upped my mic gain in-game from 100% to 110% to test whether my misrecognised commands are volume-related at all.
> Go through and delete any data referencing NoiseSink, as well as any commands for yell or freeze, along with their associated audio files? I very rarely use the yell or freeze commands anyway.
You shouldn't need to do this manually. "YellFreeze" and "NoiseSink" should already be excluded if you included the setting `retain_approval_func` in user_settings.py and set it to `my_retain_func`. Otherwise yes, delete YellFreeze and NoiseSink entries in retain.tsv + audio if they are being retained; there are PowerShell scripts available to help do this.
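For anyone following along, a minimal sketch of what such an approval function might look like in user_settings.py. The argument shape is an assumption here (an object exposing the recognised rule's name); check the actual user_settings.py template shipped with Tacspeak for the real signature.

```python
# Hypothetical sketch of a retain-approval function for user_settings.py.
# Assumes (not confirmed) that tacspeak calls it with an object carrying a
# `rule_name` attribute; return False to skip retaining that utterance's
# audio and its retain.tsv entry.
EXCLUDED_RULES = {"YellFreeze", "NoiseSink"}

def my_retain_func(recognition):
    # Keep only utterances whose rule is not in the excluded set.
    return getattr(recognition, "rule_name", None) not in EXCLUDED_RULES

retain_approval_func = my_retain_func
```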
> If there are any misrecognised commands with audio, correct them in the txt files.
Yes please. Both the text and the rule in retain.tsv.
> I have upped my mic gain in-game from 100% to 110% to test whether my misrecognised commands are volume-related at all.
In-game mic settings should have zero effect on Tacspeak. Windows Sound Settings and your physical mic gain (or interface if you use one) might affect things if it's nearly inaudible (or way too loud), but shouldn't if it's within normal range.
@madmaximus101 if you’re willing / able to test, can you try playtesting a mission, starting every spoken command with the correct team colour, then listening back to the retained audio to see if the colour gets cut from the audio? e.g. you say “blue team mirror the door”, but the retained audio is “team mirror the door”.
I’m not sure if this cut audio issue is a user issue (I pressed the listen_key too late) or a code issue, but I need further testing done and my machine is locked down at the moment.
Random side note: in testing I thought the model picked up silence as “blue” but listening back I could clearly hear “blue” spoken faintly, even though I was 99% confident I said nothing. I think I’m going crazy.
> In-game mic settings should have zero effect on Tacspeak. Windows Sound Settings and your physical mic gain (or interface if you use one) might affect things if it's nearly inaudible (or way too loud), but shouldn't if it's within normal range.
Actually... I do have one. The EPOS gaming app, for my Sennheiser GSP 670s. I do have a lot of minor static and low-level background noise. I wonder if the noise cancellation or mic enhancement features of the app will improve things. I will test the blue/red cut-off audio thing as well.
Will test and get back to you.
@jwebmeister holy s**t mate, you're not going to believe this... it worked, very well. Just to be doubly sure, I will re-download the experimental version... juuuust in case.
The inconsistencies with designating blue or red with mirror or wedge are still present, but very much reduced with my refined mic settings.
In this pic I've highlighted the audio file & the retain file reference. In the audio file, right at the very beginning, just before I talk, there is some... I dunno how to put it: very minor static, very minor distortion, almost like white noise in the background. I don't know how else to put it. Gold team was the current team & I said "mirror the door".
In this highlighted example there was no static or distortion/white noise. I also spoke slightly louder - not by much though.
But hey - overall it was a much improved experience! I will test with full noise cancelling & see what happens.
@jwebmeister
Very much improved experience. https://www.youtube.com/watch?v=pVZH5h6mr5s
Pending results from 100% noise cancelling, I may upload a second video and edit this comment showing its results as well.
Suggestion: is there mirror/wedge command weirdness due to the type of door? Does the spoken wedge/mirror command have anything to do with not specifying the type of door in the command, or does the command auto-assume a type of door, hence the weirdness? This red/blue weirdness is less common with the trap command. Just spitballing here; what I'm thinking probably isn't a thing if you're not having those issues.
> I will test with full noise cancelling & see what happens.
@madmaximus101 yep, please let me know how it goes. If it’s a significant improvement I’ll be surprised, but if so, it narrows down what I need to refine in the training data. There is some noisy data in the training dataset, but it’s not the whole dataset copied and perturbed like it is for speed and volume. Again, I’ll be shocked if it makes a significant difference.
> Suggestion: is there mirror/wedge command weirdness due to the type of door? Does the wedge/trap/mirror command being spoken verbally have anything to do with not specifying the type of door, if indeed the door is different? Probably not a thing if you're not having those issues.
I don’t know what specifically you mean. For it to select blue vs red? If the Tacspeak console says current team, or the correct spoken team, then it’s not an issue with the model. In general, if the Tacspeak console prints the right command, it’s not the model’s fault.
> I don’t know what specifically you mean. For it to select blue vs red? If the Tacspeak console says current team, or the correct spoken team, then it’s not an issue with the model. In general, if the Tacspeak console prints the right command, it’s not the model’s fault.
nvm, I was thinking maybe different types of doors were named/coded as particular door types. Don't think that's the case - my bad.
Post test results + useful remarks here, ideally of both the experimental model and the base model, using the same test data, and using the default Ready or Not grammar module.

Useful remarks include the kinds of notes shown in the example report below.

Important instructions:

- Review and correct `retain.tsv` with the correct rules + text; see the example workflow near the end of these instructions.
- There is a script, `./scripts/copy_retain_item_cmds_only.ps1`, that can be used in PowerShell to copy only "normal commands" out of `./retain/` and into `./cleanaudio_cmds/`.
- Use the default `_readyornot.py` grammar module, or very minor modifications, i.e. no new words.
- Run `./tacspeak.exe --test_model './cleanaudio_cmds/retain.tsv' './kaldi_model/' './kaldi_model/lexicon.txt' 4`
- There are scripts in the `./scripts/` folder related to cleaning up the retain.tsv and related .wav files.

Example workflow, either:

- Option A: open `retain.tsv` and go through each line, reviewing the rule and text. Open the `./retain/` folder in VLC media player on single file loop, pressing 'N' to move to the next .wav as I read through each line of retain.tsv, correcting `retain.tsv` to align with the audio. Remove any unwanted lines from `retain.tsv`, then when I'm done reviewing I run `list_wav_missing_from_retain_tsv.ps1` first to make sure I'm deleting the right files, then run the `delete_wav_missing_from_retain_tsv.ps1` script (option A is preferred, but hey, we're all busy and life is too short to spend cleaning all the data).
- Option B: remove any unwanted lines from `retain.tsv`, then when I'm done reviewing I run `list_wav_missing_from_retain_tsv.ps1` first to make sure I'm deleting the right files, then run the `delete_wav_missing_from_retain_tsv.ps1` script.

Example report:
"listen_key_toggle":-1
, usingUSE_NOISE_SINK = True
; also picked up in base model but not as often._readyornot.py
without any modifications('./kaldi_model/', './retain/retain.tsv', 'Command', 'WER', 'Overall -> 5.00 %+/- 9.55 %N=20 C=19 S=1 D=0 I=0') ('./kaldi_model/', './retain/retain.tsv', 'Command', 'CMDERR', {'cmd_not_correct_output': 0, 'cmd_not_correct_rule': 0, 'cmd_not_correct_options': 0, 'cmd_not_recog_output': 0, 'cmd_not_recog_input': 0, 'cmds': 4}) ('./kaldi_model_base/', './retain/retain.tsv', 'Command', 'WER', 'Overall -> 5.00 %+/- 9.55 %N=20 C=19 S=0 D=1 I=0') ('./kaldi_model_base/', './retain/retain.tsv', 'Command', 'CMDERR', {'cmd_not_correct_output': 0, 'cmd_not_correct_rule': 0, 'cmd_not_correct_options': 0, 'cmd_not_recog_output': 0, 'cmd_not_recog_input': 0, 'cmds': 4})