watson-developer-cloud / unity-sdk

:video_game: Unity SDK to use the IBM Watson services.

[speech-to-text] Word alternatives are never populated #275

kimberlysiva closed this issue 6 years ago

kimberlysiva commented 6 years ago

SpeechRecognitionResult has a word_alternatives field that is never populated. Also, this should probably be an array of WordAlternativeResults, instead of a single result as it's currently set up.
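
Roughly this shape (a sketch; class and field names follow this issue, other members and the SDK's serialization attributes are omitted):

public class SpeechRecognitionResult
{
    public bool final { get; set; }
    // Currently declared as a single WordAlternativeResults; the service
    // returns a list, so an array matches the JSON payload.
    public WordAlternativeResults[] word_alternatives { get; set; }
}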

mediumTaj commented 6 years ago

@kimberlysiva I have been updating Speech to Text parameters. Looks like in this case we never parsed out word_alternatives. I added to gh262-update-stt-parameters. Please try it out when you have the chance.

kimberlysiva commented 6 years ago

It's working great! My only quibble is that there is no way to turn it off. We may want an EnableWordAlternatives flag that can be used to omit this parameter. It should probably be disabled by default; let's keep the data lean unless a user needs it.

I don't seem to be getting interim results with this update... I might be missing something simple, but the default behavior of printing these results to the console has definitely changed.

Also, I'm guessing it's a work in progress but ExampleSpeechToText is outputting a bunch of compiler warnings, e.g. Assets/Watson/Examples/ServiceExamples/Scripts/ExampleSpeechToText.cs(32,23): warning CS0414: The private field 'ExampleSpeechToText._audioClip' is assigned but its value is never used

kimberlysiva commented 6 years ago

Ah-ha, here's the problem with interim results. SpeechToText line 917:

// Bails out of the whole result when word_alternatives is missing,
// which is true for every interim result.
IList iwordAlternatives = iresult["word_alternatives"] as IList;
if (iwordAlternatives == null)
    continue;

word_alternatives is only present in the final result, so we're cutting out early on all the interim results. My event handler is getting called, but without any results! Easy fix is to encapsulate that chunk in a null check, like we do for keywords.
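
Something like this (a sketch of the null-check approach, mirroring how keywords are handled):

IList iwordAlternatives = iresult["word_alternatives"] as IList;
if (iwordAlternatives != null)
{
    // parse word alternatives only when the (final) result includes them
}
// interim results now fall through and still reach the event handler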

mediumTaj commented 6 years ago

ah oops, good catch. I'll revise. Yes all of the warnings are temporary - commented code while I test new operations.

mediumTaj commented 6 years ago

Revised; please test when you have a chance. As for enabling word alternatives: there is a WordAlternativesThreshold that you can set to 0 to omit word alternatives. At some point the service abstraction defaulted it to 0.5f; I changed the default to 0.

kimberlysiva commented 6 years ago

Actually, if you set WordAlternativesThreshold to 0 you get a lot of results! That's basically saying no lower bound on the confidence score.

From the docs: "Specify a probability between 0 and 1 inclusive. No alternative words are computed if you omit the parameter."

mediumTaj commented 6 years ago

Ah ok, I revised and made WordAlternativesThreshold nullable and omitted by default.
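
The pattern is roughly this (a sketch; 'request' stands in for however the service call assembles its parameters, and word_alternatives_threshold is the service-side parameter name):

public float? WordAlternativesThreshold { get; set; } = null;

// Send the parameter only when explicitly set; when omitted, the service
// computes no word alternatives (per the docs quoted above).
if (WordAlternativesThreshold.HasValue)
    request["word_alternatives_threshold"] = WordAlternativesThreshold.Value;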

kimberlysiva commented 6 years ago

Haha, you have me nervous with that nullable, but it's probably fine. Maybe you've used them elsewhere? When working in Unity I have a constant voice nagging in the back of my head, "will this work on platform xyz?" My code ends up very plain-jane because of this ;)

Looks like there was an issue with nullables on iOS at one point, but it's been fixed: https://issuetracker.unity3d.com/issues/assigning-null-to-nullable-type-assigns-default-value-on-ios

Anyways, I digress! I've tested these changes and they work fine. Thank you so much for all your hard work on this, I'm really excited by how well speech-to-text is working now :)

mediumTaj commented 6 years ago

No worries! Actually, I'm gearing up for a release on the Asset Store. Having more eyes on this is great! If you want to contribute anything to the release, feel free to put up a pull request! That goes for you too, @RMichaelPickering. I know we need a better way to get audio to the service than the current ExampleStreaming.cs file.

RMichaelPickering commented 6 years ago

@mediumTaj I've created a new version of the Streaming example that downsamples the audio to 16 kHz and streams it in small chunks. I do call this a better way! If you have some time to help test it tomorrow on the Cloud side I'd be happy to share the code, but I'm afraid that I don't speak GitHub right now so you'd need to be willing to do a manual merge for me.

mediumTaj commented 6 years ago

@RMichaelPickering I would love to take a look at it - Send it over to me

kimberlysiva commented 6 years ago

@mediumTaj If you put this into a branch let me know, I'd love to test it too. Thanks @RMichaelPickering!

RMichaelPickering commented 6 years ago

I'm just testing the code now and finding the transcription results are not that great. Let me see if I've missed something obvious....

RMichaelPickering commented 6 years ago

I think the problem is that I've not included an anti-aliasing filter as part of my downsampler implementation. Looking into this now...

RMichaelPickering commented 6 years ago

@mediumTaj @kimberlysiva I haven't found a simple implementation of a low-pass filter that works for anti-aliasing before downsampling. Can either of you think of a way to use Unity's built-in audio Low Pass Filter without having to actually 'Play' the clip? I need a low-pass filter at a frequency of 8 kHz or just below.

For now I'll try just eliminating the second audio channel from the recording.
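
For reference, a minimal sketch of what I mean (naive 3:1 decimation from 48 kHz to 16 kHz, with a crude box-filter average standing in for a real anti-aliasing low-pass; a proper FIR filter would do better):

float[] Downsample48kTo16k(float[] input)
{
    // Average each group of three 48 kHz samples into one 16 kHz sample;
    // the averaging only roughly suppresses energy above 8 kHz (the new Nyquist).
    float[] output = new float[input.Length / 3];
    for (int i = 0; i < output.Length; i++)
    {
        int j = i * 3;
        output[i] = (input[j] + input[j + 1] + input[j + 2]) / 3f;
    }
    return output;
}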

RMichaelPickering commented 6 years ago

@mediumTaj It seems like audio streaming is simply not working. I'm not sure if there's something that is messing up the data stream in the SDK code, or in the Watson service itself. Perhaps it's in the code that implements this call: _listenSocket.Send(new WSConnector.BinaryMessage(AudioClipUtil.GetL16(clip.Clip))); ??

My code is now calling this with a clip consisting of 2,400 samples at a 48,000 Hz (48 kHz) sample rate, 1 channel.

I'll look at this code now.

kimberlysiva commented 6 years ago

Sorry @RMichaelPickering, I don't have much experience with audio in Unity. I was looking at Unity's speech-to-text sample; it seems they just send the clips at 44100 Hz.

https://bitbucket.org/Unity-Technologies/speech-to-text/src/b97d14e3735e20822a08a9ff4a676e471becffdb/Assets/SpeechToText/Scripts/Utilities/AudioRecordingManager.cs

This seems like a safe default; most devices should support it. I recommend we use Microphone.GetDeviceCaps to check if 22050 Hz is available, and if not select 44100 Hz. Yes, we'll be sending more data than required, but from what I can tell the Watson STT service is perfectly happy to accept this frequency. They may downsample on their end, but it's probably faster and more accurate than what we can come up with. As far as local performance goes, I wonder if sending more bits over the wire is faster than locally processing the audio. I'm just not sure!
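
A sketch of that fallback logic (ChooseRecordingFrequency is a hypothetical helper; Microphone.GetDeviceCaps reports 0/0 when the device supports any frequency):

int ChooseRecordingFrequency(string device)
{
    int minFreq, maxFreq;
    Microphone.GetDeviceCaps(device, out minFreq, out maxFreq);

    if (minFreq == 0 && maxFreq == 0)
        return 22050;                                // any rate supported
    if (minFreq <= 22050 && 22050 <= maxFreq)
        return 22050;                                // preferred rate available
    return Mathf.Clamp(44100, minFreq, maxFreq);     // nearest supported fallback
}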

Watson is charging per second of audio, not data amount (I believe), so we're not affecting the Watson bill here. We do affect the user's data usage on mobile, however.

I would like to see a solution to the one-second buffer that we currently have. That seems like low-hanging fruit. I haven't had a chance to look at your code yet, but if we can get that down to a smaller delay it would be great!

Finally, we want to keep all microphone code in the "example" space, I think. Users may want to use platform-specific solutions. For example, on Hololens I'm going to be tempted to use the MixedRealityToolkit streamer:

https://github.com/Microsoft/MixedRealityToolkit-Unity/blob/210c7486974533d671ced15a0f0e6819cb4f8bdb/Assets/HoloToolkit-Examples/Input/Scripts/VoiceChat/MicStream.cs

RMichaelPickering commented 6 years ago

@mediumTaj Does AudioClipUtil.GetL16 work for negative values?

I know we're trying to take a float value in the range of -1.0 to 1.0 and convert it to a 16-bit short integer. I'm not sure whether 16-bit LPCM uses a sign-magnitude format, or whether it should just be a short as implemented in C#, which would have the range of -32,768 to 32,767.
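
As far as I know, 16-bit LPCM samples are just signed two's-complement shorts, so negative values are expected. A sketch of the usual conversion (illustrative only; the SDK's actual AudioClipUtil.GetL16 may differ, and the byte order shown is little-endian, which is worth double-checking against what the service expects):

byte[] FloatToL16(float[] samples)
{
    byte[] bytes = new byte[samples.Length * 2];
    for (int i = 0; i < samples.Length; i++)
    {
        // Clamp to [-1, 1] and scale into the signed 16-bit range.
        short s = (short)(Mathf.Clamp(samples[i], -1f, 1f) * short.MaxValue);
        bytes[i * 2] = (byte)(s & 0xff);             // low byte first
        bytes[i * 2 + 1] = (byte)((s >> 8) & 0xff);  // high byte second
    }
    return bytes;
}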

RMichaelPickering commented 6 years ago

@kimberlysiva As far as I can tell, Unity actually has NO control of the sample rate. Certainly this is true on my Windows 10 development laptop, where it is set at the operating system level. I did the query on Microphone.GetDeviceCaps and for me the answer is minimum: 48000, maximum: 48000. This is because for my microphone the default is set in Windows at 48000. It appears that I could choose 44,100 instead but I'm assuming that if I make this change at the Windows device level, I'll see only 44100 as both min and max.

To your point though, I'm fairly certain that IBM does charge for both individual Watson API calls as well as aggregate amounts uploaded and downloaded. Thus, I believe that sending audio at a higher rate will in fact impact a user's bill, if only through the influence of the aggregate upload charges. (As you point out, there would be a double-whammy effect of this on mobile users, who would effectively get billed for data charges by both their carrier and IBM!)

The other thing to look at is that on my Windows 10 laptop, my microphone is recording in 'stereo' -- that is, it's recording two channels, each at a 48 kHz sample rate. I thought it would be trivial to pseudo-downmix to mono just by ignoring one channel, but as I mentioned above, I don't believe the streaming is working even in this way. I keep getting horrible transcription results!

kimberlysiva commented 6 years ago

@RMichaelPickering We could fall back on whatever frequency device caps allows, if our preferred frequencies aren't available. There should be a simple way to write this to catch all cases.

Looking at my Bluemix bill for STT, the only line items are duration, not data size. Maybe @mediumTaj can double check that frequency doesn't affect billing.

Are you sure the Unity microphone is returning a stereo clip? It's hard to find a clear answer from Unity, but I was under the impression this clip was always mono. I don't have a stereo mic to test with, unfortunately... The closest answer I can find is this:

https://gamedev.stackexchange.com/questions/148308/does-unitys-microphone-functionality-support-stereo-input

kimberlysiva commented 6 years ago

@RMichaelPickering Also, out of curiosity how does your mic perform with the Watson demo?

https://speech-to-text-demo.mybluemix.net (run that in Chrome if your browser doesn't support mic input)

Want to make sure this is a Unity issue and not a general microphone issue.

RMichaelPickering commented 6 years ago

@kimberlysiva @mediumTaj Please see my microphone configuration options as shown in Windows: [screenshot of Windows microphone settings]

RMichaelPickering commented 6 years ago

@kimberlysiva I just tried the demo and am getting fairly horrible results that way also. This is consistent with my prior tests: actual live streaming transcriptions are rubbish compared to the results from pre-recorded samples. I think this must be due to some fundamental issue with how capturing live audio from a microphone is working, but I'm mystified as to the cause.

Somewhere in the Speech To Text API documentation online I saw a note to the effect that if audio is streamed up to the service at a rate higher than the rate required by the language model, it will be automatically downsampled. I wonder how that downsampling is being accomplished? Perhaps there is an issue with Watson's downsampling implementation?

mediumTaj commented 6 years ago

@RMichaelPickering @kimberlysiva Frequency should not affect pricing. It follows this price structure:

First thousand minutes per month are FREE.
After that, per-minute pricing (USD) for monthly audio usage uses Graduated Tiered Pricing as follows:
$0.02 for minutes 1,001 - 250,000
$0.015 for minutes 250,001 - 500,000
$0.0125 for minutes 500,001 - 1,000,000
$0.01 for minutes 1,000,001 and up
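
For example, since each tier is priced separately, a month with 300,000 billable minutes would cost $0 for the first 1,000 minutes, 249,000 x $0.02 = $4,980 for minutes 1,001 - 250,000, and 50,000 x $0.015 = $750 for the remainder, about $5,730 in total.
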
kimberlysiva commented 6 years ago

@mediumTaj Thanks for double-checking!

@RMichaelPickering Ah, we probably should have started there :) If you're getting poor results from the web demo then you'll get poor results in Unity, no matter what we try. Is this a built-in microphone? I haven't had much luck with those. On my Mac I use a simple pair of Apple earpods, and on my Windows machine I use a really cheap but great transcription headset I picked up on Amazon. I think it's fair to ask any customers of your app to double check their microphone performance using that web demo before complaining about your app's performance!

To be clear, I've been getting pretty good results on all these platforms, and really great results once I started using a custom corpus. I think you should try another mic!

mediumTaj commented 6 years ago

I'm going to block out some time to look at how streaming is working here and whether there are any improvements we can implement. A much better developer than me wrote this implementation. I do agree that the buffer leads to a built-in half-second delay and want to investigate that further.
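
One direction worth trying (a sketch, not tested here: drain Unity's looping microphone buffer in small chunks instead of waiting for whole one-second clips; wrap-around handling is simplified and the fields live in a MonoBehaviour):

AudioClip _mic;
int _readPos;

void Start()
{
    _mic = Microphone.Start(null, true, 1, 22050);  // default device, 1 s looping buffer
}

void Update()
{
    int writePos = Microphone.GetPosition(null);
    if (writePos < _readPos)
        _readPos = 0;                               // buffer wrapped (simplified)
    int chunk = 1102;                               // ~50 ms at 22050 Hz
    if (writePos - _readPos >= chunk)
    {
        float[] samples = new float[chunk];
        _mic.GetData(samples, _readPos);
        _readPos += chunk;
        // convert to 16-bit LPCM and send over the websocket here
    }
}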

mediumTaj commented 6 years ago

@kimberlysiva @RMichaelPickering I tend to have poor results depending on the microphone. Usually I connect a Logitech webcam to the machine I'm testing on and get much better transcriptions.

RMichaelPickering commented 6 years ago

@kimberlysiva @mediumTaj I just tried capturing audio from my microphone using Audacity, and then uploading the sample to the Watson demo. It worked very well. However, when I tried to reproduce my earlier test using the 'Record Audio' button on the demo, it worked equally well. I'm not sure exactly what changed. One thing that is somewhat different about my audio recording is that Audacity can't seem to write audio to a file as raw 16-bit uncompressed LPCM, which is what we're trying to capture and stream in Unity. The closest option that I found is to save the file as WAV format with a 16-bit sample format.

RMichaelPickering commented 6 years ago

@kimberlysiva @mediumTaj FWIW, my development laptop is a very current 'gaming' model with an array microphone. I regularly use it for audio and video conference calls with other team members using Google Hangouts with no problems at all. I'm doing my testing in a reasonably quiet home environment while sitting with the laptop on a pillow in my lap. In short, this seems to be a very nearly ideal test case, especially considering that my actual Watson STT use case involves users who are potentially some distance away from the audio capture device, which will be embedded in a 'virtual assistant' device similar in concept to Amazon Alexa or Google Home. In these use cases, some 'far field' user voice interactions are to be expected, but I'm not even attempting to replicate such a challenging test case here.

On the other hand, it does need to be real-time capture, as waiting to finish recording every command simply won't cut it -- particularly because a user won't have a record button to press to start and stop each command! Getting minimal latency is also important or the user will get frustrated.

RMichaelPickering commented 6 years ago

@kimberlysiva In the interest of trying to replicate your 'pretty good results' that you're able to obtain without using a custom corpus, could you please share your settings? Are you doing 'live streaming' through Unity? What sample rate is being used and is the recording one or two channels? Are the results being sent through the Streaming - Websocket API of the Watson Unity SDK? Is any external processing or audio compression being performed? Is the audio being sent in 16 bit raw LPCM format? I'm afraid that I have no Mac on which to test, but I can try to ensure the settings for my Windows laptop microphone match yours as closely as possible, within the limits of my hardware. As another alternative, are you able to change the settings on your microphone to more closely match mine?

kimberlysiva commented 6 years ago

I just tested on my Windows machine with the latest develop branch. I'm testing the ExampleStreaming scene that comes with the SDK. My microphone supports a wide array of frequencies:

[screenshot of microphone frequency settings]

I've tested with 44100 Hz/16 bit and 48000 Hz/16 bit. In both cases Unity reported the min/max frequencies to be the same thing that I set in the OS. I left the _recordingHZ set to 22050.

Both frequencies seem to work equally well. I'd say it understands 90-95% of what I'm saying. This is a dictation-quality headset though, it's definitely going to perform better than a built-in mic.

I'm honestly not sure about the 16 bit raw LPCM part. I'm guessing Unity does a bit of processing to the mic input, but I'm not really sure what's coming out. I haven't changed any code, this is a straight pull of the develop branch.

RMichaelPickering commented 6 years ago

@kimberlysiva Thanks, that helped!

This is an interesting result. What's interesting to me is that you've left the _recordingHZ set to the default of 22050. I'm not trying to dispute your results, I just wonder how that works. Not one of the frequencies that your mic supports is actually 22050! Downsampling a 44,100 Hz recording by half would indeed give 22,050 Hz, but where is the code that does that? And how does it also work if the sample rate is set to 48,000? In that case, wouldn't the effective sample rate after downsampling be 24,000, not 22,050? It is also interesting that, for your mic, the default sampling rates all include 2 channels, like mine. Is the data being sent that way up to Watson, or is it somehow being downmixed to mono? I may also try setting _recordingHZ to 22050 in my hacked Streaming code and see what happens!

kimberlysiva commented 6 years ago

@RMichaelPickering I'm guessing that Unity is doing some work on the microphone input behind the scenes. Maybe we really are getting mono at 22050 Hz. There should be an easy way to check this...
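
Maybe something as simple as recording a short clip and logging what comes back (a sketch using standard Unity APIs; clip.frequency will just echo the requested rate, but clip.channels should reveal mono vs. stereo):

AudioClip clip = Microphone.Start(null, false, 1, 22050);
while (Microphone.GetPosition(null) <= 0) { }      // wait for recording to start
Debug.Log("frequency: " + clip.frequency + " Hz, channels: " + clip.channels);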

RMichaelPickering commented 6 years ago

@kimberlysiva That makes two of us!! Out of curiosity: you mentioned that you are getting about a 90-95% recognition rate. Are you also noticing about a one-second delay? As far as I can see, with the standard code from the development branch, that would be a best-case scenario, assuming zero network latency and zero Watson recognition time.

kimberlysiva commented 6 years ago

@RMichaelPickering Yep, there's around a one-second delay. It'd be great to improve that!

RMichaelPickering commented 6 years ago

@kimberlysiva For sure! I've done most of the hard work already, but unfortunately it requires knowing the actual audio sample rate -- or at least this seems to be the case, because I've not been getting anything like a decent result so far! When I get more time next week I'll see if using 22050 will be the right kind of magic...

mediumTaj commented 6 years ago

@kimberlysiva @RMichaelPickering I've been meaning to start a public Slack for the Watson Unity SDK. @kimberlysiva, can you send your email address? I think I have @RMichaelPickering's address.

mediumTaj commented 6 years ago

Closing this issue since it's resolved. We can continue this discussion on Slack or in https://github.com/watson-developer-cloud/unity-sdk/issues/279