microsoft / BotFramework-WebChat

A highly-customizable web-based client for Azure Bot Services.
https://www.botframework.com/
MIT License

Sample: Prime speech using a tap-to-dismiss splash screen on Safari #995

Closed rodmcleay closed 5 years ago

rodmcleay commented 6 years ago

I have a webchat control for a bot that is up and running and working well in Chrome. The linked article How to enable speech in Web Chat shows how to set this up, and we have done it exactly like this.

It mentions multiple browsers, but does not specify Safari in any way.

We need this working on an iPhone, but it just doesn't seem to work. There is not a lot of feedback from the browser: the icon changes and the microphone appears to turn on after access is approved.

Nothing spoken is recorded or recognized, and the text area of the bot stays empty; there is no 'Listening...' or any other indication it's working, other than the red microphone icon in the browser header. Clicking the icon mutes and un-mutes as you'd expect; it just doesn't seem to be connected to the webchat control in the browser.

All of my investigation appears to go around in circles.

  1. Has anyone achieved this with WebChat, or any other Direct Line component?
  2. Can anyone confirm it definitely doesn't work with Safari, so I can stop banging my head against it?
  3. Are there any alternatives to webchat that do work on iPhone/Safari?

Thanks for taking the time to read; any assistance would be much appreciated. I'm at the end of this investigation and pulling my hair out.

compulim commented 6 years ago

@rodmcleay, we have just tested it on an iPhone with iOS 11.4, running Safari and using Cognitive Services Speech. It works.

Can you check for a few things?

  1. Your iPhone is running iOS 11+
  2. You are using Safari, not Chrome or Edge app
  3. Settings app > Safari > Camera & Microphone Access is enabled
  4. Your web site is on HTTPS. Safari blocks microphone access on insecure HTTP
  5. You are running on an iPhone
    • We tested that it does not work on an iPod with iOS 11.4; we haven't tested iPad yet
  6. Your page is using Cognitive Services, not browser speech (a.k.a. WebSpeech API)

I agree we need to make the speech detection more robust and informative. But we also need to make sure detection doesn't pop up the "Access to Microphone" dialog too early. Unfortunately, in some cases, you can't have both.
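For what it's worth, presence checks alone don't trigger the dialog; it's the getUserMedia call itself that prompts. A minimal sketch of prompt-free detection (one possible approach, not Web Chat's built-in logic):

// A sketch of prompt-free capability detection; checking for these APIs
// does not raise the "Access to Microphone" dialog.
function detectSpeechSupport() {
    return {
        // Modern capture API; the prompt only appears once getUserMedia is called.
        microphone: !!(navigator.mediaDevices && navigator.mediaDevices.getUserMedia),
        // Browser speech synthesis (output side of the WebSpeech API).
        synthesis: 'speechSynthesis' in window,
        // Browser speech recognition (input side of the WebSpeech API).
        recognition: !!(window.SpeechRecognition || window.webkitSpeechRecognition)
    };
}

The catch is that presence doesn't guarantee the call will succeed (Safari's tap requirement discussed below is invisible to this kind of check), which is why you can't always have both early detection and accuracy.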

shubhamchawla commented 6 years ago

How do I get it working on Chrome for iOS? Any help would be greatly appreciated. Thanks in advance.

compulim commented 6 years ago

@shubhamchawla It doesn't work in Chrome for iOS because Chrome does not support WebRTC on iOS. The only browser on iOS that supports WebRTC right now is Safari.
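Since capture support can be feature-detected without prompting, one option is to hide the mic affordance where it isn't available. A hedged sketch (the wc-mic class is Web Chat v3's mic button, as used in the workaround further down; the rest is an assumption, not library code):

// Hide the mic button when the browser cannot capture audio,
// e.g. Chrome on iOS at the time of this thread.
var canCaptureAudio = !!(navigator.mediaDevices && navigator.mediaDevices.getUserMedia);

window.addEventListener('load', function () {
    var mic = document.getElementsByClassName('wc-mic')[0];

    if (mic && !canCaptureAudio) {
        mic.style.display = 'none';
    }
});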

rodmcleay commented 6 years ago

Hi @compulim, I don't mind the popup asking for access; that is understandable and expected. I'm using Cognitive Services, as per the code below, and it is on HTTPS, working fine in Chrome on Windows and on Android phones. The iPhone is on iOS 11.3.1.

const speechOptions = {
    speechRecognizer: new CognitiveServices.SpeechRecognizer({
        fetchCallback: (authFetchEventId) => getToken(),
        fetchOnExpiryCallback: (authFetchEventId) => getToken()
    }),
    speechSynthesizer: new CognitiveServices.SpeechSynthesizer({
        gender: CognitiveServices.SynthesisGender.Female,
        subscriptionKey: '@System.Configuration.ConfigurationManager.AppSettings["CognitiveKey"]',
        voiceName: 'Microsoft Server Speech Text to Speech Voice (en-US, JessaRUS)'
    })
};

Is that the config you would expect?

getToken is on the client at the moment.

function getToken() {
    // Normally this token fetch is done from your secured backend to avoid
    // exposing the API key, and this call would be to your backend, or it
    // would retrieve a token that was served as part of the original page.
    return fetch(
        'https://api.cognitive.microsoft.com/sts/v1.0/issueToken',
        {
            headers: {
                'Ocp-Apim-Subscription-Key': '@System.Configuration.ConfigurationManager.AppSettings["CognitiveKey"]'
            },
            method: 'POST'
        }
    ).then(res => res.text());
}

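For reference, a minimal sketch of the secured-backend variant the comment above alludes to (a hypothetical Express route; the endpoint name and environment variable are assumptions):

// server.js -- keeps the subscription key out of the browser
const express = require('express');
const fetch = require('node-fetch');

const app = express();

app.post('/api/speechtoken', async (req, res) => {
    // The key lives server-side in an environment variable.
    const tokenRes = await fetch('https://api.cognitive.microsoft.com/sts/v1.0/issueToken', {
        headers: { 'Ocp-Apim-Subscription-Key': process.env.COGNITIVE_KEY },
        method: 'POST'
    });

    res.send(await tokenRes.text());
});

app.listen(3000);

The client-side getToken then shrinks to fetch('/api/speechtoken', { method: 'POST' }).then(res => res.text()).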
rosskyl commented 6 years ago

I got it working with Safari and Firefox with the following JavaScript. I just include this in a JavaScript file while still using the linked CognitiveServices.js file from the CDN. I use the Bing speech recognizer and the browser speech synthesizer.

This works because the current version uses window.navigator.getUserMedia, which is being deprecated, so I change that to use window.navigator.mediaDevices.getUserMedia. Safari also has problems playing audio through the speech synthesizer programmatically, so I register an event on the microphone click that plays a sound from the speech synthesizer and then removes that event. Finally, Safari has problems recording audio programmatically as well, so I create the audio context before actually needing it and connect the processor. Safari doesn't allow recording audio, or playing audio with the speech synthesizer, unless it is a direct result of a touch or tap; this restriction even applies to the .then part of the promise returned from window.navigator.mediaDevices.getUserMedia.

I've tested this with the latest versions of Chrome, Firefox, and Edge on Windows 10, Chrome on Android, and Safari on an iPad Pro. The only browser I haven't gotten it to work on is Internet Explorer.

// Necessary for safari
// Safari will only speak after speaking from a button click
var isSafari = /^((?!chrome|android).)*safari/i.test(navigator.userAgent);

function SpeakText() {
    var msg = new SpeechSynthesisUtterance();
    window.speechSynthesis.speak(msg);

    document.getElementsByClassName("wc-mic")[0].removeEventListener("click", SpeakText);
}

if (isSafari) {

    window.addEventListener("load", function () {
        document.getElementsByClassName("wc-mic")[0].addEventListener("click", SpeakText);
    });
}

// Picks between the standard and the webkit-prefixed audio contexts
var AudioContext = window.AudioContext || window.webkitAudioContext;

var context;
var processor;

// Overrides the base constructor with a singleton-like structure, so the
// library reuses the context created in the tap handler below.
// Needed for Safari.
var BasePrototype = AudioContext.prototype;
AudioContext = function () {
    return context;
};
AudioContext.prototype = BasePrototype;

// Points the old-style getUserMedia at the new style that is supported in
// more browsers. The context and processor are created synchronously here,
// while still inside the user's tap, rather than in the promise callback.
window.navigator.getUserMedia = function (constraints, successCallback, errorCallback) {
    context = new BasePrototype.constructor();
    processor = context.createScriptProcessor(1024, 1, 1);
    processor.connect(context.destination);

    window.navigator.mediaDevices.getUserMedia(constraints)
        .then(function (e) {
            successCallback(e);
        })
        .catch(function (e) {
            errorCallback(e);
        });
};
compulim commented 6 years ago

@rosskyl this is a good hack, without the need to touch the Web Chat code.

Can you explain a little bit more about the synthesis part? Do you mean Safari requires a touch/tap for both the synthesis and recognition parts?

rosskyl commented 6 years ago

The first time you use either the speech synthesizer or the recognizer, it needs to be triggered by a user touch or tap. After the speech synthesis has been triggered once, I was able to get it to work without needing a touch or tap. Apple requires this to prevent the web page from automatically playing or recording audio, even though all of the other browsers allow it.

The speech synthesizer or recognizer will not work if they are triggered from a setTimeout or from the .then portion of a promise (which is what the newer version of getUserMedia uses). For getUserMedia, the AudioContext object must be created from the tap, and the processor created and connected from the tap. The recording can be done later.
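Distilled out of the workaround above, the unlock pattern looks roughly like this (the mic-button id is a hypothetical stand-in for whatever element receives the first tap):

// Prime both speech paths synchronously inside the first tap handler.
var unlocked = false;

document.getElementById('mic-button').addEventListener('click', function () {
    if (unlocked) return;

    // 1. Unlock synthesis: speak an empty utterance from the tap.
    window.speechSynthesis.speak(new SpeechSynthesisUtterance());

    // 2. Unlock capture: create the AudioContext and wire the processor
    //    here, not inside a .then() callback or a setTimeout.
    var Ctx = window.AudioContext || window.webkitAudioContext;
    var context = new Ctx();
    var processor = context.createScriptProcessor(1024, 1, 1);

    processor.connect(context.destination);

    unlocked = true;
    // The actual getUserMedia recording can safely start later.
});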

compulim commented 6 years ago

@rosskyl Thanks for the explanation. I totally understand the recognizer requirement for tap/touch, but it just feels weird to me for the synthesis part. I bet one doesn't need to tap/touch for WebAudio.

Anyway, it's Apple's requirement, so we need to work with it. 😉

rosskyl commented 6 years ago

You could try it without adding the event listener, but I couldn't get it to work without it. You could also write your own custom speech synthesizer and try it with WebAudio. I originally wrote my own that used the speech synthesizer, but it ended up with the same problem the BrowserSpeechSynthesizer had; I fixed it with the event listener and then figured out it worked with the BrowserSpeechSynthesizer as well.
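For the WebAudio route, the classic unlock idiom is to play a one-sample silent buffer from the tap; a hedged sketch (note that, as mentioned further down the thread, priming with a tone like this did not unblock the Cognitive Services synthesizer):

// Unlock WebAudio output by playing a silent buffer inside a tap handler.
function unlockWebAudio(context) {
    var buffer = context.createBuffer(1, 1, 22050); // one silent sample
    var source = context.createBufferSource();

    source.buffer = buffer;
    source.connect(context.destination);
    source.start(0);
}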

compulim commented 6 years ago

Thanks @rosskyl. I will make this a bug.

BTW, we are planning to polyfill the HTML WebSpeech API using Cognitive Services, so we don't need to maintain two different APIs, and we can bring Cognitive Services to platforms that do not support WebSpeech (e.g. Edge, desktop Firefox).

As always, we welcome contributions, and we will take quality projects as dependencies.

Anyway, note to bug fixer:

serpino commented 5 years ago

Hi @rosskyl, I am using the chat and, without using your JavaScript code, the voice conversation works correctly except on iOS.

If I add your code to the project, it gives me an error when I press the microphone. Can you please help me?

The error I get in Chrome is this:

export function __awaiter(thisArg, _arguments, P, generator) {
    return new (P || (P = Promise))(function (resolve, reject) {
        function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
        function rejected(value) { try { step(generator.throw(value)); } catch (e) { reject(e); } } // <-- the error points at this line
        function step(result) { result.done ? resolve(result.value) : new P(function (resolve) { resolve(result.value); }).then(fulfilled, rejected); }
        step((generator = generator.apply(thisArg, _arguments || [])).next());
    });
}
Uncaught (in promise) TypeError: Illegal invocation
    at MicAudioSource.TurnOn (MicAudioSource.ts:110)
    at MicAudioSource.Listen (MicAudioSource.ts:182)
    at MicAudioSource.Attach (MicAudioSource.ts:131)
    at Recognizer.Recognize (Recognizer.ts:97)
    at SpeechRecognizer.<anonymous> (SpeechRecognition.ts:153)
    at step (tslib.es6.js:91)
    at Object.next (tslib.es6.js:72)
    at tslib.es6.js:65
    at new Promise (<anonymous>)
    at Object.__awaiter (tslib.es6.js:61)

And in Firefox it is: TypeError: 'get state' called on an object that does not implement interface BaseAudioContext

I am using CognitiveServices. What could be failing?

Thanks

rosskyl commented 5 years ago

I believe that is because some of the internals of CognitiveServices changed. The following is what I currently use:

var isSafari = /^((?!chrome|android).)*safari/i.test(navigator.userAgent);

function SpeakText() {
    var msg = new SpeechSynthesisUtterance();
    window.speechSynthesis.speak(msg);

    document.getElementsByClassName("wc-mic")[0].removeEventListener("click", SpeakText);
}

if (isSafari) {

    window.addEventListener("load", function () {
        document.getElementsByClassName("wc-mic")[0].addEventListener("click", SpeakText);
    });
}

// Needed to change between the two audio contexts
var AudioContext = window.AudioContext || window.webkitAudioContext;

// Sets the old-style getUserMedia to use the new style that is supported in
// more browsers, even though the framework itself now uses the new style
if (window.navigator.mediaDevices.getUserMedia && !window.navigator.getUserMedia) {
    window.navigator.getUserMedia = function (constraints, successCallback, errorCallback) {
        window.navigator.mediaDevices.getUserMedia(constraints)
            .then(function (e) {
                successCallback(e);
            })
            .catch(function (e) {
                errorCallback(e);
            });
    };
}

I have this working for all of the major browsers on Windows, Android, macOS, and iOS.

serpino commented 5 years ago

@rosskyl This works much better, at least in the rest of the browsers.

I have already made it work on any mobile device.

In the end I did it the following way: when it detects that a response to the user's message has arrived, I call this function, passing the text of the response and the language in which it should speak.

function playMessage(msgText, locale) {
    var msg = new SpeechSynthesisUtterance();

    msg.text = msgText;
    msg.volume = 1;    // 0 to 1
    msg.rate = 1;      // 0.1 to 9
    msg.pitch = 1;     // 0 to 2, 1 = normal
    msg.lang = locale; // e.g. "en-US"

    speechSynthesis.speak(msg);
}

Apart from that, I do some other checks, such as whether the user is on a mobile device and whether the message came from the microphone or not. Thank you very much!
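A hedged sketch of what that branching could look like (the check names are assumptions; playMessage is the function above and isSafari is the regex test from earlier in the thread):

var isSafari = /^((?!chrome|android).)*safari/i.test(navigator.userAgent);

function onBotResponse(msgText, locale, cameFromMic) {
    // Only speak replies to spoken input, and only where Cognitive Services
    // synthesis fails (here approximated as Safari/iOS).
    if (cameFromMic && isSafari) {
        playMessage(msgText, locale);
    }
    // Elsewhere, the Cognitive Services speechSynthesizer configured in
    // speechOptions speaks the reply on its own.
}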

rosskyl commented 5 years ago

Just a note that this will only work for the browser speech synthesizer. It does not work for the cognitive services speech synthesizer.

I tried to prime it like above by creating an audio context and playing a tone, but that does not work. I can get the tone to play on the mic tap, but I can't get it to work programmatically.

serpino commented 5 years ago

@rosskyl Right. I only use this process when Cognitive Services does not work, which depends on the browser, so I choose between the two methods accordingly.

cwhitten commented 5 years ago

Closing due to lack of activity - see linked samples issues above.