[Bug]: JS SpeechSDK.AudioConfig.fromDefaultMicrophoneInput capturing Teams/Zoom call speaker sounds where as JAVA SpeechSDK.AudioConfig.fromDefaultMicrophoneInput not

ru4sam326 commented 2 months ago

What happened?

Hi Team,

I'm using JS SDK capturing the speech using SpeechSDK.AudioConfig.fromDefaultMicrophoneInput, If the teams/zoom call is going on through the desktop app, teams/zoom call other participants sounds coming from the speakers are coming through the above microphone input. I'm using 1.36.0 version.

Where as If i'm doing the same in JAVA with 1.37.0 version it is not capturing the Teams/zoom call other participants sounds coming from the speakers.

Please let me know how to resolve this in js.

Version

1.36.0 (Latest)

What browser/platform are you seeing the problem on?

No response

Relevant log output

Javascript code using:

const audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

this.recognizer = new SpeechSDK.SpeechRecognizer(this.speechConfig, audioConfig);
    this.recognizer.sessionStarted = (s: any, e: any) => {

    }
    this.recognizer.speechStartDetected = (s: any, e: any) => {
      console.log('speechStartDetected:', s);

    }

    this.recognizer.recognizing = (s: any, e: any) => {
      this.displayText = 'You are speaking...';
    };

    this.recognizer.recognized = async (s: any, e: any) => {
      console.log('recognized:', s, e);
}

JAVA code:

AudioProcessingOptions audioProcessingOptions = AudioProcessingOptions.create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_NONE);
        AudioConfig audioConfig = AudioConfig.fromDefaultMicrophoneInput(audioProcessingOptions);
        SpeechRecognizer speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);

        stopTranslationWithFileSemaphore = new Semaphore(0);

        speechRecognizer.recognizing.addEventListener((s, e) -> {
//          System.out.println("RECOGNIZING: Text=" + e.getResult().getText());
        });

        speechRecognizer.recognized.addEventListener((s, e) -> {
            if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
                System.out.println("RECOGNIZED: Text=" + e.getResult().getText());
            }
            else if (e.getResult().getReason() == ResultReason.NoMatch) {
                System.out.println("NOMATCH: Speech could not be recognized.");
            }
        });

ru4sam326 commented 2 months ago

Can u plz respond?

glharper commented 2 months ago

@ru4sam326 Thank you for using the JS Speech SDK, and writing this issue up. The Java Speech SDK includes an echo cancellation model that mitigates background noise, while the JS Speech SDK does not, which is why the discrepancy you've encountered exists. Implementation in the JS Speech SDK is TBD. A couple of options on your end:

limit audio devices to headsets
implement a "push to talk" button that mutes speaker out while on.

ru4sam326 commented 2 months ago

any example to use java api for browser side chat. We are building a chatbot listening to realtime speech to do the analysis, we went with js which is causing the noise as you mentioned.

So now how to stream the browser stream to java side speechsdk ??

glharper commented 2 months ago

If you are implementing this chatbot on Windows, there's an "Acoustic Echo Cancellation" setting you can turn on to see if the noise for JS is mitigated, see attached picture: Image 5-2-24 at 12 02 PM

For Java, if you can access the outgoing audio stream, presumably you can transform to a 16KHz 16-bit PCM stream and adapt the push stream code here to send it to the Java recognizer for recognition.

ru4sam326 commented 1 month ago

Hi Team,

Could you plz share some samples on sending the audio stream from browser to JAVA. Will be really helpful for us. Tried some but they are not working.

Approach tried:

JS Code:

 async initRecognition() {
    const stream = await navigator.mediaDevices.getUserMedia(
      {audio:true}
    );

    const options: RecordRTC.Options = {};
    options.type = "audio";
    options.mimeType = "audio/wav";
    options.timeSlice = 3000
    options.recorderType = StereoAudioRecorder
    options.numberOfAudioChannels = 1
    options.desiredSampRate=16000
    options.sampleRate=16000
    options.bitrate=16
    options.ondataavailable = async (blob:Blob)=> this.dataavailable(blob);

    const recorder = new RecordRTCPromisesHandler(stream,options);
    recorder.startRecording()

      async dataavailable(blob: Blob) {
    console.log('blob',blob)
     if(this.socket.OPEN=== this.socket.readyState){
       this.socket.send(blob)
     }

JAVA Code:

public void handleBinaryMessage(WebSocketSession session, BinaryMessage  message) throws Exception {
        byte[] arr = new byte[message.getPayloadLength()];
        message.getPayload().get(arr);
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("**********************", "******");
        speechConfig.setSpeechRecognitionLanguage("en-IN");

        System.out.println("Started before");
        Semaphore stopRecognitionSemaphore = new Semaphore(0);
        PushAudioInputStream pushStream = AudioInputStream.createPushStream();
        System.out.println("Started after");

        // Creates a speech recognizer using Push Stream as audio input.
        AudioConfig audioInput = AudioConfig.fromStreamInput(pushStream);

        SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, audioInput);
            recognizer.recognized.addEventListener((s, e) -> {
                if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
                    System.out.println("RECOGNIZED: Text=" + e.getResult().getText());
                } else if (e.getResult().getReason() == ResultReason.NoMatch) {
                    System.out.println("NOMATCH: Speech could not be recognized.");
                }
            });

Thanks, Samba

ru4sam326 commented 1 month ago

Hi @glharper

Ignore the above one. Able to stream the audio from browser to backend still unable to cancel the Acoustic echo, Could you please suggest.

glharper commented 1 month ago

@ru4sam326 Since you're using the Java Speech SDK, this question is better asked in the native Speech SDK repo. This repo is specifically for the JavaScript Speech SDK.

ru4sam326 commented 1 month ago

Thanks @glharper. Raised https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/2381, for reference. In case someone follows.

microsoft / cognitive-services-speech-sdk-js