microsoft / cognitive-services-speech-sdk-js

Microsoft Azure Cognitive Services Speech SDK for JavaScript

[Help Wanted] Issues Proxying Browser Transcription Stream to Microsoft Transcription Service #769

Closed LiGo666 closed 9 months ago

LiGo666 commented 10 months ago

Background

I am currently working on a web application where an interactive avatar communicates with users using natural language. The key feature of this application is to capture user speech in real time, transcribe it into text using speech-to-text services, and then use this text to generate responses which the avatar conveys back to the users.

The Problem

I have encountered challenges in implementing the speech-to-text component. My goal is to proxy the user's voice stream from the browser to Microsoft's Transcription Service. I chose this approach because, in my opinion, streaming the voice data from the browser directly to the speech service would not allow me to manage rate limits, and would cause the cross-site scripting, content security, and cookie-related issues that arise with direct browser-to-Microsoft service communication.

Versions

Node.js: v18.18.2

  "ejs": "^3.1.9",
  "express": "^4.18.2",
  "http-errors": "~1.6.3",
  "http-proxy-middleware": "^2.0.6",
  "microsoft-cognitiveservices-speech-sdk": "^1.33.1",
  "morgan": "~1.9.1",
  "request": "^2.88.2",
  "ws": "^8.14.2"

Host: Win11 / Azure Web App

Expected Behavior

The server should act as a proxy, handling both HTTP and WebSocket traffic from the browser and forwarding it to Microsoft's Transcription Service. This setup is intended to manage rate limiting and handle security concerns more effectively than a direct browser-to-Microsoft service connection.

Code

express routing:


// proxy HTTP requests to the speech-to-text service
const axios = require('axios'); // needed for the outbound request below

// NOTE: req.body must contain the raw audio bytes; mount a raw body
// parser such as express.raw({ type: 'audio/*' }) before this route,
// otherwise Express leaves req.body undefined.
router.all('/speech/recognition/conversation/cognitiveservices', (req, res) => {
  res.setHeader('Access-Control-Allow-Origin', req.headers.origin || req.headers.host);
  axios({
    url: 'https://westeurope.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=de-DE',
    method: req.method, // use the method of the incoming request
    headers: {
      'Ocp-Apim-Subscription-Key': subscriptionKey,
      'Content-Type': req.headers['content-type'] // use the content type of the incoming request
    },
    data: req.body
  })
    .then(response => {
      console.log('Speech-to-text response:', response.data);
      res.send(response.data);
    })
    .catch(error => {
      console.error('Speech-to-text error:', error);
      res.status(500).send(error.message);
    });
});

// proxy WebSocket requests to the speech-to-text service
const proxyMiddleware = createProxyMiddleware('/speech/recognition/conversation/cognitiveservices', {
  target: 'wss://westeurope.stt.speech.microsoft.com',
  changeOrigin: true,
  ws: true,
  pathRewrite: {
    '^/speech/recognition/conversation/cognitiveservices': '/speech/recognition/conversation/cognitiveservices/v1'
  },
  onProxyReqWs: (proxyReq, req, socket, options, head) => {
    proxyReq.setHeader('Ocp-Apim-Subscription-Key', subscriptionKey);
    // Set any other headers you need here
  }
});

router.use(proxyMiddleware);

frontend call, following the "Use continuous recognition" example:

const speechConfig = SpeechSDK.SpeechConfig.fromHost(new URL('http://localhost:3000/speech/recognition/conversation/cognitiveservices'));
const audioInConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

Current Behavior

When I test the connection to http://localhost:3000/speech/recognition/conversation/cognitiveservices with Postman, I get:

{
    "RecognitionStatus": "Success",
    "Offset": 0,
    "Duration": 200000,
    "DisplayText": ""
}

Testing with a WebSocket client, I also get something back that looks like it's coming from Microsoft:

WebSocket connection established
[HPM] Upgrading to WebSocket
[HPM] WebSocket error: Error: read ECONNRESET
    at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
  errno: -4077,
  code: 'ECONNRESET',
  syscall: 'read'
}
WebSocket connection closed
Continuous detection started
Transcription aborted: undefined
Transcription session ended

However, browser logging always shows:

Recognition started (transcriptionManager.js:55:13)
CANCELED: Reason=0 (transcriptionManager.js:111:19)
CANCELED: ErrorCode=4 (transcriptionManager.js:113:21)
CANCELED: ErrorDetails=Unable to contact server. StatusCode: 500, undefined Reason: SyntaxError: An invalid or illegal string was specified (transcriptionManager.js:114:21)
Recognition stopped.

Additional Thoughts

I wonder what a good-practice setup for my use case looks like. I am considering whether an Azure-managed API could be a more scalable and efficient solution: it could potentially provide better control over rate limiting and streamline the handling of transcription requests. However, I have no experience with this so far, and I first want to make sure which way is the right one to go.

Questions

  1. Are there best practices or recommended patterns for proxying such real-time data streams (HTTP and WebSocket) to Microsoft's Transcription Service?
  2. Is there a need for additional configuration or handling in our current setup to ensure reliable and secure communication between the browser, our server, and Microsoft's service?

Thank you very much for your assistance!

LiGo666 commented 10 months ago

Rubber duck debugging ...

const speechConfig = SpeechSDK.SpeechConfig.fromHost(new URL('http://localhost:3000/speech/recognition/conversation/cognitiveservices'));

I need to use ws:// here, not http:// ... However, I would still be very happy to get a brief guideline on the good-practice architecture.
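A minimal sketch of that fix, assuming the same local proxy (the SDK call is commented out because it needs the browser bundle of microsoft-cognitiveservices-speech-sdk):

```javascript
// Convert the proxy URL to the WebSocket scheme the SDK streams over.
const proxyUrl = new URL('http://localhost:3000');
proxyUrl.protocol = 'ws:';

// fromHost expects only the host portion, without the service path:
// const speechConfig = SpeechSDK.SpeechConfig.fromHost(proxyUrl);
```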

LiGo666 commented 10 months ago

I changed the architecture to an "Azure Managed API" and set one up pointing to:

wss://westeurope.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1

It's working fine with this code:

const fs = require('fs');
const speechSdk = require('microsoft-cognitiveservices-speech-sdk');

const customEndpoint = new URL('wss://teekenner-berater-api.azure-api.net');

const audioConfig = speechSdk.AudioConfig.fromWavFileInput(fs.readFileSync('test.wav'));
const speechConfig = speechSdk.SpeechConfig.fromEndpoint(customEndpoint);

speechConfig.speechRecognitionLanguage = 'de-DE'; // Set the language of the speech

const recognizer = new speechSdk.SpeechRecognizer(speechConfig, audioConfig);

recognizer.recognizeOnceAsync(result => {
    switch (result.reason) {
        case speechSdk.ResultReason.RecognizedSpeech:
            console.log(`Transcription: ${result.text}`);
            break;
        case speechSdk.ResultReason.NoMatch:
            console.log("No speech could be recognized.");
            break;
        case speechSdk.ResultReason.Canceled:
            const cancellation = speechSdk.CancellationDetails.fromResult(result);
            console.log(`Recognition canceled: ${cancellation.reason}`);
            if (cancellation.reason === speechSdk.CancellationReason.Error) {
                console.log(`Error details: ${cancellation.errorDetails}`);
            }
            break;
    }
});

I get back the transcribed text.

But when I use it with the following code in the frontend:

const speechConfig = SpeechSDK.SpeechConfig.fromHost(new URL(config.transcriptionService.endpoint));

        speechConfig.speechRecognitionLanguage = config.transcriptionService.speechRecognitionLanguage;
        const audioInConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

        this.recognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioInConfig);

I receive a lot of these errors in Firefox:

Firefox can’t establish a connection to the server at wss://XXXXXXXXX.azure-api.net/speech/recognition/conversation/cognitiveservices/v1?language=de-DE&format=simple&X-ConnectionId=DCBD93AAAAABF8A4F7FF95XXXXX44F.

any ideas?

I would be very thankful; I have been trying to get this live transcription working for a week.

Whoever gives me the final hint, I will PayPal money to a bar near you so you can grab a beer there.

glharper commented 10 months ago

@LiGo666, thank you for using Speech SDK and writing this up. The two APIs, fromHost and fromEndpoint, have subtly different behavior and should not be used interchangeably. fromHost expects just the hostname without the trailing path (so "https://foo.bar.com" but not "https://foo.bar.com/speech/recognition/conversation/cognitiveservices/v1"). fromEndpoint is the opposite, expecting the host plus path ("https://foo.bar.com/speech/recognition/conversation/cognitiveservices/v1" is okay).
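The distinction can be sketched like this (foo.bar.com is a placeholder host; the SDK calls are commented out since they need the SpeechSDK bundle):

```javascript
// fromHost: host only — the SDK appends the standard service path itself.
const hostUrl = new URL('wss://foo.bar.com');
// const configA = SpeechSDK.SpeechConfig.fromHost(hostUrl);

// fromEndpoint: the full URL, service path included.
const endpointUrl = new URL(
  'wss://foo.bar.com/speech/recognition/conversation/cognitiveservices/v1'
);
// const configB = SpeechSDK.SpeechConfig.fromEndpoint(endpointUrl);
```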

If you're still having trouble, would you mind taking a log using: SpeechSDK.Diagnostics.SetLoggingLevel(SpeechSDK.LogLevel.Debug);

And posting your output as a text file?

LiGo666 commented 10 months ago

Thank you Glenn,

I am trying to use only the hostname now:

const speechConfig = SpeechSDK.SpeechConfig.fromHost(new URL('wss://xxx-api.azure-api.net'));

In the Azure Managed API I configured this as the base URL: wss://xxx-api.azure-api.net

and this as the WebSocket URL: wss://westeurope.stt.speech.microsoft.com

I still get an HTTP error 400:

Firefox can’t establish a connection to the server at wss://xxx-api.azure-api.net/speech/recognition/conversation/cognitiveservices/v1?language=de-DE&format=simple&X-ConnectionId=B6648AEAF4F743FE8E1E4XXXXDC62EAB.

Managed API is defined like this:

<policies>
    <inbound>
        <base />
        <!-- Add subscription key to the request header -->
        <set-header name="Ocp-Apim-Subscription-Key" exists-action="override">
            <value>XXXXX489881XXXXXXfd901ab42XXXXX</value>
        </set-header>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>

glharper commented 9 months ago

This sounds like a configuration issue with Azure Managed API, along the lines of a port not being open (443 for wss). There doesn't seem to be anything specific to JS Speech SDK I can assist with here.

jpalvarezl commented 9 months ago

To keep our open issues list up to date, this item will be closed since it's been inactive and needs more information to proceed. Please file a new issue (and feel free to reference this one) if there's new information we can follow up on. Thank you!