twilio / media-streams

Quick start guides for configuring and consuming Twilio Media Streams

Converting a Twilio live stream from WebSocket to the audio format supported by Microsoft Azure using Node.js #162

Open adama19 opened 2 years ago

adama19 commented 2 years ago

Hello, I am trying to integrate the Twilio live media stream with Microsoft Azure STT in order to get a live transcription of the user's input. My problem at the moment is that I am unable to convert the payload to the WAV/PCM format supported by Azure. I saw a similar solution on this topic here (https://www.twilio.com/blog/live-transcription-media-streams-azure-cognitive-services-java), but it uses Java, while I am trying to do this with Node.js. Can you please help?

Below is the code I am using:

const WebSocket = require("ws")
const express = require("express")
const app = express();
const server = require("http").createServer(app)
const path = require("path")
const base64 = require("js-base64");
const alawmulaw = require('alawmulaw');
const wss = new WebSocket.Server({ server })

//Include Azure Speech service 
const sdk = require("microsoft-cognitiveservices-speech-sdk")
const subscriptionKey = '2195XXXXXXXXXXXXXXXXXX'
const serviceRegion = 'southeastasia'

// Hard code the variables 
//const variables = require("./config/variables")
const language = "en-US"

const azurePusher = sdk.AudioInputStream.createPushStream(sdk.AudioStreamFormat.getWaveFormatPCM(8000, 16, 1))
const audioConfig = sdk.AudioConfig.fromStreamInput(azurePusher);
const speechConfig = sdk.SpeechConfig.fromSubscription(subscriptionKey, serviceRegion);

speechConfig.speechRecognitionLanguage = language;
speechConfig.enableDictation();
const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

recognizer.recognizing = (s, e) => {
  console.log(`RECOGNIZING: Text=${e.result.text}`);
};

recognizer.recognized = (s, e) => {
  if (e.result.reason == sdk.ResultReason.RecognizedSpeech) {
      console.log(`RECOGNIZED: Text=${e.result.text}`);
  }
  else if (e.result.reason == sdk.ResultReason.NoMatch) {
      console.log("NOMATCH: Speech could not be recognized.");
  }
};

recognizer.canceled = (s, e) => {
  console.log(`CANCELED: Reason=${e.reason}`);

  if (e.reason == sdk.CancellationReason.Error) {
      console.log(`CANCELED: ErrorCode=${e.errorCode}`);
      console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
      console.log("CANCELED: Did you update the key and location/region info?");
  }

  recognizer.stopContinuousRecognitionAsync();
};

recognizer.sessionStopped = (s, e) => {
  console.log("\nSession stopped event.");
  recognizer.stopContinuousRecognitionAsync();
};

recognizer.startContinuousRecognitionAsync(() => {
  console.log("Continuous Reco Started");
},
  err => {
      console.trace("err - " + err);
      // recognizer is a const above; reassigning it would throw, so just close it
      recognizer.close();
  });

// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");

  ws.on("message", function incoming(message) {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case "connected":
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);

        break;
      case "media": {
        // Decode the base64 payload straight to a Buffer of raw 8-bit
        // mu-law bytes. (js-base64's decode() returns a UTF-8 string,
        // which corrupts binary audio.)
        const mulawData = Buffer.from(msg.media.payload, "base64");
        // Expand mu-law to 16-bit linear PCM; decode() returns an Int16Array.
        const pcmSamples = alawmulaw.mulaw.decode(mulawData);
        // Wrap the Int16Array's underlying bytes. Buffer.from(pcmSamples)
        // would truncate each 16-bit sample to a single byte.
        const pcmdata = Buffer.from(pcmSamples.buffer, pcmSamples.byteOffset, pcmSamples.byteLength);
        azurePusher.write(pcmdata);
        break;
      }
        break;
      case "stop":
        console.log(`Call Has Ended`);
        azurePusher.close()
        recognizer.stopContinuousRecognitionAsync()
        break;
    }
  });

})

app.post("/", (req, res) => {
  res.set("Content-Type", "text/xml");

  res.send(
    `<Response>
       <Say>
            Leave a message
       </Say>
       <Start>
           <Stream url="wss://${req.headers.host}" />
       </Start>
       <Pause length="60" />
    </Response>`
)
});

server.listen(8080, () => console.log("Listening on port 8080"));

Please help me convert the media payload, which arrives in mu-law format, to the PCM format supported by Microsoft Azure for speech-to-text transcription.
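If you'd rather not depend on a mu-law package, here is a minimal, dependency-free sketch of the conversion step. It implements the standard G.711 mu-law decode and assumes the same push-stream format as the snippet above (`getWaveFormatPCM(8000, 16, 1)`, i.e. 8 kHz, 16-bit little-endian, mono):

```javascript
// G.711 mu-law decode (CCITT reference algorithm): each 8-bit mu-law
// sample expands to a signed 16-bit linear PCM sample.
function mulawDecodeSample(muByte) {
  const u = ~muByte & 0xff;            // mu-law bytes are stored inverted
  let t = ((u & 0x0f) << 3) + 0x84;    // mantissa plus bias (0x84 = 132)
  t <<= (u & 0x70) >> 4;               // apply the 3-bit exponent
  return (u & 0x80) ? (0x84 - t) : (t - 0x84);
}

// Convert one Twilio media payload (base64-encoded mu-law) into a
// little-endian 16-bit PCM Buffer, twice the byte length of the input.
function mulawPayloadToPcm(payloadBase64) {
  const mulawBuf = Buffer.from(payloadBase64, "base64");
  const pcm = Buffer.alloc(mulawBuf.length * 2);
  for (let i = 0; i < mulawBuf.length; i++) {
    pcm.writeInt16LE(mulawDecodeSample(mulawBuf[i]), i * 2);
  }
  return pcm;
}
```

In the "media" case of the message handler this would be used as `azurePusher.write(mulawPayloadToPcm(msg.media.payload))`. Note the sample rate stays 8 kHz; only the bit depth changes, which matches the format declared on the push stream.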

imkhubaibraza commented 2 years ago

I'm also facing the same problem

sahilpal0 commented 10 months ago

I'm also facing the same problem

  • The transcription is accurate if I stream audio chunks from an audio file.
  • I get only a few random words if I stream audio chunks from Twilio calls.

I don't have much knowledge of low-level audio, but here is what I noticed: when I save Twilio's mu-law audio as a WAV file, it plays back perfectly, yet when I send that file's audio chunks to Azure for continuous recognition, it doesn't work. But if I first re-convert that WAV file to 16 kHz, 8-bit depth, mono through an external website and then feed it to Azure, it works perfectly. So what I am trying to say is that something is going wrong in our conversion: the audio seems fine and playable, but something is still missing.
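One Node.js pitfall that produces exactly this "plays fine but transcribes garbage" symptom: `Buffer.from(typedArray)` copies each *element* as a single byte (values truncated mod 256), so passing the `Int16Array` returned by a mu-law decoder through plain `Buffer.from` silently throws away the high byte of every sample. A small demonstration:

```javascript
// Two 16-bit PCM samples that do not fit in one byte each.
const samples = Int16Array.from([1000, -1000]);

// Element-wise copy: each 16-bit sample is truncated to one byte,
// and the result is half the size it should be.
const truncated = Buffer.from(samples);

// Correct: wrap the Int16Array's underlying bytes (little-endian PCM).
const pcm = Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength);

console.log(truncated.length);                       // 2 bytes of garbage
console.log(pcm.length);                             // 4 bytes of real PCM
console.log(pcm.readInt16LE(0), pcm.readInt16LE(2)); // 1000 -1000
```

The truncated version still contains *some* of the original signal (the low bytes), which is why the audio can sound vaguely speech-like when saved and played back while the recognizer extracts only a few random words.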

github-ai-user commented 19 hours ago

Any solution?