microsoft / cognitive-services-speech-sdk-go

Go bindings for the Microsoft Cognitive Services Speech SDK
MIT License

How to get word-level timestamps for sentences in real-time speech recognition? #122

Open wxbool opened 7 months ago

wxbool commented 7 months ago

I'm experimenting with real-time speech recognition using the Go SDK. I've tested the basic example and would like to know how to receive word-level timestamps for sentences recognized in real time. I found the config.RequestWordLevelTimestamps() option in the SDK and enabled it, but the Recognizing / Recognized events still carry only the sentence-level recognition result, with no word timestamps.

The code is as follows:

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "github.com/Microsoft/cognitive-services-speech-sdk-go/audio"
    "github.com/Microsoft/cognitive-services-speech-sdk-go/speech"
    "os"
)

func LogDumpObjectToJson(data interface{}, prefixs ...interface{}) {
    if data != nil {
        jsonData, _ := json.Marshal(data)
        fmt.Println(prefixs, string(jsonData))
    }
}

func sessionStartedHandler(event speech.SessionEventArgs) {
    defer event.Close()
    fmt.Println("Session Started (ID=", event.SessionID, ")")
}

func sessionStoppedHandler(event speech.SessionEventArgs) {
    defer event.Close()
    fmt.Println("Session Stopped (ID=", event.SessionID, ")")
}

func recognizingHandler(event speech.SpeechRecognitionEventArgs) {
    defer event.Close()
    fmt.Println("Recognizing:", event.Result.Text)
    LogDumpObjectToJson(event, "event Recognizing : ")
}

func recognizedHandler(event speech.SpeechRecognitionEventArgs) {
    defer event.Close()
    fmt.Println("Recognized:", event.Result.Text)
    LogDumpObjectToJson(event, "event Recognized : ")
}

func cancelledHandler(event speech.SpeechRecognitionCanceledEventArgs) {
    defer event.Close()
    fmt.Println("Received a cancellation: ", event.ErrorDetails)
    fmt.Println("Did you set the speech resource key and region values?")
}

func main() {
    subscription := "YOUR_SUBSCRIPTION_KEY" // placeholder; never post a real key publicly
    region := "southeastasia"

    audioConfig, err := audio.NewAudioConfigFromDefaultMicrophoneInput()
    if err != nil {
        fmt.Println("Got an error: ", err)
        return
    }
    defer audioConfig.Close()
    config, err := speech.NewSpeechConfigFromSubscription(subscription, region)
    if err != nil {
        fmt.Println("Got an error: ", err)
        return
    }
    config.SetSpeechRecognitionLanguage("zh-CN")
    config.RequestWordLevelTimestamps()

    defer config.Close()
    speechRecognizer, err := speech.NewSpeechRecognizerFromConfig(config, audioConfig)
    if err != nil {
        fmt.Println("Got an error: ", err)
        return
    }
    defer speechRecognizer.Close()

    speechRecognizer.SessionStarted(sessionStartedHandler)
    speechRecognizer.SessionStopped(sessionStoppedHandler)
    speechRecognizer.Recognizing(recognizingHandler)
    speechRecognizer.Recognized(recognizedHandler)
    speechRecognizer.Canceled(cancelledHandler)
    speechRecognizer.StartContinuousRecognitionAsync()

    defer speechRecognizer.StopContinuousRecognitionAsync()

    bufio.NewReader(os.Stdin).ReadBytes('\n')
}

dargilco commented 7 months ago

@wxbool can you please enable Speech SDK logging, do a single run, and share the log here? Thanks! https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-use-logging The log will confirm whether word timing is part of the JSON recognition result in the WebSocket message sent from the service. If it is, you will need to access the raw JSON string and parse it yourself to get the word timing, since it does not look like we expose it in the result object. I'll try to get more info on how to do that.

dargilco commented 7 months ago

@wxbool this is an example of how you would do it in Java. I'm trying to see whether something similar can be done in Go. https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/java/jre/console/src/com/microsoft/cognitiveservices/speech/samples/console/SpeechRecognitionSamples.java#L122

dargilco commented 7 months ago

@wxbool please do this to get the JSON string from the recognition result object: result.Properties.GetProperty(common.SpeechServiceResponseJSONResult, "")

Let me know if that worked for you and you see the word-level timing there.