jabber-tools / cognitive-services-speech-sdk-rs

Apache License 2.0
Word/phrase level timestamp support possible? #2

Closed ghost closed 2 years ago

ghost commented 2 years ago

https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/665 Hi, I'd like to use word/phrase-level timestamps as shown in the issue above. Is there any possibility of supporting this?

adambezecny commented 2 years ago

hi,

The recognition result should contain duration and offset attributes, see here: https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_recognition_result.rs#L19-L20

Right now these are defined as strings (I should probably change them to proper types), but it should work. Did you try it? Does it return these attributes?

ghost commented 2 years ago

Yes, I can use the offset and duration of the entire utterance. However, I would like to get each word with its own offset and duration, as shown below.

{
    "Id": "791d3f8a724846f69e9d9256947d2479",
    "RecognitionStatus": "Success",
    "Offset": 500000,
    "Duration": 13000000,
    "DisplayText": "What's the weather like?",
    "NBest": [
        {
            "Confidence": 0.97660327,
            "Lexical": "what's the weather like",
            "ITN": "what's the weather like",
            "MaskedITN": "what's the weather like",
            "Display": "What's the weather like?",
            "Words": [
                {
                    "Word": "what's",
                    "Offset": 500000,
                    "Duration": 3900000
                },
                {
                    "Word": "the",
                    "Offset": 4500000,
                    "Duration": 1300000
                },
                {
                    "Word": "weather",
                    "Offset": 5900000,
                    "Duration": 2900000
                },
                {
                    "Word": "like",
                    "Offset": 8900000,
                    "Duration": 4600000
                }
            ]
        },

According to https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/665, if I call RequestWordLevelTimestamps and set OutputFormat to Detailed, I can get the word level timestamp. https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_config.rs#L245-L250 https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_config.rs#L324-L335

I called RequestWordLevelTimestamps and set OutputFormat to Detailed, but I could not get the NBest. So to get the NBest, we need to add the NBest field here. https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_recognition_result.rs#L19-L20
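For reading the numbers in the sample JSON: Offset and Duration are reported in 100-nanosecond units, so an Offset of 500000 is 50 ms. A minimal Rust sketch of the conversion follows. WordTiming and ticks_to_duration are hypothetical names used only for illustration (they are not types or functions from this crate), and the values are copied from the sample above.

```rust
use std::time::Duration;

/// Hypothetical holder for one entry of the "Words" array (not part of the crate).
struct WordTiming {
    word: &'static str,
    offset_ticks: u64,   // start of the word, in 100 ns units
    duration_ticks: u64, // length of the word, in 100 ns units
}

/// Convert a count of 100-nanosecond ticks into a std Duration.
fn ticks_to_duration(ticks: u64) -> Duration {
    Duration::from_nanos(ticks * 100)
}

fn main() {
    // Values taken from the sample JSON in this thread.
    let words = [
        WordTiming { word: "what's", offset_ticks: 500_000, duration_ticks: 3_900_000 },
        WordTiming { word: "the", offset_ticks: 4_500_000, duration_ticks: 1_300_000 },
        WordTiming { word: "weather", offset_ticks: 5_900_000, duration_ticks: 2_900_000 },
        WordTiming { word: "like", offset_ticks: 8_900_000, duration_ticks: 4_600_000 },
    ];

    for w in &words {
        let start = ticks_to_duration(w.offset_ticks);
        let end = start + ticks_to_duration(w.duration_ticks);
        println!("{:<8} {:.2}s - {:.2}s", w.word, start.as_secs_f64(), end.as_secs_f64());
    }
}
```

In real code the struct would be filled from the parsed JSON rather than hard-coded.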

adambezecny commented 2 years ago

hi

There is no need to enhance the SpeechRecognitionResult struct in any way. Just do exactly what they advise in the above-mentioned issue 665, i.e.:

  1. call request_word_level_timestamps on your speech config object
  2. call set_get_output_format with OutputFormat::Detailed on your speech config object
  3. in the recognized callback (available only in the recognized callback, not in the recognizing callback!), read the event result property like this: event.result.properties.get_property(PropertyId::SpeechServiceResponseJsonResult, "N/A")
  4. That's it! You will get the very same JSON as described above. I just tested it and it works fine.

Let me know if you have any problems with it. I just used one of the provided examples with the tweaks mentioned above to make this work.
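One detail about step 3: the second argument to get_property ("N/A" here) appears to act as a fallback that is returned when the property is not set, so it may be worth checking for it before handing the string to a JSON parser. A minimal sketch under that assumption follows. response_json is a hypothetical helper, not part of this crate, and the sample string is a shortened stand-in for the real response.

```rust
/// Hypothetical helper: treat the fallback value passed to get_property
/// (or an empty string) as "no detailed JSON available".
fn response_json<'a>(raw: &'a str, fallback: &str) -> Option<&'a str> {
    let trimmed = raw.trim();
    if trimmed.is_empty() || trimmed == fallback {
        None
    } else {
        Some(trimmed)
    }
}

fn main() {
    // Shortened stand-in for what the recognized callback might return.
    let detailed = r#"{"RecognitionStatus":"Success","NBest":[]}"#;

    match response_json(detailed, "N/A") {
        Some(json) => println!("detailed result: {}", json),
        None => println!("no detailed result on this event"),
    }

    // The fallback itself is reported as missing.
    assert_eq!(response_json("N/A", "N/A"), None);
}
```

Returning an Option keeps the "no JSON" case explicit instead of letting the placeholder string reach the parser.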

ghost commented 2 years ago

I tried the above method and got the desired result. Thank you so much!

adambezecny commented 2 years ago

glad to help, closing the issue now.