Buffering for deepgram is causing delayed transcripts

shanbhardwaj commented 3 months ago

If utteranceEndMs or endpointing is set for Deepgram, the finalized sentences are not received until the UtteranceEnd or speech_final=true marker reaches the feature server. This causes a delayed experience at the user's end.

This buffering here is causing us an issue. https://github.com/jambonz/jambonz-feature-server/blob/main/lib/tasks/gather.js#L929

davehorton commented 3 months ago

This is by design, and the change suggested would indeed introduce a bug. I suggest you add a partialResultHook to your gather, in which case you should receive all partial results as they come, and you can implement whatever logic you want.
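As a rough illustration of the suggestion above, here is a minimal sketch of a gather verb that sets partialResultHook. The helper name buildGather and the webhook paths /final and /partial are placeholders made up for this example; the recognizer options simply mirror the ones used elsewhere in this thread.

```javascript
// Hypothetical helper: builds a jambonz `gather` verb that asks for partial results.
// The hook paths are placeholders, not values taken from this issue.
function buildGather({ actionHook = '/final', partialResultHook = '/partial' } = {}) {
  return {
    verb: 'gather',
    input: ['speech'],
    actionHook,          // receives the final (possibly buffered) transcript
    partialResultHook,   // receives partial results as they arrive
    recognizer: {
      vendor: 'deepgram',
      language: 'en-US',
      deepgramOptions: { model: 'nova-2', utteranceEndMs: 1000 }
    }
  };
}

// A jambonz application is returned to the platform as an array of verbs.
const app = [buildGather()];
console.log(JSON.stringify(app, null, 2));
```

The application behind the partial hook is then free to apply its own logic to each partial result.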

shanbhardwaj commented 3 months ago

@davehorton Attached here are the logs for the following configuration


transcribe: {
  transcriptionHook: '/transcribe',
  recognizer: {
    vendor: process.env.JAMBONZ_STT_VENDOR,
    language: 'en-US',
    separateRecognitionPerChannel: true,
    interim: true,
    deepgramOptions: {
      model: 'nova-2',
      punctuate: true,
      redact: 'pci',
      diarize: true,
      numerals: true,
      utteranceEndMs: 1000,
      smartFormatting: true,
    },
  },
},

The logs are here freeswitch_logs.txt feature_server_logs.txt

Following are the transcripts that were received while the logs were taken. They show that the interim results are hard to use, as they don't match the buffered transcript.

The is_final property is always false for the interim results received from the jambonz feature server. It is set to true only when the buffered transcript is sent on utterance end.

is_final false
Interim Transcript: If you know your party's
speech_final:  false

is_final false
Interim Transcript: If you know your party's extension, you *
speech_final:  false

is_final false
Interim Transcript: Press * for
speech_final:  false

is_final false
Interim Transcript: Press * for support.
speech_final:  false

is_final false
Interim Transcript: Press * for bill
speech_final:  false

is_final false
Interim Transcript: Press * to speak
speech_final:  false

is_final false
Interim Transcript: Press * to speak with our retention
speech_final:  false

is_final false
Interim Transcript: Press * if
speech_final:  false

is_final false
Interim Transcript: Press * if you are looking to start an
speech_final:  false

is_final false
Interim Transcript: Press * if
speech_final:  false

is_final false
Interim Transcript: Press * if you need to cancel service.
speech_final:  false

is_final false
Interim Transcript: Press * to
speech_final:  false

is_final false
Interim Transcript: Press * to speak
speech_final:  false

is_final false
Interim Transcript: Press * to speak
speech_final:  false

is_final false
Interim Transcript: Press * to speak with the front desk.
speech_final:  false

is_final true
Interim Transcript: If you know your party's extension, you may enter that now Press 1 for sales Press 2 for support Press 3 for billing Press 4 to speak with our retention department Press 5 if you are looking to start an account with us Press 6 if you need to cancel service Press 7 to open a ticket Press 8 to speak with management Press 9 to speak with the front desk

davehorton commented 3 months ago

the logs actually show something different. I see that we are sending you a POST with interim transcripts after each of these partial sentences:

2024-06-04 05:36:10.800 TaskGather:_onTranscription - got transcript during continous asr
        {
          "confidence": 0.9996904,
          "transcript": "If you know your party's extension, you may enter that now"
        }
2024-06-04 05:36:11.977 HttpRequestor:request POST /transcribe succeeded in 256ms

..

2024-06-04 05:36:12.281 TaskGather:_onTranscription - got transcript during continous asr
        {
          "confidence": 0.99775404,
          "transcript": "Press 1 for sales"
        }
2024-06-04 05:36:13.539 HttpRequestor:request POST /transcribe succeeded in 256ms

..

2024-06-04 05:36:13.642 TaskGather:_onTranscription - got transcript during continous asr
        {
          "confidence": 0.99837804,
          "transcript": "Press 2 for support"
        }
2024-06-04 05:36:14.854 HttpRequestor:request POST /transcribe succeeded in 251ms

..

2024-06-04 05:36:15.043 TaskGather:_onTranscription - got transcript during continous asr
        {
          "confidence": 0.9983664,
          "transcript": "Press 3 for billing"
        }
2024-06-04 05:36:16.284 HttpRequestor:request POST /transcribe succeeded in 259ms

..

2024-06-04 05:36:17.426 TaskGather:_onTranscription - got transcript during continous asr
        {
          "confidence": 0.9996513,
          "transcript": "Press 4 to speak with our retention department"
        }
2024-06-04 05:36:18.620 HttpRequestor:request POST /transcribe succeeded in 253ms

etc..all the way to "press 9"

So it seems to me that you are in fact getting the transcripts in the chunks that you want. The is_final property will be false, yes, because we want to send a final complete utterance once we get the end-of-utterance event, which you asked for by setting utteranceEndMs. This is exactly what Deepgram has recommended we do.

For your needs, which are to receive each of the sentence fragments when they are finalized, regardless of whether the full user utterance is finalized, you should be able to process these interim results that it seems we are sending you. So the first thing is to recheck your websocket application to see if you are discarding these.
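To make "process these interim results" concrete, here is a small sketch of how a receiving application might read each transcription payload without dropping the ones marked is_final=false. The payload shape used here (a speech object with is_final and an alternatives array) is an assumption for illustration and should be checked against what your application actually receives; readTranscript is a made-up helper name.

```javascript
// Assumed payload shape (verify against your own logs):
//   { speech: { is_final: boolean, alternatives: [{ transcript, confidence }] } }
function readTranscript(body) {
  const speech = body && body.speech;
  const best = speech && Array.isArray(speech.alternatives) ? speech.alternatives[0] : undefined;
  if (!best) return null; // not a transcription event, or no alternatives present
  return {
    transcript: best.transcript,
    confidence: best.confidence,
    isFinal: speech.is_final === true
  };
}

// An interim result is kept and reported, not discarded.
const interim = readTranscript({
  speech: {
    is_final: false,
    alternatives: [{ transcript: 'Press 1 for sales', confidence: 0.99775404 }]
  }
});
console.log(interim);
```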

Secondly, I think I would recommend that you possibly not set utterance_end_ms, since it appears what you really want are sentence fragments that are finalized.

But the first thing is to figure out why you feel you are not receiving these interim results when the log clearly seems to indicate we are sending them.

davehorton commented 3 months ago

actually, it would be useful if you would redo this test, but with the feature-server log level at debug. You had them at info for the logs above

davehorton commented 3 months ago

and for this test, please do not set utteranceEndMs.

shanbhardwaj commented 3 months ago

@davehorton Attached here are the logs without the utteranceEndMs setting and interim: true

Also, we noticed that if the endpointing key is present in the options, we do not get is_final=true until the buffered transcript is received, even when the value is endpointing: false.

freeswitch_logs_interim_true.txt feature_server_logs_interim_true.txt

davehorton commented 3 months ago

@shanbhardwaj the provided logs for feature server were at info level. Please redo and submit debug logs

simmibadhan commented 3 months ago

@davehorton Please find attached the debug logs for the feature server, along with the freeswitch logs for a call made today. feature_server_logs_interim_true_debug.txt freeswitch_logs_interim_true_1.txt

davehorton commented 3 months ago

OK, so these transcripts look like you should be getting exactly what you are looking for. We are passing the transcripts on as we get them, and if you look at the final transcripts in the log summary below (bolded) you can see you are getting these with no buffering. Please let me know if you see a problem but you should be all set.

One note: I do think that enabling redaction is causing Deepgram to have increased latency in returning final transcripts. You might want to turn that off.

10:59:54.266 - get interim transcript (is_final=false, speech_final=false) "If you know your party's"
10:59:54.266 - send interim transcript to webhook app

10:59:55.287 - get interim transcript (is_final=false, speech_final=false) "If you know your party's extension, you * enter"
10:59:55.287 - send interim transcript to webhook app

10:59:55.907 - get final transcript (is_final=true, speech_final=true) "If you know your party's extension, you may enter that now."
10:59:55.907 - send final transcript to webhook app

10:59:56.928 - get interim transcript (is_final=false, speech_final=false) "Press * for"
10:59:56.928 - send interim transcript to webhook app

10:59:57.489 - get final transcript (is_final=true, speech_final=true) "Press 1 for sales,"
10:59:57.489 - send final transcript to webhook app

10:59:58.569 - get interim transcript (is_final=false, speech_final=false) "Press * for support."
10:59:58.569 - send interim transcript to webhook app

10:59:58.830 - get final transcript (is_final=true, speech_final=true) "Press 2 for support."
10:59:58.830 - send final transcript to webhook app

10:59:59.790 - get interim transcript (is_final=false, speech_final=false) "Press * for bill"
10:59:59.790 - send interim transcript to webhook app

11:00:00.251 - get final transcript (is_final=true, speech_final=true) "Press 3 for billing."
11:00:00.251 - send final transcript to webhook app

11:00:01.252 - get interim transcript (is_final=false, speech_final=false) "Press * to speak"
11:00:01.252 - send interim transcript to webhook app

11:00:02.252 - get interim transcript (is_final=false, speech_final=false) "Press * to speak with our retention"
11:00:02.253 - send interim transcript to webhook app

11:00:02.653 - get final transcript (is_final=true, speech_final=true) "Press 4 to speak with our retention department."
11:00:02.653 - send final transcript to webhook app

11:00:03.694 - get interim transcript (is_final=false, speech_final=false) "Press * if"
11:00:03.694 - send interim transcript to webhook app

11:00:04.554 - get interim transcript (is_final=false, speech_final=false) "Press * if you are looking to start an"
11:00:04.555 - send interim transcript to webhook app

11:00:05.215 - get final transcript (is_final=true, speech_final=true) "Press 5 if you are looking to start an account with us."
11:00:05.215 - send final transcript to webhook app

11:00:06.316 - get interim transcript (is_final=false, speech_final=false) "Press * if"
11:00:06.316 - send interim transcript to webhook app

11:00:07.197 - get interim transcript (is_final=false, speech_final=false) "Press * if you need to cancel service."
11:00:07.197 - send interim transcript to webhook app

11:00:07.637 - get final transcript (is_final=true, speech_final=true) "Press 6 if you need to cancel service."
11:00:07.637 - send final transcript to webhook app

11:00:08.618 - get interim transcript (is_final=false, speech_final=false) "Press * to"
11:00:08.618 - send interim transcript to webhook app

11:00:09.379 - get final transcript (is_final=true, speech_final=true) "Press 7 to open a ticket."
11:00:09.379 - send final transcript to webhook app

11:00:10.359 - get interim transcript (is_final=false, speech_final=false) "Press * to speak"
11:00:10.359 - send interim transcript to webhook app

11:00:11.320 - get final transcript (is_final=true, speech_final=true) "Press 8 to speak with management."
11:00:11.320 - send final transcript to webhook app

11:00:12.341 - get interim transcript (is_final=false, speech_final=false) "Press * to speak"
11:00:12.341 - send interim transcript to webhook app

11:00:13.302 - get interim transcript (is_final=false, speech_final=false) "Press * to speak to the front des3"
11:00:13.302 - send interim transcript to webhook app

11:00:13.402 - get final transcript (is_final=true, speech_final=true) "Press 9 to speak with the front desk."
11:00:13.402 - send final transcript to webhook app

11:00:20.367 - get interim transcript (is_final=false, speech_final=false) "If you know your party's"
11:00:20.368 - send interim transcript to webhook app

11:00:21.548 - get interim transcript (is_final=false, speech_final=false) "If you know your party's extension, you * enter it now"
11:00:21.549 - send interim transcript to webhook app

11:00:22.249 - get final transcript (is_final=true, speech_final=true) "If you know your party's extension, you may enter that now."
11:00:22.249 - send final transcript to webhook app

davehorton commented 3 months ago

@simmibadhan please respond on the ticket to the above observations

shanbhardwaj commented 3 months ago

@davehorton We are partly getting what we are looking for.

1. is_final=true: We need this for the finalized STT output for a time range (say 0-4 secs). We are getting this with interim results.
2. speech_final=true: We need this for the finalized end of speech input, which can be many sentences from 1. above, i.e. the combination of many is_final=true sentences. However, in our interim results we always have both is_final=true and speech_final=true. As a result we are losing the end of speech information.

When we integrate directly with Deepgram, using custom STT, we get speech_final=true on the endpointing interval that we used for end of speech detection. ref

When Deepgram identifies that an endpoint has been reached, the response is marked as speech_final: true. The only purpose of the speech_final marker is to tell you where an endpoint has been found, which indicates that a significantly long pause has been detected.

Also according to deepgram ref

When using endpointing with interim results, remember:

When speech_final is true, then is_final will always also be true.
When is_final is true, then speech_final may or may not be true.
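The two rules quoted above suggest one way to recover end of speech on the receiving side: keep every is_final=true segment and join them when speech_final=true arrives. Below is a minimal sketch of that idea. UtteranceAssembler is a made-up name, and the input objects are simplified stand-ins for transcription results, not the exact payload sent by jambonz or Deepgram.

```javascript
// Collects finalized segments and emits one utterance per detected endpoint.
// Input objects are simplified: { transcript, is_final, speech_final }.
class UtteranceAssembler {
  constructor() {
    this.segments = [];
  }

  // Returns the full utterance when an endpoint is reached, otherwise null.
  push(result) {
    if (!result.is_final) return null;         // interim text may still change, so skip it
    if (result.transcript) this.segments.push(result.transcript);
    if (!result.speech_final) return null;     // finalized segment, but speech continues
    const utterance = this.segments.join(' ');
    this.segments = [];
    return utterance;
  }
}

const assembler = new UtteranceAssembler();
const results = [
  { transcript: 'Press 1', is_final: false, speech_final: false },
  { transcript: 'Press 1 for sales.', is_final: true, speech_final: false },
  { transcript: 'Press 2 for support.', is_final: true, speech_final: true }
];
const utterances = results.map((r) => assembler.push(r)).filter((u) => u !== null);
console.log(utterances);
```

This only works if speech_final is sometimes false on a finalized segment, which is exactly the distinction described as missing above.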

A second way that Deepgram suggests to detect end of speech is using the UtteranceEnd message ref. We were using this with endpointing in the custom STT integration with Deepgram. However, we are not getting that marker now, and if we use the UtteranceEnd setting then we lose the interim results.

davehorton commented 3 months ago

In the example above, you did not set endpointing so you get the default endpointing value of 100 ms. You are then getting all of the deepgram transcripts as they come. You can look at the vendor property and get the exact transcript as it is sent to us. We are not changing anything. You can also change the endpointing value to a different value if you like.
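For anyone wanting to check the vendor property mentioned above, here is a sketch of pulling the original flags out of it. It assumes the property sits at speech.vendor with a name and an evt field holding the untouched Deepgram message, and that evt follows Deepgram's streaming result layout (is_final, speech_final, channel.alternatives). Treat those paths as assumptions to confirm against your debug logs; deepgramFlags is a made-up helper name.

```javascript
// Assumption: body.speech.vendor = { name: 'deepgram', evt: <raw Deepgram message> }
// Assumption: evt = { is_final, speech_final, channel: { alternatives: [{ transcript }] } }
function deepgramFlags(body) {
  const vendor = body && body.speech && body.speech.vendor;
  if (!vendor || vendor.name !== 'deepgram' || !vendor.evt) return null;
  const evt = vendor.evt;
  const alt = evt.channel && evt.channel.alternatives && evt.channel.alternatives[0];
  return {
    isFinal: evt.is_final === true,
    speechFinal: evt.speech_final === true,
    transcript: alt ? alt.transcript : ''
  };
}

// Illustrative payload built by hand for this example.
const sample = {
  speech: {
    vendor: {
      name: 'deepgram',
      evt: {
        is_final: true,
        speech_final: false,
        channel: { alternatives: [{ transcript: 'Press 3 for billing.' }] }
      }
    }
  }
};
console.log(deepgramFlags(sample));
```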

There is no bug that I see with regards to the is_final and speech_final properties, because we are sending exactly what deepgram is sending.

The only thing you are not getting is an Utterance End event forwarded to you. If you want to narrow the discussion to that, we can do that. If you still feel there is some bug with is_final and speech_final then you need to provide data that disproves the data I have provided you.

shanbhardwaj commented 3 months ago

We tested with endpointing set to 10 and to 500. Logs are attached here for both cases.

freeswitch_logs_interim_true_endpointing_10.txt feature_server_logs_interim_true_interim_10.txt feature_server_logs_interim_true_endpointing_500.txt freeswitch_logs_interim_true_endpointing_500.txt

By no means am I suggesting this is a bug, it's just the current behavior. Yes we can narrow the discussion to Utterance End event. We just want to solve our problem of getting end of speech one way or another.

davehorton commented 3 months ago

@shanbhardwaj please test the PR above #782

If you request interim transcripts, and you are using Deepgram, and you set utterance_end_ms (and don't enable continuous asr) you should get an additional event sent to you when we get an UtteranceEnd event from Deepgram.

This event will be distinguished from a transcription event because it will not have a speech property, instead it will have a speechEvent property.
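On the receiving side, the distinction described above could be handled with a check like the following sketch. Only the property names speech and speechEvent come from this comment; the contents of speechEvent are not described here, so the example uses an empty placeholder object, and classifyEvent is a made-up helper name.

```javascript
// Tells a transcription payload apart from the additional UtteranceEnd-style event.
// Only the presence of `speech` or `speechEvent` is inspected.
function classifyEvent(body) {
  const has = (key) => body != null && Object.prototype.hasOwnProperty.call(body, key);
  if (has('speech')) return 'transcript';
  if (has('speechEvent')) return 'speechEvent';
  return 'unknown';
}

console.log(classifyEvent({ speech: { is_final: false, alternatives: [] } }));
console.log(classifyEvent({ speechEvent: {} })); // placeholder contents
```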

Please test and send logs from the feature server side (at debug).

jambonz / jambonz-feature-server

Buffering for deepgram is causing delayed transcripts #772