Closed shanbhardwaj closed 2 months ago
This is by design, and the change suggested would indeed introduce a bug. I suggest you add a partialResultHook
to your gather in which case you should receive all partial results as they come and you can implement whatever logic you want
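As a rough illustration of that suggestion, a gather verb with a partialResultHook might look like the sketch below. The hook paths are hypothetical placeholders, and the recognizer settings are only an example, not the configuration from this issue.

```javascript
// Sketch of a jambonz gather verb that also requests partial results.
// Both hook paths are hypothetical; point them at your own application.
function buildGather() {
  return {
    verb: 'gather',
    input: ['speech'],
    actionHook: '/gather-complete',       // hypothetical: receives the final result
    partialResultHook: '/gather-partial', // hypothetical: receives partial results
    recognizer: {
      vendor: 'deepgram',
      language: 'en-US',
    },
  };
}

// A webhook response is an array of verbs.
console.log(JSON.stringify([buildGather()], null, 2));
```

The application logic for deciding what to do with each partial result then lives in whatever handles the partialResultHook requests.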
@davehorton Attached here are the logs for the following configuration
transcribe: {
  transcriptionHook: '/transcribe',
  recognizer: {
    vendor: process.env.JAMBONZ_STT_VENDOR,
    language: 'en-US',
    separateRecognitionPerChannel: true,
    interim: true,
    deepgramOptions: {
      model: 'nova-2',
      punctuate: true,
      redact: 'pci',
      diarize: true,
      numerals: true,
      utteranceEndMs: 1000,
      smartFormatting: true,
    },
  },
},
The logs are here freeswitch_logs.txt feature_server_logs.txt
The following transcript log was captured at the same time as the logs above. It shows that the interim results are hard to use because they don't match the buffered transcript.
The is_final property is always false for the interim results received from the jambonz feature server. It is set to true only when the buffered transcript is sent on utterance end.
is_final false
Interim Transcript: If you know your party's
speech_final: false
is_final false
Interim Transcript: If you know your party's extension, you *
speech_final: false
is_final false
Interim Transcript: Press * for
speech_final: false
is_final false
Interim Transcript: Press * for support.
speech_final: false
is_final false
Interim Transcript: Press * for bill
speech_final: false
is_final false
Interim Transcript: Press * to speak
speech_final: false
is_final false
Interim Transcript: Press * to speak with our retention
speech_final: false
is_final false
Interim Transcript: Press * if
speech_final: false
is_final false
Interim Transcript: Press * if you are looking to start an
speech_final: false
is_final false
Interim Transcript: Press * if
speech_final: false
is_final false
Interim Transcript: Press * if you need to cancel service.
speech_final: false
is_final false
Interim Transcript: Press * to
speech_final: false
is_final false
Interim Transcript: Press * to speak
speech_final: false
is_final false
Interim Transcript: Press * to speak
speech_final: false
is_final false
Interim Transcript: Press * to speak with the front desk.
speech_final: false
is_final true
Interim Transcript: If you know your party's extension, you may enter that now Press 1 for sales Press 2 for support Press 3 for billing Press 4 to speak with our retention department Press 5 if you are looking to start an account with us Press 6 if you need to cancel service Press 7 to open a ticket Press 8 to speak with management Press 9 to speak with the front desk
The logs actually show something different. I see that we are sending you a POST with interim transcripts after each of these partial sentences:
2024-06-04 05:36:10.800 TaskGather:_onTranscription - got transcript during continous asr
{
"confidence": 0.9996904,
"transcript": "If you know your party's extension, you may enter that now"
}
2024-06-04 05:36:11.977 HttpRequestor:request POST /transcribe succeeded in 256ms
..
2024-06-04 05:36:12.281 TaskGather:_onTranscription - got transcript during continous asr
{
"confidence": 0.99775404,
"transcript": "Press 1 for sales"
}
2024-06-04 05:36:13.539 HttpRequestor:request POST /transcribe succeeded in 256ms
..
2024-06-04 05:36:13.642 TaskGather:_onTranscription - got transcript during continous asr
{
"confidence": 0.99837804,
"transcript": "Press 2 for support"
}
2024-06-04 05:36:14.854 HttpRequestor:request POST /transcribe succeeded in 251ms
..
2024-06-04 05:36:15.043 TaskGather:_onTranscription - got transcript during continous asr
{
"confidence": 0.9983664,
"transcript": "Press 3 for billing"
}
2024-06-04 05:36:16.284 HttpRequestor:request POST /transcribe succeeded in 259ms
..
2024-06-04 05:36:17.426 TaskGather:_onTranscription - got transcript during continous asr
{
"confidence": 0.9996513,
"transcript": "Press 4 to speak with our retention department"
}
2024-06-04 05:36:18.620 HttpRequestor:request POST /transcribe succeeded in 253ms
And so on, all the way to "Press 9".
So it seems to me that you are in fact getting the transcripts in the chunks that you want. The is_final property will be false, yes, because we want to send a final complete utterance once we get the end-of-utterance event, which happens because you set utteranceEndMs. This is exactly what Deepgram has recommended we do.
For your needs, which are to receive each of the sentence fragments when they are finalized, regardless of whether the full user utterance is finalized, you should be able to process these interim results that it seems we are sending you. So the first thing is to recheck your websocket application to see if you are discarding these.
Secondly, I think I would recommend that you possibly not set utterance_end_ms, since it appears what you really want are sentence fragments that are finalized.
But the first thing is to figure out why you feel you are not receiving these interim results when the log clearly seems to indicate we are sending them.
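One way to check whether the application is discarding interim results is to log every payload that reaches the transcription hook, tagged as interim or final. The sketch below assumes a payload shape of speech.is_final plus speech.alternatives[0].transcript; that shape is an assumption, not something confirmed in this thread, and the sample payloads are hypothetical.

```javascript
// Minimal sketch: tag each transcription payload as interim or final.
// Assumed payload shape (not confirmed by this thread):
//   { speech: { is_final: boolean, alternatives: [{ transcript, confidence }] } }
function summarizeTranscription(body) {
  const speech = (body && body.speech) || {};
  const alt = (speech.alternatives && speech.alternatives[0]) || {};
  return {
    kind: speech.is_final ? 'final' : 'interim',
    transcript: alt.transcript || '',
    confidence: typeof alt.confidence === 'number' ? alt.confidence : null,
  };
}

// Hypothetical payloads, loosely modeled on the transcripts in the log excerpt.
const samples = [
  { speech: { is_final: false, alternatives: [{ transcript: 'Press 1 for sales', confidence: 0.99 }] } },
  { speech: { is_final: true, alternatives: [{ transcript: 'Press 1 for sales Press 2 for support', confidence: 0.99 }] } },
];

for (const body of samples) {
  const { kind, transcript } = summarizeTranscription(body);
  console.log(`[${kind}] ${transcript}`);
}
```

If the interim lines never show up in the application's own log while the feature server log shows successful POSTs, the results are being dropped somewhere on the application side.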
Actually, it would be useful if you would redo this test with the feature-server log level at debug. You had it at info for the logs above.
And for this test, please do not set utteranceEndMs.
@davehorton Attached here are the logs without the utteranceEndMs setting and with interim: true.
Also, we noticed that if the endpointing key is present in the options, we do not get is_final=true until the buffered transcript is received, even if the value is endpointing: false.
freeswitch_logs_interim_true.txt feature_server_logs_interim_true.txt
@shanbhardwaj the provided logs for feature server were at info level. Please redo and submit debug logs
@davehorton Please find attached the debug logs for the feature server, along with the freeswitch logs for a call made today. feature_server_logs_interim_true_debug.txt freeswitch_logs_interim_true_1.txt
OK, so these transcripts look like you should be getting exactly what you are looking for. We are passing the transcripts on as we get them, and if you look at the final transcripts in the log summary below (bolded) you can see you are getting these with no buffering. Please let me know if you see a problem but you should be all set.
One note: I do think that enabling redaction is causing Deepgram to have increased latency in returning final transcripts. You might want to turn that off.
10:59:54.266 - get interim transcript (is_final=false, speech_final=false) "If you know your party's" 10:59:54.266 - send interim transcript to webhook app
10:59:55.287 - get interim transcript (is_final=false, speech_final=false) "If you know your party's extension, you * enter" 10:59:55.287 - send interim transcript to webhook app
10:59:55.907 - get final transcript (is_final=true, speech_final=true) "If you know your party's extension, you may enter that now." 10:59:55.907 - send final transcript to webhook app
10:59:56.928 - get interim transcript (is_final=false, speech_final=false) "Press * for" 10:59:56.928 - send interim transcript to webhook app
10:59:57.489 - get final transcript (is_final=true, speech_final=true) "Press 1 for sales," 10:59:57.489 - send final transcript to webhook app
10:59:58.569 - get interim transcript (is_final=false, speech_final=false) "Press * for support." 10:59:58.569 - send interim transcript to webhook app
10:59:58.830 - get final transcript (is_final=true, speech_final=true) "Press 2 for support." 10:59:58.830 - send final transcript to webhook app
10:59:59.790 - get interim transcript (is_final=false, speech_final=false) "Press * for bill" 10:59:59.790 - send interim transcript to webhook app
11:00:00.251 - get final transcript (is_final=true, speech_final=true) "Press 3 for billing." 11:00:00.251 - send final transcript to webhook app
11:00:01.252 - get interim transcript (is_final=false, speech_final=false) "Press * to speak" 11:00:01.252 - send interim transcript to webhook app
11:00:02.252 - get interim transcript (is_final=false, speech_final=false) "Press * to speak with our retention" 11:00:02.253 - send interim transcript to webhook app
11:00:02.653 - get final transcript (is_final=true, speech_final=true) "Press 4 to speak with our retention department.", 11:00:02.653 - send final transcript to webhook app
11:00:03.694 - get interim transcript (is_final=false, speech_final=false) "Press * if" 11:00:03.694 - send interim transcript to webhook app
11:00:04.554 - get interim transcript (is_final=false, speech_final=false) "Press * if you are looking to start an" 11:00:04.555 - send interim transcript to webhook app
11:00:05.215 - get final transcript (is_final=true, speech_final=true) "Press 5 if you are looking to start an account with us.", 11:00:05.215 - send final transcript to webhook app
11:00:06.316 - get interim transcript (is_final=false, speech_final=false) "Press * if" 11:00:06.316 - send interim transcript to webhook app
11:00:07.197 - get interim transcript (is_final=false, speech_final=false) "Press * if you need to cancel service." 11:00:07.197 - send interim transcript to webhook app
11:00:07.637 - get final transcript (is_final=true, speech_final=true) "Press 6 if you need to cancel service.", 11:00:07.637 - send final transcript to webhook app
11:00:08.618 - get interim transcript (is_final=false, speech_final=false) "Press * to" 11:00:08.618 - send interim transcript to webhook app
11:00:09.379 - get final transcript (is_final=true, speech_final=true) "Press 7 to open a ticket.", 11:00:09.379 - send final transcript to webhook app
11:00:10.359 - get interim transcript (is_final=false, speech_final=false) "Press * to speak" 11:00:10.359 - send interim transcript to webhook app
11:00:11.320 - get final transcript (is_final=true, speech_final=true) "Press 8 to speak with management.", 11:00:11.320 - send final transcript to webhook app
11:00:12.341 - get interim transcript (is_final=false, speech_final=false) "Press * to speak" 11:00:12.341 - send interim transcript to webhook app
11:00:13.302 - get interim transcript (is_final=false, speech_final=false) "Press * to speak to the front des3" 11:00:13.302 - send interim transcript to webhook app
11:00:13.402 - get final transcript (is_final=true, speech_final=true) "Press 9 to speak with the front desk.", 11:00:13.402 - send final transcript to webhook app
11:00:20.367 - get interim transcript (is_final=false, speech_final=false) "If you know your party's" 11:00:20.368 - send interim transcript to webhook app
11:00:21.548 - get interim transcript (is_final=false, speech_final=false) "If you know your party's extension, you * enter it now" 11:00:21.549 - send interim transcript to webhook app
11:00:22.249 - get final transcript (is_final=true, speech_final=true) "If you know your party's extension, you may enter that now.", 11:00:22.249 - send final transcript to webhook app
@simmibadhan please respond on the ticket to the above observations
@davehorton We are partly getting what we are looking for.
1. is_final=true: We need this for the finalized STT output for a time range (say 0-4 secs). We are getting this with interim results.
2. speech_final=true: We need this for the finalized end-of-speech input, which can span many sentences from 1. above, i.e. the combination of many is_final=true sentences. However, in our interim results is_final=true and speech_final=true always arrive together. As a result we are losing the end-of-speech information. speech_final=true depends on the endpointing interval that we used for end-of-speech detection. From the Deepgram reference: "When Deepgram identifies that an endpoint has been reached, the response is marked as speech_final: true. The only purpose of the speech_final marker is to tell you where an endpoint has been found, which indicates that a significantly long pause has been detected."
Also, according to the Deepgram reference:
When using endpointing with interim results, remember:
When speech_final is true, then is_final will always also be true.
When is_final is true, then speech_final may or may not be true.
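The two rules quoted above can be sketched as a small classifier plus a buffer that joins finalized segments until an endpoint arrives. The function names are hypothetical, and the sample events are made up to show the case where is_final is true while speech_final is still false.

```javascript
// Classify a streaming result using the is_final and speech_final flags.
function classifyResult(evt) {
  if (evt.speech_final) return 'endpoint';    // is_final is also true in this case
  if (evt.is_final) return 'final-segment';   // finalized, but no endpoint yet
  return 'interim';
}

// Collect finalized segments and flush them when an endpoint is reached.
function createUtteranceBuffer() {
  const segments = [];
  return {
    // Returns the joined utterance on an endpoint, otherwise null.
    push(evt, transcript) {
      const kind = classifyResult(evt);
      if (kind === 'interim') return null;
      if (transcript) segments.push(transcript);
      if (kind !== 'endpoint') return null;
      const utterance = segments.join(' ');
      segments.length = 0;
      return utterance;
    },
  };
}

// Hypothetical event sequence for illustration only.
const buffer = createUtteranceBuffer();
buffer.push({ is_final: false, speech_final: false }, 'Press * for');
buffer.push({ is_final: true, speech_final: false }, 'Press 1 for sales,');
const done = buffer.push({ is_final: true, speech_final: true }, 'Press 2 for support.');
console.log(done);
```

Interim results are ignored by the buffer because a later finalized segment supersedes them.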
A second way that Deepgram suggests to detect end of speech is the UtteranceEnd message (ref). We were using this with endpointing in the custom-set integration with Deepgram; however, we are not getting that marker now, and if we use the UtteranceEnd setting then we lose the interim results.
In the example above, you did not set endpointing, so you get the default endpointing value of 100 ms. You are then getting all of the Deepgram transcripts as they come. You can look at the vendor property and get the exact transcript as it is sent to us. We are not changing anything. You can also change the endpointing value to a different value if you like.
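For illustration, setting an explicit endpointing value and reading the flags from the vendor property could look like the sketch below. The value 500 is arbitrary, and speech.vendor.evt is an assumed location for the unmodified Deepgram message; check an actual payload before relying on it.

```javascript
// Sketch of recognizer options with an explicit endpointing value.
// 500 is an arbitrary illustration, not a recommendation.
const recognizer = {
  vendor: 'deepgram',
  language: 'en-US',
  interim: true,
  deepgramOptions: {
    model: 'nova-2',
    endpointing: 500, // silence, in milliseconds, before an endpoint is declared
  },
};

// Read the flags from the raw vendor event when it is present.
// ASSUMPTION: the raw Deepgram message is available at speech.vendor.evt.
function rawFlags(speech) {
  const evt = (speech && speech.vendor && speech.vendor.evt) || {};
  return {
    is_final: evt.is_final === true,
    speech_final: evt.speech_final === true,
  };
}

console.log(recognizer.deepgramOptions.endpointing);
console.log(rawFlags({ vendor: { name: 'deepgram', evt: { is_final: true, speech_final: false } } }));
```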
There is no bug that I see with regard to the is_final and speech_final properties, because we are sending exactly what Deepgram is sending.
The only thing you are not getting is an Utterance End event forwarded to you. If you want to narrow the discussion to that, we can do that. If you still feel there is some bug with is_final and speech_final then you need to provide data that disproves the data I have provided you.
We tested with endpointing set to 10 and to 500. Logs are attached here for both cases.
freeswitch_logs_interim_true_endpointing_10.txt feature_server_logs_interim_true_interim_10.txt feature_server_logs_interim_true_endpointing_500.txt freeswitch_logs_interim_true_endpointing_500.txt
By no means am I suggesting this is a bug, it's just the current behavior. Yes we can narrow the discussion to Utterance End event. We just want to solve our problem of getting end of speech one way or another.
@shanbhardwaj please test the PR above #782
If you request interim transcripts, and you are using Deepgram, and you set utterance_end_ms (and don't enable continuous asr) you should get an additional event sent to you when we get an UtteranceEnd event from Deepgram.
This event will be distinguished from a transcription event because it will not have a speech property; instead it will have a speechEvent property.
Please test and send logs from the feature server side (at debug).
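On the application side, telling the two kinds of event apart could be as simple as checking which property is present. In this sketch only the property names speech and speechEvent come from the description above; the contents of speechEvent are not specified here, so it is passed through untouched, and the handler names are hypothetical.

```javascript
// Route a hook payload by which property it carries.
function routePayload(body, handlers) {
  if (body && body.speechEvent !== undefined) {
    return handlers.onSpeechEvent(body.speechEvent); // e.g. an utterance-end notification
  }
  if (body && body.speech !== undefined) {
    return handlers.onTranscript(body.speech);       // regular transcription event
  }
  return handlers.onUnknown(body);
}

// Hypothetical handlers that just label what they received.
const handlers = {
  onSpeechEvent: () => 'utterance-end',
  onTranscript: (speech) => (speech.is_final ? 'final' : 'interim'),
  onUnknown: () => 'ignored',
};

console.log(routePayload({ speechEvent: {} }, handlers));
console.log(routePayload({ speech: { is_final: false } }, handlers));
```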
If utteranceEndMs or endpointing is set for Deepgram, then the finalized sentences are not received until the UtteranceEnd or speech_final=true marker is received at the feature server. This causes a delayed experience at the user's end. This buffering here is causing us an issue: https://github.com/jambonz/jambonz-feature-server/blob/main/lib/tasks/gather.js#L929