jambonz / jambonz-webapp

A simple provisioning web app for jambonz
MIT License

In recording popup window calculate speech recognizer latency #327

Closed: davehorton closed this issue 11 months ago

davehorton commented 12 months ago

In the recording window, when we check "Overlay STT and DTMF events" we get a picture that gives us a rough sense of the latency of the speech recognizer: [screenshot]

In that picture, the latency is the length of time between the end of speech energy in the light red horizontal bar and the end of that bar; the bar (span) ends when we get a transcript back from the recognizer. So we can roughly see that the latency was a bit less than a second, but it is not precisely measured. We need to figure out how to calculate the latency precisely and show it in the popup: [screenshot] i.e. when there is a transcript in the popup we should include a "latency" field that shows the calculated latency of the speech recognition service in seconds (with fractional precision to the millisecond).
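For the display part, a minimal sketch of the calculation and formatting, assuming we already have both timestamps in seconds from the start of the recording; `formatSttLatency` and its parameters are hypothetical names, not existing code in the webapp:

```typescript
// Hypothetical helper: given the time at which speech energy ends and the
// time at which the transcript event arrives (both in seconds from the
// start of the recording), compute and format the recognizer latency.
export const formatSttLatency = (
  speechEndSecs: number,
  transcriptSecs: number
): string => {
  const latency = transcriptSecs - speechEndSecs;
  // guard against a negative value if the end-of-speech estimate is off
  if (latency < 0) return "n/a";
  // seconds with millisecond precision, e.g. "0.874 s"
  return `${latency.toFixed(3)} s`;
};
```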

This will be challenging because we need to determine the start point to measure from - that point where the speech energy goes to zero. There is no event in the trace that will give us that time point, so I see two possible solutions:

  1. We post-process the audio stream in the webapp to calculate speech energy and determine all of the time points where silence is detected following speech. We then find the last such time point within the light red bar/span, use it as the start time, calculate the latency, and display it in the popup (a sketch of this approach follows this list).
  2. If we can't do option 1 (which is certainly the preferred approach), we could allow the user to click on the timeline within the transcribe span and use that time point as the start of the calculation.
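Here is a rough sketch of option 1, assuming the webapp can obtain decoded mono PCM samples for the recording (e.g. from an AudioBuffer) along with the span boundaries. Frame-level RMS energy is compared against a threshold to find the last speech-to-silence transition inside the transcribe span; the function name, frame size, and threshold are illustrative, not part of the existing codebase:

```typescript
// Find the last point inside the transcribe span where speech drops to
// silence, using frame-level RMS energy. Returns the time in seconds from
// the start of the recording, or null if no transition is found.
export const findLastSpeechEnd = (
  samples: Float32Array,   // mono PCM, range [-1, 1]
  sampleRate: number,      // e.g. 8000 or 16000
  spanStartSecs: number,   // start of the transcribe span
  spanEndSecs: number,     // end of the span (transcript received)
  frameMs = 20,            // analysis frame size
  energyThreshold = 0.01   // RMS below this counts as silence
): number | null => {
  const frameLen = Math.round((sampleRate * frameMs) / 1000);
  const startFrame = Math.floor((spanStartSecs * sampleRate) / frameLen);
  const endFrame = Math.floor((spanEndSecs * sampleRate) / frameLen);
  let lastSpeechEnd: number | null = null;
  let prevIsSpeech = false;

  for (let f = startFrame; f <= endFrame; f++) {
    const begin = f * frameLen;
    const frame = samples.subarray(begin, begin + frameLen);
    if (frame.length === 0) break;
    let sum = 0;
    for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
    const rms = Math.sqrt(sum / frame.length);
    const isSpeech = rms >= energyThreshold;
    // record each speech -> silence transition; the last one found inside
    // the span is the start point for the latency measurement
    if (prevIsSpeech && !isSpeech) {
      lastSpeechEnd = (f * frameLen) / sampleRate;
    }
    prevIsSpeech = isSpeech;
  }
  return lastSpeechEnd;
};
```

The latency would then be `spanEndSecs - findLastSpeechEnd(...)`, fed into something like the formatting helper above.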
davehorton commented 12 months ago

There is another more "quick and dirty" option we could consider for the near term. If we allowed the user to click and drag on the timeline to create a segment, they could click at the start of the silence period, drag, and release at the point where the transcript is returned. On release we could calculate the length of time represented by the rectangle they just created and display that in a popup or some sort of annotation (a rough sketch follows).
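A rough sketch of that interaction, assuming the timeline is rendered in an element whose width maps linearly onto the recording duration; the element handling, function name, and callback are illustrative only:

```typescript
// Convert a click-and-drag on the timeline element into a measured number
// of seconds and hand it to a callback for display.
const attachDragMeasure = (
  timelineEl: HTMLElement,
  durationSecs: number,
  onMeasure: (seconds: number) => void
) => {
  let dragStartX: number | null = null;

  const pxToSecs = (clientX: number) => {
    const rect = timelineEl.getBoundingClientRect();
    const ratio = (clientX - rect.left) / rect.width;
    return Math.min(Math.max(ratio, 0), 1) * durationSecs;
  };

  timelineEl.addEventListener("mousedown", (e) => {
    dragStartX = e.clientX;
  });

  timelineEl.addEventListener("mouseup", (e) => {
    if (dragStartX === null) return;
    // the measured latency is the time represented by the dragged segment
    const seconds = Math.abs(pxToSecs(e.clientX) - pxToSecs(dragStartX));
    onMeasure(seconds);
    dragStartX = null;
  });
};
```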

This is less desirable than actually processing the audio to calculate the latencies in the recording, but it could serve as a short-term fix if the better solution cannot be accomplished.