Closed: hummbugg closed this issue 3 years ago
Hi David Supplee, I noticed that you marked this issue as a question. This is actually a bug I am trying to report. Is there someone who specializes in the Speech API that could possibly take a look at this problem by running the code I have supplied?
Currently this bug makes the Speech API completely useless. I suspect that at some point it may have been working fine, until someone changed the code related to recognizing multiple speakers within a single audio clip. Most APIs, for example the IBM Watson Speech API and the Amazon Speech API, will by default return a single transcript without any speaker separation. In these cases all the words spoken by anyone are returned in a single transcript with no speaker identifiers.
By default, the Google Speech API tries to create a transcript that identifies individual speakers, but it fails miserably by excluding higher-pitched voices and, in some cases, the first word spoken by a particular speaker. It has no ability to perform in the default mode described in the paragraph above; instead, it is stuck in multi-speaker recognition mode.
Please help direct this to the appropriate person who might be on the Speech API Team.
Thank you all for listening. I will continue to use IBM Watson until this problem is resolved, and I look forward to using this API in the near future because it appears to have a lot of promise for improving my applications for the hearing impaired. Thanks again.
@hummbugg,
Would you be able to open an issue against the public issue tracker used for the Speech API? It can be found here. This repository leans more towards issues with the client libraries themselves, and after some more review it looks like the problem you're encountering may be with the API itself. If I've got that incorrect, please do re-open this issue and we can look further into whether this is an issue with the code hosted here. Thanks for your time and the detailed issue.
Environment details
Steps to reproduce
I am currently using version 1.2.1 according to the vendor\google\cloud\Speech\VERSION file. The Speech API was installed via "composer require google/cloud" as part of the full cloud API.
I suspect the problem is related to speakerTag always being zero, and to ongoing code changes for differentiating multiple speakers' voice characteristics that do not yet handle certain scenarios.
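One way to check the speakerTag suspicion is to group the returned words by their tag and see whether everything lands under a single value. The following is only a minimal sketch, not the client library's API: it assumes the word-level results have already been decoded into plain dictionaries with `word` and `speakerTag` keys, and the sample data is made up for illustration.

```python
from collections import defaultdict


def group_by_speaker(words):
    """Map each speakerTag to the words attributed to it, in spoken order."""
    grouped = defaultdict(list)
    for entry in words:
        # Assumption: a missing tag is treated the same as tag 0.
        grouped[entry.get("speakerTag", 0)].append(entry["word"])
    return {tag: " ".join(parts) for tag, parts in grouped.items()}


def single_transcript(words):
    """Join every word into one transcript, ignoring speaker tags."""
    return " ".join(entry["word"] for entry in words)


# Hypothetical word list, not taken from the real audio file.
sample = [
    {"word": "good", "speakerTag": 0},
    {"word": "morning", "speakerTag": 0},
    {"word": "hello", "speakerTag": 0},
]

by_speaker = group_by_speaker(sample)
all_tags_zero = set(by_speaker) == {0}
```

If `all_tags_zero` is true for a clip with several speakers, the tags carry no speaker information, and `single_transcript` shows what a combined transcript of the same words would look like.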
The thing I am concerned about is that not all people speaking in the audio are being recognized and transcribed. For example, I have an audio wave file that has several people speaking:
1) Teacher
2) Little Boy #1
3) Little Girl #1
4) Little Girl #2
5) Little Girl #3
6) Little Boy #2
The teacher is the first to speak, followed by Little Girl #1, then Little Girl #2, then Little Boy #2.
All voices were recognized and transcribed with the exception of Little Girl #1. In fact, throughout the entire video, Little Girl #1, who speaks very clearly, is never transcribed!
Here is a link to the video that I posted with closed captions that I created from the Google Speech API to test the API: https://vimeo.com/455662126/5610c6b265 In addition, several words were transcribed incorrectly, while the YouTube auto CC generator gets them right.
The audio wave file that I used as a source was extracted from the MP4 video using:
ffmpeg -i "03 Joining In Questions Comments.mp4" -ar 48000 -ac 1 "03 Joining In Questions Comments.wav"
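Before sending the extracted file to the API, it can help to confirm that the WAV header really reports 48000 Hz and one channel, as requested by `-ar 48000 -ac 1`. This is a minimal sketch using Python's standard `wave` module; the helper names `describe_wav` and `matches_expected` are made up for illustration, and it assumes the file is uncompressed PCM.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) from a WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def matches_expected(path, rate=48000, channels=1):
    """Check the file against the rate and channel count passed to ffmpeg."""
    actual_rate, actual_channels, _ = describe_wav(path)
    return actual_rate == rate and actual_channels == channels
```

For example, `matches_expected("03 Joining In Questions Comments.wav")` should return `True` if the extraction applied both options; the sample rate declared in the recognition request would need to agree with the same value.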
SOURCE VIDEO (download to the same directory as the PHP script below) Here is a download link to the original "03 Joining In Questions Comments.mp4": https://content.streamhoster.com/file/apsva/03_Joining_In_Questions_Comments.mp4?dl=1
Here are two versions of the audio source wave test files; both produce the exact same text from the Google Speech API:
I also uploaded the video to YouTube, which auto-generated closed captions for it: https://www.youtube.com/watch?v=_hyET4U2xcM
Running the PHP Instructions
At this point you should have the following files all in the same directory:
My_Script.php
03 Joining In Questions Comments.mp4 (downloaded)
03 Joining In Questions Comments.srt (created by running My_Script.php)
03 Joining In Questions Comments.wav (downloaded audio source)
If you don't have VLC you can download it here: https://www.videolan.org/vlc/download-windows.html
Important Notice!!! I have also used other audio source files, and some unexpected text is inserted into the results. This audio file does not contain the acronym "BFF" (meaning "best friends forever") anywhere, yet it appears in the results! I am going to open another ticket on this problem with better examples of text insertion coming from the server, maybe from a thesaurus database or something.
Code example