googleapis / google-cloud-php

Google Cloud Client Library for PHP
https://cloud.google.com/php/docs/reference
Apache License 2.0

Bug Report PHP: Google Cloud Speech to Text API not Recognizing all Speakers #3411

Closed hummbugg closed 3 years ago

hummbugg commented 4 years ago

Environment details

Steps to reproduce

I am currently using version 1.2.1 according to the vendor\google\cloud\Speech\VERSION file. The Speech API was installed via "composer require google/cloud" as part of the full cloud API.

I suspect the problem could be related to speakerTag always being zero, and that some ongoing code changes related to differentiating multiple speakers' voice characteristics are missing some code under certain scenarios.

The thing I am concerned about is that not all people speaking in the audio are being recognized and transcribed. For example, I have an audio wave file with several people speaking:

  1. Teacher
  2. Little Boy #1
  3. Little Girl #1
  4. Little Girl #2
  5. Little Girl #3
  6. Little Boy #2

The teacher is the first to speak followed by Little Girl #1 followed by Little Girl #2 followed by Little Boy #2

All voices were recognized and transcribed with the exception of Little Girl #1; in fact, throughout the entire video Little Girl #1, who speaks very clearly, is never transcribed!

Here is a link to the video that I posted with closed captions that I created from the Google Speech API to test the API: https://vimeo.com/455662126/5610c6b265 In addition, there are several words that were transcribed incorrectly that the YouTube auto-CC generator gets right.

The audio wave file that I used as a source was extracted from the MP4 video using:

ffmpeg -i "03 Joining In Questions Comments.mp4" -ar 48000 -ac 1 "03 Joining In Questions Comments.wav"

SOURCE VIDEO (download to the same directory as the PHP script below). Here is a download link to the original "03 Joining In Questions Comments.mp4": https://content.streamhoster.com/file/apsva/03_Joining_In_Questions_Comments.mp4?dl=1

Here are two versions of the audio source wave test files; both produce exactly the same text from the Google Speech API:

  1. The original as output from ffmpeg: https://content.streamhoster.com/file/apsva2/03_Joining_In_Questions_Comments_orginal.wav?dl=1
  2. Copy of the original with a noise filter applied by Audacity to reduce the background hiss: https://content.streamhoster.com/file/apsva2/03_Joining_In_Questions_Comments_noise_reduction.wav?dl=1

I also uploaded the video to YouTube; here are its auto-generated closed captions: https://www.youtube.com/watch?v=_hyET4U2xcM

Instructions for Running the PHP Script

  1. Download the original audio source wave test file from the link (1) above into the same directory as the PHP script below.
  2. Be sure that the $audioFile variable matches the file name of the downloaded audio source test file.
  3. Change all "BUCKET-NAME" to your test bucket name.
  4. Run the script and an output file will be created in the same directory with the same name as the source only with a ".srt" file extension.
  5. Open the mp4 video file you downloaded from "SOURCE VIDEO" above using the VLC player (note: the ".mp4" file should have the same base name as the ".srt" output by the script). VLC detects the srt file automatically and shows the Speech-to-Text closed caption subtitles as the video plays. This makes it easy to see which person in the video is speaking and which speech was not recognized and transcribed due to the bug I am reporting.

At this point you should have the following files all in the same directory:

  * My_Script.php
  * 03 Joining In Questions Comments.mp4 (downloaded)
  * 03 Joining In Questions Comments.srt (created by running My_Script.php)
  * 03 Joining In Questions Comments.wav (downloaded audio source)

If you don't have VLC you can download it here: https://www.videolan.org/vlc/download-windows.html
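As a sanity check for step 2, the script derives the ".srt" output name from the ".wav" source name with basename(); this snippet mirrors the $outputFile assignment in the code example below:

```php
<?php
// How the .srt output name is derived from the .wav source name
// (same logic as the $outputFile assignment in the script below).
$audioFile  = '03 Joining In Questions Comments.wav';
$outputFile = basename($audioFile, ".wav") . '.srt';
echo $outputFile;   // 03 Joining In Questions Comments.srt
```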

Important notice: I have also used other audio source files, and some unexpected text is inserted into the results. This audio file does not contain the acronym "BFF" (meaning "best friends forever") being said anywhere, yet it appears in the results! I am going to open another ticket on this text-insertion problem with better examples; the inserted text seems to come from the server, perhaps from a thesaurus database or something similar.

Code example

<?php

// YouTube CC max 40 characters per line
// Vimeo CC max 32 characters per line

# Includes the autoloader for libraries installed with composer
require __DIR__ . '/vendor/autoload.php';

# Imports the Google Cloud client library
use Google\Cloud\Speech\V1\SpeechClient;
use Google\Cloud\Speech\V1\RecognitionAudio;
use Google\Cloud\Speech\V1\RecognitionConfig;
use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;
use Google\Cloud\Speech\V1\SpeakerDiarizationConfig;
use Google\Cloud\Storage\StorageClient;

# The name of the audio file to transcribe
$audioFile  = '03 Joining In Questions Comments.wav';
//$audioFile = 'Ellen and Steve Harvey Talk to Kids.wav';
//Ellen and Steve Harvey Talk to Kids
//echo $audioFile . '<br/>';
$outputFile = basename($audioFile, ".wav") . '.srt';

// Extract audio from mp4 via ffmpeg (uncomment on localhost only)
//exec('ffmpeg -i "Ellen and Steve Harvey Talk to Kids.mp4" -ar 48000 -ac 1 "Ellen and Steve Harvey Talk to Kids.wav"');
//die;

// Create a Storage Client
$storage = new StorageClient();
$bucket  = $storage->bucket('lgr-aps-s2t-srt');
// Upload the audio wav file to the bucket.
$bucket->upload(fopen($audioFile, 'r'));

// Process file from bucket
$uri = "gs://lgr-aps-s2t-srt/$audioFile";

// Delete output file because it will be appended otherwise
if (file_exists($outputFile)) {
    unlink($outputFile);
}

# set string as audio content
$audio = new RecognitionAudio();

# Use uri for bucket or content for local file
//$audio->setContent($content);
$audio->setUri($uri);

# Diarization configuration settings
# NOTE: Only setEnableSpeakerDiarization changes the output:
#   true  outputs all it can, but not all speakers
#   false outputs the same data as true, but only the first ~50% of the text????
# Every permutation of setMinSpeakerCount and setMaxSpeakerCount I tried had absolutely no effect
$diarizationConfig = new SpeakerDiarizationConfig();
$diarizationConfig->setEnableSpeakerDiarization(true); // Needs lots of work
$diarizationConfig->setMinSpeakerCount(1); // Ignored
$diarizationConfig->setMaxSpeakerCount(6); // Ignored

# Recognition Configuration Settings
$recognitionConfig = new RecognitionConfig();
//$recognitionConfig->setEncoding(AudioEncoding::LINEAR16); // Not needed let it inherit from source
//$recognitionConfig->setSampleRateHertz(44100); // Not needed let it inherit from source
$recognitionConfig->setLanguageCode('en-US'); // untested
$recognitionConfig->setEnableWordTimeOffsets(true); // Works!
$recognitionConfig->setEnableAutomaticPunctuation(true); // Works!
$recognitionConfig->setDiarizationConfig($diarizationConfig);
$recognitionConfig->setAudioChannelCount(1); // Works, however diarization of stereo appears not to be implemented???
$recognitionConfig->setMaxAlternatives(10); // No effect; the server only ever returns Alternatives[0] (iterator miscoded?)
$recognitionConfig->setProfanityFilter(false); // untested

$config = $recognitionConfig;

# Instantiates a client
$client = new SpeechClient();

# Detects speech in the audio file
$operation = $client->longRunningRecognize($config, $audio);
$operation->pollUntilComplete();

// Initialize variables used in SRT routines
$lineCounter              = 0;
$sampleTimeMiliseconds    = 2200;
$lastStartTimeMiliseconds = 0;
$wordStack                = '';
$firstLine                = true;
$wordCount                = 0;
$characterCount           = 0;

if ($operation->operationSucceeded()) {
    $response = $operation->getResult();

    // each result is for a consecutive portion of the audio. iterate
    // through them to get the transcripts for the entire audio file.
    foreach ($response->getResults() as $result) {
        //var_dump($result);
        //die;
        $alternatives = $result->getAlternatives();
        $mostLikely   = $alternatives[0];
        $transcript   = $mostLikely->getTranscript();
        $confidence   = $mostLikely->getConfidence();
        foreach ($mostLikely->getWords() as $wordInfo) {
            $startTime  = $wordInfo->getStartTime();
            $endTime    = $wordInfo->getEndTime();
            $speakerTag = $wordInfo->getSpeakerTag();
            //echo '$speakerTag = ' . $speakerTag . '<br/>';
            $theWord    = $wordInfo->getWord();
            echo $theWord . ' ';
            $startTimeMiliseconds = convertSecondsToMiliseconds($startTime->serializeToJsonString());
            $endTimeMiliseconds   = convertSecondsToMiliseconds($endTime->serializeToJsonString());

            if ($firstLine == true) {
                $lastStartTimeMiliseconds = $startTimeMiliseconds;
                $firstLine                = false;
            }

            //if( ($endTimeMiliseconds - $lastStartTimeMiliseconds > $sampleTimeMiliseconds) || ($wordCount > 5) || $characterCount > 32 ){
            if (($endTimeMiliseconds - $lastStartTimeMiliseconds > $sampleTimeMiliseconds) || $characterCount > 32) {
                $wordStack .= $theWord . ' ';
                writeLine($lastStartTimeMiliseconds, $endTimeMiliseconds, $wordStack, $speakerTag, $lineCounter, $outputFile);
                $wordStack      = '';
                $wordCount      = 0;
                $characterCount = 0;
                $firstLine      = true;
            } else {
                $wordStack .= $theWord . ' ';
                $characterCount = strlen($wordStack);
                $wordCount += 1;
            }
        }
    }
} else {
    print_r($operation->getError());
}

$client->close();

// Flush the last (partial) line, if any words remain
if ($wordStack !== '') {
    writeLine($lastStartTimeMiliseconds, $endTimeMiliseconds, $wordStack, $speakerTag, $lineCounter, $outputFile);
}

/////////////////////////////////////////////
//Functions
/////////////////////////////////////////////
function writeLine($lastStartTimeMiliseconds, $currentEndTimeMiliseconds, $wordStack, $speakerTag, &$lineCounter, $outputFile)
{
    $lineCounter += 1;
    $startTimeFormatted = formatMilliseconds($lastStartTimeMiliseconds);
    $endTimeFormatted   = formatMilliseconds($currentEndTimeMiliseconds);
    // NOTE: $speakerTag is received but not yet written into the SRT output
    $outputLineBuffer  = $lineCounter . "\r\n";
    $outputLineBuffer .= $startTimeFormatted . ' --> ' . $endTimeFormatted . "\r\n";
    $outputLineBuffer .= $wordStack . "\r\n" . "\r\n";
    file_put_contents($outputFile, $outputLineBuffer, FILE_APPEND | LOCK_EX);
}

function convertSecondsToMiliseconds($seconds)
{
    $seconds = str_replace("s", "", $seconds);
    $seconds = str_replace('"', "", $seconds);
    if (strpos($seconds, ".") === false) {
        $seconds .= ".000";
    }
    $pieces           = explode(".", $seconds);
    $secondsPiece     = $pieces[0];
    $milisecondsPiece = $pieces[1];
    $totalMiliseconds = ((int) $secondsPiece * 1000) + $milisecondsPiece;
    return $totalMiliseconds;
}

function formatMilliseconds($milliseconds)
{
    $seconds      = floor($milliseconds / 1000);
    $minutes      = floor($seconds / 60);
    $hours        = floor($minutes / 60);
    $milliseconds = $milliseconds % 1000;
    $seconds      = $seconds % 60;
    $minutes      = $minutes % 60;
    $format       = '%02u:%02u:%02u';
    // Left-pad the milliseconds so e.g. 50 ms renders as ",050" rather than ",500"
    $time         = sprintf($format, $hours, $minutes, $seconds) . ',' . str_pad($milliseconds, 3, '0', STR_PAD_LEFT);
    return $time;
}

?>
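For anyone who wants to verify the two timestamp helpers in isolation, here is a minimal standalone sketch. The function names durationJsonToMs and msToSrtTimestamp are mine, but the logic mirrors the helpers above: the input strings mimic what Duration::serializeToJsonString() returns (e.g. "3.100s" in quotes), and the milliseconds are left-padded so that 50 ms renders as ",050":

```php
<?php
// Standalone check of the timestamp conversion used in the script above.

// Parse a protobuf Duration JSON string such as "\"3.100s\"" into milliseconds.
function durationJsonToMs(string $json): int
{
    $seconds = str_replace(['s', '"'], '', $json);
    [$sec, $frac] = array_pad(explode('.', $seconds), 2, '000');
    // Right-pad the fraction so ".1" means 100 ms, then add to whole seconds.
    return ((int) $sec) * 1000 + (int) str_pad($frac, 3, '0');
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm
function msToSrtTimestamp(int $ms): string
{
    return sprintf(
        '%02u:%02u:%02u,%03u',
        intdiv($ms, 3600000),
        intdiv($ms, 60000) % 60,
        intdiv($ms, 1000) % 60,
        $ms % 1000
    );
}

echo durationJsonToMs('"3.100s"') . "\n";  // 3100
echo msToSrtTimestamp(3100) . "\n";        // 00:00:03,100
echo msToSrtTimestamp(3723050) . "\n";     // 01:02:03,050
```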

Making sure to follow these steps will guarantee the quickest resolution possible.

Thanks!

hummbugg commented 4 years ago

Hi David Supplee, I noticed that you marked this issue as a question. This is actually a bug I am trying to report. Is there someone who specializes in the Speech API that could possibly take a look at this problem by running the code I have supplied?

Currently this bug makes the Speech API completely useless for my purposes. I suspect that at some point it may have been working fine until someone changed the code related to recognizing multiple speakers within a single audio clip. Most APIs, for example the IBM Watson and Amazon speech APIs, by default return a single transcript without any speaker separation; in those cases all the words spoken by anyone are returned in one transcript with no speaker identifiers.

This Google Speech API by default tries to create a transcript that identifies individual speakers but fails miserably, excluding higher-pitched voices and, in some cases, the first word spoken by a particular speaker. It has no ability to operate in the default mode described in the paragraph above; instead it is stuck in multi-speaker recognition mode.

Please help direct this to the appropriate person who might be on the Speech API Team.

Thank you all for listening. I will continue to use IBM Watson until this problem is resolved, and I look forward to using this API in the near future because it shows a lot of promise for improving my applications for the hearing impaired. Thanks again.

dwsupplee commented 3 years ago

@hummbugg,

Would you be able to open an issue against the public issue tracker used for the Speech API? It can be found here. This repository leans more towards issues with the client libraries themselves, and after some more review it looks like the problem you're encountering may be with the API itself. If I've got that incorrect, please do re-open this issue and we can look further into whether this is an issue with the code hosted here. Thanks for your time and the detailed issue.