gillesdemey / google-speech-v2

:speech_balloon: Reverse Engineering Google's Speech To Text API (v2)

PCM support #2

Closed Rudloff closed 10 years ago

Rudloff commented 10 years ago

Hello,

You say that the API supports l16 PCM, but I always get empty results when I send a WAV file:

$ curl -X POST --data-binary @good-morning-google.flac --header 'Content-Type: audio/x-flac; rate=44100;' 'https://www.google.com/speech-api/v2/recognize?output=json&lang=en-us&key=AIzaSyCnl6MRydhw_5fLXIdASxkLJzcJh5iX0M4'
{"result":[]}
{"result":[{"alternative":[{"transcript":"good morning Google how are you feeling today"}],"final":true}],"result_index":0}

$ avconv -i good-morning-google.flac good-morning-google.wav 
avconv version 0.8.10-6:0.8.10-1, Copyright (c) 2000-2013 the Libav developers
  built on Feb  5 2014 03:52:19 with gcc 4.7.2
Input #0, flac, from 'good-morning-google.flac':
  Duration: 00:00:02.58, bitrate: 389 kb/s
    Stream #0.0: Audio: flac, 44100 Hz, 2 channels, s16
Output #0, wav, to 'good-morning-google.wav':
  Metadata:
    encoder         : Lavf53.21.1
    Stream #0.0: Audio: pcm_s16le, 44100 Hz, 2 channels, s16, 1411 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (flac -> pcm_s16le)
Press ctrl-c to stop encoding
size=     445kB time=2.59 bitrate=1411.3kbits/s    
video:0kB audio:445kB global headers:0kB muxing overhead 0.010085%

$ curl -X POST --data-binary @good-morning-google.wav --header 'Content-Type: audio/l16; rate=44100;' 'https://www.google.com/speech-api/v2/recognize?output=json&lang=en-us&key=AIzaSyCnl6MRydhw_5fLXIdASxkLJzcJh5iX0M4'
{"result":[]}

Am I doing something wrong, or should we update the docs?

gillesdemey commented 10 years ago

According to the following snippet (taken from the Google Hotword extension for Chrome and adapted for brevity), L16 PCM should be supported. Maybe it doesn't accept a .wav container?

var b = new N("https://www.google.com/speech-api/v2/recognize?output=json&lang=en-us&app=web-hotword");
Q(b, "client", "chrome-hotword");
Q(b, "key", "AIzaSyCnl6MRydhw_5fLXIdASxkLJzcJh5iX0M4");
var c = {};
c["Content-Type"] = "audio/l16; rate=" + a.Pa;
var d = n(a.tb, a),
  e = new Int16Array(4096 * a.t.length);
if (0 <= a.s) {
  var f = a.s + 1,
    h = 0;
  do {
    f >= a.t.length && (f = 0);
    4096 != a.t[f].length && a.a.log(Za, "ERROR: buffer size " + a.t[f].length, void 0);
    for (var m = 0; 4096 > m; ++m) e[h++] = a.t[f][m]
  } while (f++ != a.s)
}
$b(b, d, "POST", e, c)

The Int16Array represents an array of two's-complement 16-bit signed integers.
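To make that byte layout concrete, here is a small sketch (not taken from the extension) of how signed 16-bit samples could be serialized into a request body. The helper name `samplesToL16Bytes` is hypothetical, and the little-endian byte order is an assumption: typed arrays use the platform's native order, which is little-endian on common desktop hardware, but nothing in this thread confirms what the API expects.

```javascript
// Hypothetical helper: serialize signed 16-bit samples as little-endian bytes.
function samplesToL16Bytes(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  samples.forEach((sample, i) => {
    // The final `true` selects little-endian. This is an assumption,
    // not something the API is known to require.
    view.setInt16(i * 2, sample, true);
  });
  return bytes;
}

const bytes = samplesToL16Bytes(new Int16Array([1, -1, 256]));
console.log(Array.from(bytes)); // [ 1, 0, 255, 255, 0, 1 ]
```

Note that -1 becomes the two bytes 255, 255, which is the two's-complement representation mentioned above.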

44100 Hz is the rate they use; I confirmed this while debugging the extension.

To be fair, I haven't gotten it to work with WAV L16 PCM yet. I should take some time to capture the packets being sent with Wireshark so I can debug the payload.
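One way to test the container hypothesis without Wireshark is to strip the RIFF/WAVE header and send only the sample data. The sketch below is a minimal chunk walker for Node.js; `extractPcmFromWav` is a hypothetical helper, it expects a Node `Buffer`, and this thread does not confirm that headerless PCM is what the API wants.

```javascript
// Hypothetical helper: pull the raw PCM samples out of a RIFF/WAVE Buffer.
function extractPcmFromWav(wav) {
  const isRiff = wav.toString("ascii", 0, 4) === "RIFF";
  const isWave = wav.toString("ascii", 8, 12) === "WAVE";
  if (!isRiff || !isWave) {
    throw new Error("not a RIFF/WAVE file");
  }
  let format = null;
  let offset = 12; // the first chunk starts after the 12-byte RIFF header
  while (offset + 8 <= wav.length) {
    const id = wav.toString("ascii", offset, offset + 4);
    const size = wav.readUInt32LE(offset + 4);
    const body = offset + 8;
    if (id === "fmt ") {
      // Field offsets follow the standard PCM "fmt " chunk layout.
      format = {
        channels: wav.readUInt16LE(body + 2),
        sampleRate: wav.readUInt32LE(body + 4),
        bitsPerSample: wav.readUInt16LE(body + 14),
      };
    } else if (id === "data") {
      // The data chunk body is the headerless sample stream.
      return { ...format, pcm: wav.subarray(body, body + size) };
    }
    offset = body + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error("no data chunk found");
}
```

Writing the returned `pcm` bytes to a file and passing that file to `--data-binary` in place of the .wav would show whether the header is the problem. The returned `channels` value is also worth checking, since the avconv output above reports a 2-channel stream.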

cnbuff410 commented 10 years ago

When I tried the WAV file, I didn't even get an empty result. What I got was:

<!DOCTYPE html>

Error 400 (Bad Request)!!1

400. That’s an error.

Your client has issued a malformed or illegal request. Unknown audio encoding: 116 That’s all we know.

gillesdemey commented 10 years ago

The encoding is called l16 (a lowercase L followed by 16), not 116 (one-one-six). Can you double-check that?

cnbuff410 commented 10 years ago

Hmm, thanks for the suggestion! Yes, it works now, but the result is not correct. I tested the same audio in both FLAC and WAV format: the FLAC file returned the correct text, while the WAV file returned a completely wrong answer.

I only changed the filename and the Content-Type. Is that expected?

gillesdemey commented 10 years ago

Can you post what the API returned?

Changing the Content-Type header and choosing the correct matching file should suffice.
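Since the earlier 400 error came from a one-character typo in the header, a small guard when building the header value can catch it before the request is sent. This is only a sketch: `contentTypeFor` and `KNOWN_ENCODINGS` are hypothetical names, and the set lists just the two encodings that appear in this thread, not everything the API may accept.

```javascript
// The two encodings seen in this thread; the API may accept others.
const KNOWN_ENCODINGS = new Set(["audio/l16", "audio/x-flac"]);

// Hypothetical helper: build the Content-Type header value for an upload.
function contentTypeFor(encoding, rate) {
  if (!KNOWN_ENCODINGS.has(encoding)) {
    throw new Error(`unexpected encoding "${encoding}"`);
  }
  if (!Number.isInteger(rate) || rate <= 0) {
    throw new Error(`invalid sample rate: ${rate}`);
  }
  // Same shape as the header used in the curl commands in this thread.
  return `${encoding}; rate=${rate};`;
}

console.log(contentTypeFor("audio/l16", 44100)); // audio/l16; rate=44100;
```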

cnbuff410 commented 10 years ago

If I use the FLAC file, this is what I get:

{"result":[]}
{"result":[{"alternative":[{"transcript":"hello send this message to Google"},{"transcript":"send this message to Google"}],"final":true}],"result_index":0}

If I use the WAV file, this is what I get:

{"result":[]}
{"result":[{"alternative":[{"transcript":"Shin injuries"},{"transcript":"shin injury"},{"transcript":"Sean Lennon"},{"transcript":"Sherman interview"},{"transcript":"Shawn Ashmore Inn"}],"final":true}],"result_index":0}

Another question: do you know why I get two results, and why the first one is always empty?

This is the command I was using:

curl -X POST \
  --data-binary @audio/test1.wav \
  --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36' \
  --header 'Content-Type: audio/l16; rate=44100;' \
  'https://www.google.com/speech-api/v2/recognize?output=json&lang=en-us&key=AIzaSyCnl6MRydhw_5fLXIdASxkLJzcJh5iX0M4'
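Whatever the reason for the empty first object, a client can cope with it by parsing the body one JSON object at a time and skipping empty results. The sketch below assumes the objects are separated by newlines, as in the first curl output in this thread; `parseSpeechResponse` is a hypothetical helper, not part of any library.

```javascript
// Hypothetical helper: collect transcripts from a body holding one JSON
// object per line (assumed newline-separated).
function parseSpeechResponse(body) {
  const transcripts = [];
  for (const line of body.split("\n")) {
    if (line.trim() === "") continue; // ignore blank lines
    const { result = [] } = JSON.parse(line);
    // An empty "result" array simply contributes nothing.
    for (const entry of result) {
      for (const alternative of entry.alternative || []) {
        transcripts.push(alternative.transcript);
      }
    }
  }
  return transcripts;
}

const body =
  '{"result":[]}\n' +
  '{"result":[{"alternative":[{"transcript":"hello send this message to Google"},' +
  '{"transcript":"send this message to Google"}],"final":true}],"result_index":0}\n';
console.log(parseSpeechResponse(body));
```

The alternatives come back in the order the API lists them, so the first element can be treated as the preferred transcript if that ordering holds.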

gillesdemey commented 10 years ago

Alright, I've figured out how to get the PCM 16-bit encoding working. Will update the README accordingly and add an example to the audio folder.