mozilla / connected-devices-experiments

INACTIVE - http://mzl.la/ghe-archive - A place to publish experiments and investigations from the Connected Devices team

Voice Services from a Developer's Perspective #31

Closed jedireza closed 8 years ago

jedireza commented 8 years ago

api.ai

The signup process is totally frictionless. I just logged in using GitHub; they only requested my email address, which was fine with me.

Once I was logged in, there was a getting started video popup, which explains things nicely. See it on YouTube.

It only took me a few minutes to create a new agent, and an intent. I named my intent hello and programmed it to listen for the user saying "hello" and defined the action as human_greeting. I set "hi there" as the speech response for this action. The best part is I can test all this in my browser, immediately.

I said "hello" and got this JSON response (near instantly).

{
  "id": "6950f2bf-74bb-443c-bf81-f9a0e7e8b47d",
  "timestamp": "2016-04-06T00:24:33.569Z",
  "result": {
    "source": "agent",
    "resolvedQuery": "hi",
    "speech": "hi there",
    "action": "human_greeting",
    "parameters": {},
    "metadata": {
      "intentId": "5152b39b-f601-417f-8a54-531c90e9c4c4",
      "inputContexts": [],
      "outputContexts": [],
      "contexts": [],
      "intentName": "hello"
    },
    "score": 1
  },
  "status": {
    "code": 200,
    "errorType": "success"
  },
  "asr": {
    "hi": 0.9539373,
    "hey": 0.9490655
  }
}
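The useful bits of that payload are `result.action` and `result.speech`. A sketch of how a client might route on the action — the `handlers` map and `dispatch` helper here are my own, not part of api.ai's SDK:

```javascript
// Sketch: routing on an api.ai /query response. `response` is a trimmed
// version of the JSON above; `handlers` and `dispatch` are hypothetical.
const response = {
  result: {
    resolvedQuery: 'hi',
    speech: 'hi there',
    action: 'human_greeting',
    score: 1
  },
  status: { code: 200, errorType: 'success' }
};

const handlers = {
  // e.g. speak the canned response back to the user
  human_greeting: (result) => result.speech
};

function dispatch(response) {
  if (response.status.code !== 200) {
    throw new Error('api.ai error: ' + response.status.errorType);
  }
  const handler = handlers[response.result.action];
  return handler ? handler(response.result) : null;
}
```

With the response above, `dispatch(response)` returns `'hi there'`; unknown actions fall through to `null` so the caller can pick a fallback.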

This was an all around great experience. A+

Watson Developer Cloud

IBM has a package on npm called watson-developer-cloud which has functionality for speech-to-text and text-to-speech, among other things.

The trial account is free for 30 days with no credit card required.

The signup process has noticeably more friction and requires a name, phone number, email, and a security question and answer. I had to check my email to verify it was valid and then log in with the credentials I had just created. At the login screen it asked for my IBM ID... I guessed it was my email address, and that worked. :confused:

When adding a service I read that the available languages included English (US), English (UK), Japanese, Arabic (MSA, Broadband model only), Mandarin, Portuguese (Brazil) and Spanish.

After I created a new speech-to-text service, I got my API credentials. :tada:

Their example code for speech-to-text is:

var watson = require('watson-developer-cloud');
var fs = require('fs');

var speech_to_text = watson.speech_to_text({
  username: '<username>',
  password: '<password>',
  version: 'v1'
});

var params = {
  // From file 
  audio: fs.createReadStream('./resources/speech.wav'),
  content_type: 'audio/l16; rate=44100'
};

speech_to_text.recognize(params, function(err, res) {
  if (err)
    console.log(err);
  else
    console.log(JSON.stringify(res, null, 2));
});

I didn't have a wav file, so I researched capturing an audio stream using Node.js. Although I'm confident I could get it working, it would introduce other complexities just to run a simple test, so I opted to create a wav file manually.

The wav file and code for my test can be found here:

My first test returned this JSON from Watson:

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.355,
          "transcript": "mt cn "
        }
      ],
      "final": true
    }
  ],
  "result_index": 0
}

Obviously "mt cn " isn't what I said. :confused: (One likely culprit: the example's content_type is audio/l16, which is headerless raw PCM, so Watson would decode the wav file's header bytes as audio; audio/wav is the right content type for a .wav file.)
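Still, the response shape is easy to work with. A small helper that picks the highest-confidence transcript out of the final results — plain JSON handling, nothing Watson-specific:

```javascript
// Sketch: extract the best transcript from a Watson speech-to-text response.
// `watsonResponse` mirrors the JSON above.
const watsonResponse = {
  results: [
    { alternatives: [{ confidence: 0.355, transcript: 'mt cn ' }], final: true }
  ],
  result_index: 0
};

function bestTranscript(response) {
  // collect alternatives from all finalized result segments
  const alternatives = response.results
    .filter((r) => r.final)
    .reduce((all, r) => all.concat(r.alternatives), []);
  if (alternatives.length === 0) return null;
  // highest-confidence alternative wins
  return alternatives.reduce((a, b) => (a.confidence >= b.confidence ? a : b));
}
```

Here `bestTranscript(watsonResponse)` returns the lone alternative with its 0.355 confidence, which is low enough that a real client would probably re-prompt the user.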

Alexa Voice Service

Alexa has a lot of reading to get through before you can start. You'll also need an Amazon developer account, which I already had.

I set out to follow along with the article Amazon put out a couple weeks ago: Project: Raspberry Pi + Alexa Voice Service. This article takes us through everything, starting with a brand new RPi install.

Only after the base RPi setup and installing the JDK and Apache Maven do we finally get to Getting Started with Alexa Voice Service.

Some annoying things about getting this working:

- I found out that device makers and/or mobile app makers who want to embed AVS into their technology must support a "Login with Amazon" flow, where end users are redirected to Amazon to log in and authorize the custom app/device to use AVS on their behalf. :flushed:

This would be way better if:

Wit.ai

Wit's website is great. Much like api.ai, the experience is smooth and it's easy to get going. Wit also offers sign-in with GitHub, and there's a nice demo you can play with in your browser: https://labs.wit.ai/demo/

After signing up, I was surprised to get a message about my browser not being supported. I guess they only support WebKit-based browsers. :confused:

I dusted off my copy of Chrome and signed in again.

The interface is nice and, like api.ai, you're guided through the process. The docs are solid, and they publish an npm module for Node.js developers as well as an HTTP API. Using the web interface I created a new "story" that I'll use for a greeting.

Included in the node-wit module are examples we can run. Using my new API key I ran the example and got an interactive prompt.

$ node examples/weather.js <apiKey>
> what's the weather
Executing merge action
Executing say with message: Wazaaa, weather!
Wazaaa, weather!
> Hi my name is Reza.
Executing merge action
Executing say with message: Wazaaa, Reza!
Wazaaa, Reza!
> Hola my name is Rezilla.
Executing merge action
Executing say with message: Wazaaa, Rezilla!
Wazaaa, Rezilla!
> goodbye
Executing merge action
Executing error action
Oops, I don't know what to do.

They also support sending a wav file via HTTP and getting the intent back. I set up a Wit story so that when the user says "Hello Watson", Wit replies with "No, that's my cousin." Using the wav file from my Watson experiment, I posted it to the service:

$ curl -XPOST 'https://api.wit.ai/speech?v=20141022' -i -L \
       -H "Authorization: Bearer <apiKey>" \
       -H "Content-Type: audio/wav" \
       --data-binary /home/jedireza/projects/hello-watson/hello-watson.wav
HTTP/1.1 200 OK
Server: nginx/1.8.0
Date: Wed, 13 Apr 2016 00:43:04 GMT
Content-Type: application/json
Content-Length: 91
Connection: keep-alive

{
  "msg_id" : "b586ffdd-e92d-433b-907c-2ff623776553",
  "_text" : "",
  "outcomes" : [ ]
}

I got a successful response but it didn't have the expected reply. Maybe the wav sampling wasn't compatible? I'm not sure; I didn't get an error. (Worth noting: curl's --data-binary needs an @ prefix, as in --data-binary @/path/to/hello-watson.wav, to send a file's contents; without it, curl posts the literal path string, which would also explain the empty "_text".)
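Whatever the cause, an empty `outcomes` array is a state worth handling explicitly rather than treating HTTP 200 as success. A small helper — plain JSON handling, and the `minConfidence` threshold is my own, not a Wit parameter:

```javascript
// Sketch: guard against an "empty" Wit response like the one above.
function topOutcome(response, minConfidence) {
  if (!response.outcomes || response.outcomes.length === 0) return null;
  // pick the highest-confidence outcome Wit returned
  const best = response.outcomes
    .reduce((a, b) => (a.confidence >= b.confidence ? a : b));
  return best.confidence >= minConfidence ? best : null;
}
```

With the response above, `topOutcome(response, 0.5)` returns `null`, signaling the app to fall back to a re-prompt instead of acting on nothing.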

Knowing that Facebook bought this company, one does worry about the lifetime of the service.

I just noticed a new post on HN about Wit.ai launching "Bot Engine beta". There was an interesting comment:

Man, i really love wit.ai, one of the coolest projects i've worked with. Unfortunately, these days i'm becoming jaded to X as a Service. Things like hosting or databases as a service are quantifiable, i have an idea of how much effort it takes me to migrate away... but AI? Especially the cool AI flavored NLP that wit.ai offers - it's just too hard to migrate away from for me.

With that said, i understand how hard doing this in a home baked way could be. I think i just won't be happy until we have repositories of standardized ai training sets or baked results (forgive any pseudo terms). It just feels like these days, using awesome AI services means cementing yourself into the service, and making their service stronger as you increase their datasets and training.

As much as i really do love wit.ai, i just don't want to use these types of services unless my backs against a wall.

Source: https://news.ycombinator.com/item?id=11483861