watson-developer-cloud / node-sdk

:comet: Node.js library to access IBM Watson services.
https://www.npmjs.com/package/ibm-watson

[speech-to-text] speaker labels have incorrect `result_index` #442

Closed: colinskow closed this issue 7 years ago

colinskow commented 7 years ago

When using continuous recognition via `speech_to_text.createRecognizeStream`, speaker labels show up under the wrong `result_index`.

Expected behavior: speaker labels should show up in the same object where `results[0].final = true`, or at least have the same `result_index` so they can be correlated with the correct word alternatives.

Actual behavior: speaker labels mostly show up where `results[0].final = false`, with a `result_index` one greater than the correct value. This makes it difficult to correlate the speaker labels with the other properties.

These are the settings used:

```js
const params = {
  content_type: 'audio/ogg;codecs=opus',
  model: 'en-US_BroadbandModel',
  continuous: true,
  interim_results: true,
  timestamps: true,
  profanity_filter: false,
  word_confidence: true,
  word_alternatives_threshold: 0.05,
  speaker_labels: true
};
```
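For reference, here is roughly how I'm consuming the stream and observing the mismatch. This is an illustrative sketch: it assumes `speech_to_text` is an authenticated SpeechToTextV1 instance and that the raw result messages are exposed via the stream's `results` event (the message shape follows the service's WebSocket responses).

```js
// Illustrative sketch only: speaker_labels arrive in messages whose
// result_index is one higher than the final results they belong to.
const fs = require('fs');

const recognizeStream = speech_to_text.createRecognizeStream(params);
fs.createReadStream('some/audio/file.ogg').pipe(recognizeStream);

recognizeStream.on('results', message => {
  console.log(
    'result_index:', message.result_index,
    'final:', Boolean(message.results && message.results[0] && message.results[0].final),
    'has speaker_labels:', Boolean(message.speaker_labels)
  );
});
```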
nfriedly commented 7 years ago

Hi @colinskow

Unfortunately, that's not the way the API works. You have to use the word timestamps (the `timestamps` option you already have enabled) to correlate the `speaker_labels` entries with the text.
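For example, something like this (a rough, untested sketch; it assumes the documented response shape, where each entry in `alternatives[0].timestamps` is `[word, start, end]` and each `speaker_labels` entry has `from`, `to`, and `speaker`):

```js
// Rough sketch: attach a speaker to each word by matching the word's
// start/end timestamps against the speaker_labels entries.
function labelWords(message) {
  const speakers = {};
  (message.speaker_labels || []).forEach(label => {
    speakers[label.from + '-' + label.to] = label.speaker;
  });
  const labeled = [];
  (message.results || []).forEach(result => {
    (result.alternatives[0].timestamps || []).forEach(([word, start, end]) => {
      labeled.push({ word, start, end, speaker: speakers[start + '-' + end] });
    });
  });
  return labeled;
}
```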

The browser JS SDK has some helper code that does this correlation for you. It won't currently work with the RecognizeStream here in the Node.js SDK, but I think you can use the RecognizeStream from the browser SDK in Node.js; you'll just need to encode the credentials header yourself:

```js
const fs = require('fs');
const { RecognizeStream, SpeakerStream } = require('watson-speech/speech-to-text');

const STT_USERNAME = '...';
const STT_PASSWORD = '...';

fs.createReadStream('some/audio/file.wav')
  .pipe(new RecognizeStream({
    // the browser SDK doesn't read credentials for you,
    // so build the Basic auth header manually
    headers: {
      authorization: 'Basic ' + Buffer.from(STT_USERNAME + ':' + STT_PASSWORD).toString('base64')
    },
    objectMode: true
  }))
  .pipe(new SpeakerStream())
  .on('data', data => {
    console.log('data', data);
  });
```

Then it should break down the results by speaker and put a speaker field on each result. I haven't tried it, but I think it will work in Node.js. You can see how the browser SDK uses those two together here.

Note that each data object will include multiple results instead of the typical single result. If you don't enable interim_results there will be only a single data event; if you do, then only the last data object will have final results, because the speaker labels can change at any time before the final labels are emitted. (Text may jump from one result to another until then.)
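So a data handler might look something like this (untested; it assumes the `speaker` field that SpeakerStream adds to each result, as described above):

```js
// Hypothetical handler for the 'data' events above: print one line per
// final result, prefixed with the speaker SpeakerStream assigned to it.
function printBySpeaker(data) {
  (data.results || []).forEach(result => {
    if (!result.final) return; // interim text may still move between speakers
    console.log('Speaker ' + result.speaker + ': ' + result.alternatives[0].transcript.trim());
  });
}
```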

(I have plans to share code between the two libs, but not sure when I'll be able to get to it...)