w3c / mediacapture-output

API to manage the rendering of audio on any audio output device
https://w3c.github.io/mediacapture-output/

Directing Web Speech API audio to a specific output device? #102

Closed: josephrocca closed this issue 3 years ago

josephrocca commented 4 years ago

Hello! Have there been any discussions around giving developers the ability to direct speech generated via the Web Speech API SpeechSynthesis interface to a specific audio output? I've not been able to find any, and it seems like a fairly important feature.

jan-ivar commented 4 years ago

I've criticized the current Web Speech API for being too tightly coupled to microphone input and default speaker output.

I suggest the Web Speech WG work to plug into existing audio sources and sinks in the platform through MediaStreamTrack (there's a precedent in web audio).

Output selection would then fall out for free. E.g.

audioElement.srcObject = speechSynthesis.createMediaStreamDestination(); // proposed API, not currently specified
// selectAudioOutput() resolves with a MediaDeviceInfo; setSinkId() takes its deviceId
audioElement.setSinkId((await navigator.mediaDevices.selectAudioOutput({deviceId})).deviceId);

speechSynthesis.speak(new SpeechSynthesisUtterance("Hello world!"));

guest271314 commented 3 years ago

@jan-ivar

Neither this specification nor Media Capture and Streams define capture of devices other than microphone input.

The suggested code is currently not possible in Chromium. Chromium refuses to support listing or capture of monitor devices on Linux: https://bugs.chromium.org/p/chromium/issues/detail?id=931749. I have filed multiple specification and implementation issues to support what this issue requests; in brief, see https://github.com/w3c/mediacapture-main/issues/720.

To capture the output of speechSynthesis.speak() in Firefox on Linux, you can filter for the monitor device.

To capture the output of speechSynthesis.speak() in Chromium, workarounds must be used; see https://github.com/guest271314/captureSystemAudio.
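
For reference, a minimal sketch of the Firefox-on-Linux approach, assuming PulseAudio monitor devices are exposed through enumerateDevices() with labels beginning "Monitor of" (the label prefix is an assumption based on typical PulseAudio naming, and labels are only populated after a prior getUserMedia() permission grant):

// run after microphone permission has been granted so device labels are populated
const devices = await navigator.mediaDevices.enumerateDevices();
const monitor = devices.find(
  ({kind, label}) => kind === 'audioinput' && label.startsWith('Monitor of')
);
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {deviceId: {exact: monitor.deviceId}}
});
// the stream now carries all system audio output, including speechSynthesis.speak()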

guest271314 commented 3 years ago

Hello! Have there been any discussions around giving developers the ability to direct speech generated via the Web Speech API SpeechSynthesis interface to a specific audio output? I've not been able to find any, and it seems like a fairly important feature.

The Web Speech API does not define any speech synthesis algorithms, and neither Chromium nor Firefox ships with a speech synthesis engine.

The Web Speech API establishes a socket connection to Speech Dispatcher (speechd): https://github.com/brailcom/speechd.

The Web Speech API does not currently specify any means to capture the audio output of speechSynthesis.speak().

Since the Web Speech API simply communicates with a locally installed speech synthesis engine, one approach is to not use the Web Speech API at all. Instead, install one or more speech synthesis engines locally and communicate with the engine directly. For example, the output of espeak-ng (https://github.com/espeak-ng/espeak-ng) is 1-channel WAV: the STDOUT (raw binary data) from $ espeak-ng --stdout 'test' can be passed as a message to any origin, parsed to a Float32Array, and set as outputs in AudioWorkletProcessor.process(), where a MediaStream can be obtained for output using MediaStreamAudioDestinationNode. One working version that uses Native Messaging with espeak-ng to capture speech synthesis output is https://github.com/guest271314/native-messaging-espeak-ng; I will update that repository to the version described above.
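
A minimal sketch of the extension side of that approach, from a Chromium extension page. The host name native_messaging_espeakng and the base64-encoded wav field are illustrative assumptions, not necessarily the actual protocol of that repository; int16ToFloat32 is the conversion helper shown in the examples below.

const port = chrome.runtime.connectNative('native_messaging_espeakng'); // hypothetical host name
port.onMessage.addListener((message) => {
  // message.wav is assumed to be the base64-encoded WAV written by `espeak-ng --stdout`
  const bytes = Uint8Array.from(atob(message.wav), (c) => c.charCodeAt(0));
  const pcm = new Uint16Array(bytes.buffer.slice(44)); // skip the 44-byte WAV header
  const floats = int16ToFloat32(pcm);
  // floats can now be written to AudioWorkletProcessor outputs or an AudioBuffer
});
port.postMessage({ speak: 'test', options: '-v Storm' });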

guest271314 commented 3 years ago

@josephrocca There is no simple way to get the direct output of a speech synthesis engine other than calling the engine directly and processing the raw audio output. Technically, a socket connection can be established to speech-dispatcher.

No specification, including Media Capture and Streams, Audio Output Devices API, Web Audio API, or Web Speech API (see "MediaStream, ArrayBuffer, Blob audio result from speak() for recording?", https://github.com/WebAudio/web-audio-api-v2/issues/10#issuecomment-682259080), defines a means to access or capture speech synthesis engine output directly.

$ speech-dispatcher -h
Speech Dispatcher -- Common interface for Speech Synthesis (GNU GPL)

Usage: speech-dispatcher [-{d|s}] [-l {1|2|3|4|5}] [-c com_method] [-S socket_path] [-p port] [-t timeout] | [-v] | [-h]

Options: 
  -d, --run-daemon      Run as a daemon
  -s, --run-single      Run as single application
  -a, --spawn           Start only if autospawn is not disabled
  -l, --log-level       Set log level (between 1 and 5)
  -L, --log-dir         Set path to logging
  -c, --communication-method
                        Communication method to use ('unix_socket'
                        or 'inet_socket')
  -S, --socket-path     Socket path to use for 'unix_socket' method
                        (filesystem path or 'default')
  -p, --port            Specify a port number for 'inet_socket' method
  -t, --timeout         Set time in seconds for the server to wait before it
                        shuts down, if it has no clients connected
  -P, --pid-file        Set path to pid file
  -C, --config-dir      Set path to configuration
  -m, --module-dir      Set path to modules
  -v, --version         Report version of this program
  -D, --debug           Output debugging information into $TMPDIR/speechd-debug
                        if TMPDIR is exported, otherwise to /tmp/speechd-debug
  -h, --help            Print this info

Please report bugs to speechd-discuss@nongnu.org

Aside from more elaborate solutions that involve growing WebAssembly.Memory (https://github.com/WebAudio/web-audio-api-v2/issues/97) and streaming a monitor device from Nightly to Chromium (https://gist.github.com/guest271314/04a539c00926e15905b86d05138c113c), one solution is to use a local server. There are then ways to get the MediaStreamTrack from localhost to any origin. Note that capturing a monitor device captures all system audio output, not only the output of the speech-dispatcher speech synthesis module.

You can use any language for the server. Here we use PHP with the espeak-ng speech synthesis engine.

speak.php

<?php
  if (isset($_POST["speak"])) {
    // allow the page that issues the fetch() below; adjust the origin as needed
    header("Access-Control-Allow-Origin: http://localhost:8000");
    header("Content-Type: application/octet-stream");
    $input = urldecode($_POST["speak"]);
    $options = urldecode($_POST["options"]);
    // write the raw WAV that espeak-ng emits on STDOUT directly to the response
    // (input is passed to the shell unescaped here for brevity)
    passthru("espeak-ng --stdout " . $options . " '" . $input . "'");
    exit();
  }

Using MediaStreamAudioSourceNode

// https://stackoverflow.com/a/35248852
function int16ToFloat32(inputArray) {
    const output = new Float32Array(inputArray.length);
    for (let i = 0; i < output.length; i++) {
        const int = inputArray[i];
        // If the high bit is on, then it is a negative number, and actually counts backwards.
        const float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
        output[i] = float;
    }
    return output;
}
var fd = new FormData();
fd.append('options', '-v Storm');
fd.append('speak', `Now watch. Um, this how science works.
One researcher comes up with a result.
And that is not the truth. No, no.
A scientific emergent truth is not the
result of one experiment. What has to 
happen is somebody else has to verify
it. Preferably a competitor. Preferably
someone who doesnt want you to be correct.

- Neil deGrasse Tyson, May 3, 2017 at 92nd Street Y`);
fetch('', {method:'post', body:fd})
.then(r => r.arrayBuffer())
.then(async arrayBuffer => {
  const uint16 = new Uint16Array(arrayBuffer.slice(44)); // skip the 44-byte WAV header
  const floats = int16ToFloat32(uint16);
  const ac = new AudioContext({sampleRate: 22050});
  const buffer = new AudioBuffer({
                           numberOfChannels: 1, 
                           length: floats.length, // length is in sample frames, not bytes
                           sampleRate: ac.sampleRate
  });
  console.log(floats);
  buffer.getChannelData(0).set(floats);
  const absn = new AudioBufferSourceNode(ac, {buffer});
  // route through MediaStreamAudioDestinationNode to obtain a MediaStream
  // (and MediaStreamTrack) rather than connecting straight to ac.destination
  const msd = new MediaStreamAudioDestinationNode(ac);
  const {stream: mediaStream} = msd;
  const source = new MediaStreamAudioSourceNode(ac, {mediaStream});
  absn.connect(msd);
  absn.start();
  source.connect(ac.destination);
});
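
Since the result above is an ordinary MediaStream, the output device can then be selected with the API this repository specifies. A minimal sketch, to be placed inside the async callback above in place of source.connect(ac.destination); it assumes selectAudioOutput() is available, which may require a user gesture.

const audio = new Audio();
audio.srcObject = mediaStream;                                       // stream from MediaStreamAudioDestinationNode
const {deviceId} = await navigator.mediaDevices.selectAudioOutput(); // prompts the user to pick a sink
await audio.setSinkId(deviceId);                                     // route playback to the chosen device
await audio.play();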

Using AudioWorkletNode, with a single Float32Array passed in at construction. We could instead write the streamed STDOUT into a single ArrayBuffer or SharedArrayBuffer using Response.body.getReader() and read from that memory in process(); see the sketch at the end of this comment.

// https://stackoverflow.com/a/35248852
function int16ToFloat32(inputArray) {
    const output = new Float32Array(inputArray.length);
    for (let i = 0; i < output.length; i++) {
        const int = inputArray[i];
        // If the high bit is on, then it is a negative number, and actually counts backwards.
        const float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
        output[i] = float;
    }
    return output;
}
var fd = new FormData();
fd.append('options', '-v Storm');
fd.append('speak', `Now watch. Um, this how science works.
One researcher comes up with a result.
And that is not the truth. No, no.
A scientific emergent truth is not the
result of one experiment. What has to 
happen is somebody else has to verify
it. Preferably a competitor. Preferably
someone who doesnt want you to be correct.

- Neil deGrasse Tyson, May 3, 2017 at 92nd Street Y`);
fetch('', {method:'post', body:fd})
.then(r => r.arrayBuffer())
.then(async arrayBuffer => {
  const uint16 = new Uint16Array(arrayBuffer.slice(44)); // skip the 44-byte WAV header
  const floats = int16ToFloat32(uint16);
  const ac = new AudioContext({sampleRate: 22050});
  console.log(ac.state);
  // stub so the class below can be defined in this (window) scope, then
  // stringified into the AudioWorkletGlobalScope, where the real
  // AudioWorkletProcessor is defined
  class AudioWorkletProcessor {}
  class SpeechSynthesisStream extends AudioWorkletProcessor {
    constructor(options) {
      super(options);
      Object.assign(this, options.processorOptions);
      globalThis.console.log(this.floats);
      this.port.postMessage({start: this.start = !this.start});
    }
    endOfStream() {
      this.port.postMessage({
        ended: true,
        currentTime,
        currentFrame,
        readOffset: this.readOffset,
      });
    }
    process(inputs, outputs) {
      const [channel] = outputs.flat();
      if (this.readOffset >= this.floats.length) {
        console.log(this);
        this.endOfStream();
        return false;
      }
      const data = Float32Array.from({length: 128}, _ => {
        const index = this.readOffset;
        if (index >= this.floats.length) return 0;
        return this.floats[this.readOffset++];
      });
      channel.set(data);
      return true;
    }
  }
  // returns source text that registers the processor in the AudioWorkletGlobalScope
  function registerProcessor(name, processorCtor) {
    return `${processorCtor};\nregisterProcessor('${name}', ${processorCtor.name});`;
  }
  const worklet = URL.createObjectURL(
    new Blob(
      [
        registerProcessor(
          'speech-synthesis-stream',
          SpeechSynthesisStream
        ),
      ],
      { type: 'text/javascript' }
    )
  );
  ac.onstatechange = e => console.log(ac.state);

  await ac.audioWorklet.addModule(worklet);
  const aw = new AudioWorkletNode(
    ac,
    'speech-synthesis-stream',
    {
      numberOfInputs: 1,
      numberOfOutputs: 1,
      channelCount: 1,
      processorOptions: {
        readOffset: 0,
        ended: false,
        start: false,
        floats
      },
    }
  );

  aw.onprocessorerror = e => {
    console.error(e);
    console.trace();
  };
  const msd = new MediaStreamAudioDestinationNode(ac);
  const { stream } = msd;
  const [track] = stream.getAudioTracks();
  aw.connect(msd);
  aw.connect(ac.destination);
  // const recorder = new MediaRecorder(stream);
  // recorder.ondataavailable = e => console.log(URL.createObjectURL(e.data));
  if (ac.state === 'running') {
    await ac.suspend();
  }
  aw.port.onmessage = async e => {
    console.log(e.data, ac.state);
    if (e.data.start && ac.state === 'suspended') {
      await ac.resume();
      // recorder.start();
    } else {
      // if (recorder.state === 'recording') {
      // recorder.stop();
      track.stop();
      aw.disconnect();
      msd.disconnect();
      await ac.close();
      console.log(track);
    }
  };
});
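
As mentioned above, a sketch of the streaming variant, assuming the page is cross-origin isolated so SharedArrayBuffer is available. The buffer size and the written counter are illustrative; the processor would receive both SharedArrayBuffers via processorOptions and convert int16 PCM to float32 as it reads, up to the published write position.

const sab = new SharedArrayBuffer(1024 * 1024 * 16);      // fixed capacity, chosen arbitrarily
const bytes = new Uint8Array(sab);
const written = new Int32Array(new SharedArrayBuffer(4)); // [0] = bytes written so far
// pass both buffers to the AudioWorkletNode via processorOptions before fetching
const response = await fetch('', {method: 'post', body: fd});
const reader = response.body.getReader();
let offset = 0;
let headerSkipped = false;
for (;;) {
  const {value, done} = await reader.read();
  if (done) break;
  // drop the 44-byte WAV header from the first chunk (assumes the first chunk is at least 44 bytes)
  const chunk = headerSkipped ? value : value.subarray(44);
  headerSkipped = true;
  bytes.set(chunk, offset);
  offset += chunk.length;
  Atomics.store(written, 0, offset); // publish the new write position to process()
}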