Clarify that getUserMedia({audio:{deviceId:{exact:<audiooutput_device>}}}) in this specification mandates the capability to capture an audio output device - not exclusively a microphone input device

w3c / mediacapture-main

Media Capture and Streams specification (aka getUserMedia)

https://w3c.github.io/mediacapture-main/

Issue #650

Closed guest271314 closed 4 years ago

guest271314 commented 4 years ago

https://github.com/w3c/mediacapture-main/pull/211 added output device capability to enumerateDevices() while concerns were raised about the definition of "output device", or the omission thereof, in the specification, e.g.,

https://github.com/w3c/mediacapture-main/pull/211#issuecomment-130468068

It seems to me we've forgotten to define output devices. Relying on their similarity to input devices, is the weak link in this reasoning imho.

Currently the term "audiooutput" occurs twice in the specification, where the language appears to be a brief description of the term, not explicitly a definition of the term

MediaDeviceKind Enumeration description audiooutput | Represents an audio output device; for example a pair of headphones.

A pair of headphones could not reasonably be construed as a microphone.

However, although "audiooutput" and the brief description appear in the text of the specification, at least one implementer has interpreted the language as not mandating capture of audio output: https://bugs.chromium.org/p/chromium/issues/detail?id=1013881#c9

The getUserMedia() spec does not mandate capturing audio output or showing a UI prompt as part of the device selection procedure.

As one concrete use case where the definition of "audiooutput", the device list from enumerateDevices(), and whether or not the specification mandates capture of audio output devices all matter, and where the lack of clarity is observable, consider the code

    (async () => {
      navigator.mediaDevices.ondevicechange = e => console.log(e);
      // Pick the device to capture: the "Monitor of ..." entry in Firefox,
      // or a non-default "audiooutput" entry in Chromium.
      const devices = await navigator.mediaDevices.enumerateDevices();
      const device = devices.find(({ kind, label, groupId }) =>
        label === "Monitor of Built-in Audio Analog Stereo" // Firefox
        || (kind === "audiooutput" && groupId !== "default") // Chromium
      );
      // Fail with a clear message instead of reading .deviceId of undefined.
      if (!device) throw new Error("No matching device found");
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: { deviceId: { exact: device.deviceId } }
      });
      const [audioTrack] = stream.getAudioTracks();
      audioTrack.onmute = audioTrack.onended = e => console.log(e);
      const text = [...Array(10).keys()].join(" ");
      let started = false;
      const handleVoicesChanged = () => {
        // voiceschanged can fire more than once; record only one utterance.
        if (started) return;
        started = true;
        const voice = speechSynthesis.getVoices()
          .find(({ name }) => name.includes("English"));
        const utterance = new SpeechSynthesisUtterance(text);
        utterance.voice = voice;
        utterance.pitch = 0.33;
        utterance.rate = 0.1;
        const recorder = new MediaRecorder(stream);
        // Play back the recording once the recorder delivers its data.
        recorder.ondataavailable = ({ data }) => {
          new Audio(URL.createObjectURL(data)).play();
        };
        // Stop recording and release the track when speech finishes.
        utterance.onend = () => {
          if (recorder.state === "recording") recorder.stop();
          audioTrack.stop();
        };
        recorder.start();
        speechSynthesis.speak(utterance);
      };
      speechSynthesis.onvoiceschanged = handleVoicesChanged;
      // Voices may already be loaded, in which case voiceschanged never fires.
      if (speechSynthesis.getVoices().length) handleVoicesChanged();
    })().catch(console.error);

which is intended to select only "audiooutput", not "Microphone".

Firefox 70 and Nightly 73 output the expected result: only audio output is captured and recorded, not microphone input.

Chromium 80 does not output the expected result. Even where "audiooutput" is selected, the microphone is captured and recorded, not "audiooutput". That is a Chromium bug that is marked WontFix (https://bugs.chromium.org/p/chromium/issues/detail?id=1013881), apparently due to lack of clarity in this specification regarding the capability to select audio output - not only the microphone.

Contrary to the suggestion at https://github.com/w3c/mediacapture-main/issues/629#issuecomment-545844012, getDisplayMedia(), after testing various approaches, does not provide any means to capture audio output from the system.

Kindly make it clear in this specification that 1) capture of audio output is under the umbrella of this specification and provide an example of the canonical code pattern to achieve that use case per this specification; 2) the user can select "Monitor of <audio_device>" at UI prompt and directly in code by use of applyConstraints() and directly at getUserMedia(<constraints>); or 3) this specification is not intended to be construed to capture only audio output.

alvestrand commented 4 years ago

My opinion:

The same thing should happen as if you do getUserMedia({audio: {deviceId: {exact: videodevice}}}) - that is, no device should be found.

If the system has the capability of capturing outgoing audio, that should be represented as a separate input device.

The only place in the specification set where capturing system audio output is referenced is in GetDisplayMedia.
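The behaviour described above (an exact deviceId of the wrong kind finds no device) can be illustrated with a minimal sketch. The helper name `resolveExactAudioInput` and the sample device list are hypothetical and only model the matching step; they are not taken from the specification or from any browser's code.

```javascript
// Hypothetical helper: models resolving {audio: {deviceId: {exact: id}}}
// when only devices of kind "audioinput" are candidates for audio capture.
function resolveExactAudioInput(devices, exactDeviceId) {
  const match = devices.find(
    ({ kind, deviceId }) => kind === "audioinput" && deviceId === exactDeviceId
  );
  // null models "no device should be found".
  return match || null;
}

// Made-up device list in the shape returned by enumerateDevices().
const sampleDevices = [
  { kind: "audioinput", deviceId: "mic-1", groupId: "g1", label: "Microphone" },
  { kind: "audiooutput", deviceId: "out-1", groupId: "g1", label: "Headphones" },
  { kind: "videoinput", deviceId: "cam-1", groupId: "g2", label: "Webcam" },
];

console.log(resolveExactAudioInput(sampleDevices, "mic-1")); // the microphone entry
console.log(resolveExactAudioInput(sampleDevices, "out-1")); // null
console.log(resolveExactAudioInput(sampleDevices, "cam-1")); // null
```

Under this model, passing the deviceId of an "audiooutput" or "videoinput" entry as an exact audio constraint matches nothing, which in a browser would be expected to surface as a rejected getUserMedia() promise rather than as a captured microphone.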

guest271314 commented 4 years ago

@alvestrand

The only place in the specification set where capturing system audio output is referenced is in GetDisplayMedia.

Are you actually saying that

MediaDeviceKind Enumeration description audiooutput | Represents an audio output device; for example a pair of headphones.

does not mean capturing audio output on its face?

How else is "audiooutput" and "Represents an audio output device; for example a pair of headphones" intended to be interpreted other than the capability to capture "audio output device" - "for example a pair of headphones"?

Are you conveying that, in your opinion, "audiooutput" and "headphones" is really intended to mean "Microphone" and audio input?

guidou commented 4 years ago

@alvestrand

The only place in the specification set where capturing system audio output is referenced is in GetDisplayMedia.

Are you actually saying that

MediaDeviceKind Enumeration description audiooutput | Represents an audio output device; for example a pair of headphones.

does not mean capturing audio output on its face?

Correct. It does not mean capturing audio output. It means playing audio, not capturing it. Devices of audiooutput kind can be used with the setSinkId method of media elements to set which audio device the element should use to play audio.

How else is "audiooutput" and "Represents an audio output device; for example a pair of headphones" intended to be interpreted other than the capability to capture "audio output device" - "for example a pair of headphones"?

See above.

Are you conveying that, in your opinion, "audiooutput" and "headphones" is really intended to mean "Microphone" and audio input?

No. It means audio output, not input.

Note also that input devices (audioinput and videoinput) returned by enumerateDevices are of type InputDeviceInfo, whereas output devices (audiooutput) are not.

I propose we close this issue since getUserMedia() is not intended to capture audio from devices marked as audiooutput by enumerateDevices() and I think there is no doubt about it.
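The setSinkId usage mentioned above can be sketched as follows. This is an illustration only: the function name `routePlaybackToOutput` is hypothetical, it simply picks the first "audiooutput" entry, and `mediaDevices` and `element` are passed in as parameters so the sketch can also run against mock objects. In a browser they would be `navigator.mediaDevices` and an audio or video element, and setSinkId() may be unavailable in some browsers.

```javascript
// Hypothetical helper: route playback of a media element to the first
// "audiooutput" device. Output devices are sinks; nothing is captured here.
async function routePlaybackToOutput(mediaDevices, element) {
  const devices = await mediaDevices.enumerateDevices();
  const output = devices.find(({ kind }) => kind === "audiooutput");
  if (!output) return null; // no output device exposed
  if (typeof element.setSinkId !== "function") return null; // unsupported
  await element.setSinkId(output.deviceId);
  return output.deviceId;
}

// Browser usage (assumes an <audio> element exists on the page):
// routePlaybackToOutput(navigator.mediaDevices, document.querySelector("audio"))
//   .then(id => console.log("Playing on", id));
```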

guest271314 commented 4 years ago

@guidou

Correct. It does not mean capturing audio output. It means playing audio, not capturing it.

Reads that way from perspective here.

Performs that way at Firefox, Nightly.

I propose we close this issue since getUserMedia() is not intended to capture audio from devices marked as audiooutput by enumerateDevices() and I think there is no doubt about it.

There is far more than doubt. It is already currently possible to capture audio output exclusively at Firefox and Nightly.

You, or someone else, needs to include that exact language in the specification, and remove "audiooutput" altogether if that is what you want, otherwise what you ask for is an absurd interpretation of "audiooutput" and headphones.

At which specification should the PR be filed to specifically add capture of audio output, after you include language which prohibits capturing audio output - as that is very much unclear right now?

guidou commented 4 years ago

@guidou

Correct. It does not mean capturing audio output. It means playing audio, not capturing it.

Reads that way from perspective here.

Performs that way at Firefox, Nightly.

Firefox does not even expose entries of kind audiooutput in enumerateDevices (tested with Firefox Nightly 73.0a1, 2019-12-09). The capture of "Monitor of ..." devices is via entries exposed as audioinput by enumerateDevices, as it should be.

I propose we close this issue since getUserMedia() is not intended to capture audio from devices marked as audiooutput by enumerateDevices() and I think there is no doubt about it.

There is far more than doubt. It is already currently possible to capture audio output exclusively at Firefox and Nightly.

I mean there is no doubt among browser implementers since no browser allows capturing from entries marked as audiooutput by enumerateDevices(). All browsers allow capturing audio from devices exposed as audioinput.

The getUserMedia() definition says "Prompts the user for permission to use their Web cam or other video or audio input", which makes it clear that it operates on input devices. Perhaps an extra clarification can be added there to indicate that "video or audio input" refers to devices marked as videoinput or audioinput in the results provided by enumerateDevices. The absence of that extra text so far has not led to any inconsistent behavior across various browsers. Feel free to send a PR for review with text along those lines, though.

You, or someone else, needs to include that exact language in the specification, and remove "audiooutput" altogether if that is what you want, otherwise what you ask for is an absurd interpretation of "audiooutput" and headphones.

At which specification should the PR be filed to specifically add capture of audio output, after you include language which prohibits capturing audio output - as that is very much unclear right now?

There is no need to exclude audiooutput from the spec and there is nothing absurd about it. They represent audio output devices, which are intended for audio playback, not capture. They can be used with setSinkId to select the device to be used by a media element to output audio, or, indirectly to select an associated input device in getUserMedia() via the groupId field.
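The indirect use of groupId described above can be sketched as a small lookup. The helper name `inputConstraintsForOutput` is hypothetical, and the sketch assumes the browser reports the same groupId for the input and output of one physical device (for example a headset), which is not guaranteed on every platform.

```javascript
// Hypothetical helper: given an "audiooutput" deviceId, build getUserMedia()
// constraints for an "audioinput" device sharing the same groupId.
function inputConstraintsForOutput(devices, outputDeviceId) {
  const output = devices.find(
    d => d.kind === "audiooutput" && d.deviceId === outputDeviceId
  );
  if (!output) return null; // unknown output device
  const input = devices.find(
    d => d.kind === "audioinput" && d.groupId === output.groupId
  );
  if (!input) return null; // no input associated with that output
  return { audio: { deviceId: { exact: input.deviceId } } };
}

// Browser usage (the deviceId "some-output-id" is a placeholder):
// const devices = await navigator.mediaDevices.enumerateDevices();
// const constraints = inputConstraintsForOutput(devices, "some-output-id");
// if (constraints) await navigator.mediaDevices.getUserMedia(constraints);
```

The resulting stream still comes from the input device (the headset microphone in this example), not from what the output device is playing.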

guest271314 commented 4 years ago

getDisplayMedia() does not capture audio at all at Chromium.

setSinkId() captures only microphone, not audio output even where the audio output device deviceId is selected.

Do not care if getDisplayMedia() or getUserMedia() needs to be used to capture audio output. If neither method is intended to capture "audiooutput" whatsoever that needs to be made clear in the specifications. If getDisplayMedia() in fact is capable of capturing audio output, e.g., to "headphones", as Media Capture and Streams specification currently implies by the language

audiooutput | Represents an audio output device; for example a pair of headphones.

though that language is actually intended to apply to getDisplayMedia(), a canonical example of capturing audio output only - not audio input from a microphone - needs to be included in the specification.

The use case is very simple: capture audio output of speechSynthesis.speak() which outputs audio directly from the system (https://github.com/guest271314/SpeechSynthesisRecorder; https://github.com/WICG/speech-api/issues/69).

Or, make it unequivocally clear that neither this parent specification nor any derivative specification is intended to capture audio output, so that alternative, non-standardized approaches, can be implemented.
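Whether getDisplayMedia() actually delivers audio can be checked at runtime by inspecting the returned stream. The following is a minimal sketch, not a canonical pattern from any specification: the function name `requestDisplayAudio` is hypothetical, `mediaDevices` is passed in so the sketch can run against a mock, and whether any audio track is returned depends on the browser, the platform and what the user chooses to share.

```javascript
// Hypothetical helper: ask for display capture with audio and report
// whether any audio track was actually delivered.
async function requestDisplayAudio(mediaDevices) {
  const stream = await mediaDevices.getDisplayMedia({ video: true, audio: true });
  const audioTracks = stream.getAudioTracks();
  if (audioTracks.length === 0) {
    // No audio was provided; release the capture and report that.
    stream.getTracks().forEach(track => track.stop());
    return null;
  }
  return audioTracks;
}

// Browser usage (must be triggered by a user gesture):
// const tracks = await requestDisplayAudio(navigator.mediaDevices);
// console.log(tracks ? "audio captured" : "no audio track delivered");
```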

guest271314 commented 4 years ago

@guidou We must have posted at the same time.

Firefox does not even expose entries of kind audiooutput in enumerateDevices (tested with Firefox Nightly 73.0a1, 2019-12-09). The capture of "Monitor of ..." devices is via entries exposed as audioinput by enumerateDevices, as it should be.

Again, do not care what technical jargon is used re input or output. The use case is capturing only audio output from the system, not capturing microphone input where audio might be playing in the background.

Firefox does provide a means to capture "Monitor of <device>". Whether you call that an input or output device is immaterial to the observable result: capturing audio output from the system. Firefox provides a means to do that simply by exposing "Monitor of <device>", which you appear to be against implementing at Chromium and Chrome for an unknown reason, and which would tentatively resolve this issue.

guest271314 commented 4 years ago

@guidou Kindly run the code at https://github.com/w3c/mediacapture-main/issues/650#issue-534574931 at Firefox, Nightly and Chromium.

Select "Monitor of <device>" at the prompt. Observe the different output. Firefox captures only audio output, the desired result. Chromium does not expose "Monitor of <device>" thus microphone is captured, not the audio output from speech-dispatcher calling espeak-ng via spd-say.

Is there any compelling reason to not expose "Monitor of <device>" at Chromium?

What solution do you suggest to achieve the requirement of the use case if neither getUserMedia() nor getDisplayMedia() are intended to accomplish the task at Chromium?

guidou commented 4 years ago

@guidou Kindly run the code at #650 (comment) at Firefox, Nightly and Chromium.

The script there is broken for Chromium because it assumes "audiooutput" devices can be captured by getUserMedia(), which is not the case in Chromium or any other browser. Note also that Firefox does not support "Monitor of " devices on all platforms. I just tried on Firefox for Mac and it only supports capturing from microphones. Haven't checked Windows.

Select "Monitor of <device>" at the prompt. Observe the different output. Firefox captures only audio output, the desired result. Chromium does not expose "Monitor of <device>" thus microphone is captured, not the audio output from speech-dispatcher calling espeak-ng via spd-say.

Is there any compelling reason to not expose "Monitor of <device>" at Chromium?

I wouldn't be able to state any reason why Chromium does not support any particular feature it does not currently support. Feel free to file a feature request for it at crbug.com, although I'm not aware of any plans to support this use case in Chromium.

What solution do you suggest to achieve the requirement of the use case if neither getUserMedia() nor getDisplayMedia() are intended to accomplish the task at Chromium?

At the moment, you cannot accomplish that task in Chromium directly. Perhaps you can find some tool that allows you to expose the audio of output devices as if they were microphones, similar to how virtual webcams work.

Note also that your use case is not mandated by this spec. Exposing the audio coming from an output device as an input device that can be captured via getUserMedia is a valid implementation choice (i.e., Firefox has implemented it on Linux), but it is not a requirement of this spec.

What you have here is a feature request for Chromium to expose "Monitor of ..." devices the way Firefox does so that they can be used by getUserMedia(). I think that's a valid feature request for Chromium, since it does not support it, but it does not need any adjustment to the spec. Also, I don't think the spec should be changed to mandate that audio output devices must be exposed as if they were audio input devices.
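Where a browser does expose such monitor devices as inputs, selecting one can be sketched as below. This relies on implementation-specific behaviour reported in this thread (Firefox on Linux listing "Monitor of ..." entries with kind "audioinput"); the helper name `monitorConstraints` and the label prefix parameter are hypothetical, labels are typically empty until the user has granted permission, and the label text may differ or be localized.

```javascript
// Hypothetical helper: find an "audioinput" whose label starts with the
// given prefix and build getUserMedia() constraints for it.
function monitorConstraints(devices, prefix = "Monitor of") {
  const monitor = devices.find(
    ({ kind, label }) => kind === "audioinput" && label.startsWith(prefix)
  );
  // null means this browser or platform exposes no such input device.
  return monitor ? { audio: { deviceId: { exact: monitor.deviceId } } } : null;
}

// Browser usage:
// const devices = await navigator.mediaDevices.enumerateDevices();
// const constraints = monitorConstraints(devices);
// if (constraints) {
//   const stream = await navigator.mediaDevices.getUserMedia(constraints);
// }
```

Because only "audioinput" entries are considered, the sketch falls back to null rather than to a microphone when no monitor device is listed.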

guest271314 commented 4 years ago

@guidou

Perhaps an extra clarification can be added there to indicate that "video or audio input" refers to devices marked as videoinput or audioinput in the results provided by enumerateDevices. The absence of that extra text so far has not led to any inconsistent behavior across various browsers. Feel free to send a PR for review with text along those lines, though.

Your acknowledgment that the current language is capable of more than one interpretation re "audiooutput" and "headphones" meaning capturing audio is part of this specification. Will not be filing that PR. From perspective here the language indicates capture of audio output is specified, at least it is not clear that that capability is not intended. The prerogative here is to use that language to capture audio. If you believe there is room to close that interpretation, then clarify the specification yourself, as that is what this issue is asking for. Would file a PR to make it clear the specification does include language already to indicate capturing audio output, not that it does not.

The script there is broken for Chromium because it assumes "audiooutput" devices can be captured by getUserMedia(), which is not the case in Chromium or any other browser

At *nix the code works as expected.

Have not used *indows in many years and have not used Mac at all.

Therefore, it is reasonable to conclude that *indows and Mac also provide such functionality. Evidently not. They should, per this specification, is the perspective here.

Feel free to file a feature request for it at crbug.com

Already did. You closed the issue https://bugs.chromium.org/p/chromium/issues/detail?id=1013881 as WontFix.

What you have here is a feature request for Chromium to expose "Monitor of ..." devices the way Firefox does so that they can be used by getUserMedia(). I think that's a valid feature request for Chromium, since it does not support it, but it does not need any adjustment to the spec

Kindly re-open the above linked Chromium bug and answer this question: Why should that functionality not be available to users? Disregard the specification or not, the functionality is what matters, implementers do whatever they want anyway, irrespective of any specification, whether by omission, deliberate indifference to any spec, or by way of their arbitrary, undocumented "experiments".

guidou commented 4 years ago

@guidou

Perhaps an extra clarification can be added there to indicate that "video or audio input" refers to devices marked as videoinput or audioinput in the results provided by enumerateDevices. The absence of that extra text so far has not led to any inconsistent behavior across various browsers. Feel free to send a PR for review with text along those lines, though.

Your acknowledgment that the current language is capable of more than one interpretation re "audiooutput" and "headphones" meaning capturing audio is part of this specification.

I have no idea how you can conclude that the current language allows interpreting "audiooutput" devices as audio input devices. No implementer does and this is the first time that I see this interpretation.

Will not be filing that PR. From perspective here the language indicates capture of audio output is specified, at least it is not clear that that capability is not intended.

Where is it said that capture of audio output devices is specified? The spec is pretty clear saying getUserMedia operates on input devices.

The prerogative here is to use that language to capture audio. If you believe there is room to close that interpretation, then clarify the specification yourself, as that is what this issue is asking for. Would file a PR to make it clear the specification does include language already to indicate capturing audio output, not that it does not.

I don't think any extra language is needed. I'm just saying that an extra clarification saying that audio input refers to devices marked as "audioinput" and video input refers to devices marked as "videoinput" would not be out of place, but it's pretty obvious that "audioinput" refers to audio input and "audiooutput" refers to audio output.

The script there is broken for Chromium because it assumes "audiooutput" devices can be captured by getUserMedia(), which is not the case in Chromium or any other browser

At *nix the code works as expected.

No, it doesn't, since it expects getUserMedia to capture from devices marked as "audiooutput". It works in Firefox for Linux due to implementation-specific characteristics such as how some exposed devices are named. If those strings were localized in Firefox, it might not work in all locales, for example (I don't know if those strings are localized or not).

Have not used *indows in many years and have not used Mac at all.

Therefore, it is reasonable to conclude that *indows and Mac also provide such functionality. Evidently not. They should, per this specification, is the perspective here.

Feel free to file a feature request for it at crbug.com

Already did. You closed the issue https://bugs.chromium.org/p/chromium/issues/detail?id=1013881 as WontFix.

That was filed as a bug, which it is not. Therefore it cannot be "fixed". Feel free to file a feature request.

What you have here is a feature request for Chromium to expose "Monitor of ..." devices the way Firefox does so that they can be used by getUserMedia(). I think that's a valid feature request for Chromium, since it does not support it, but it does not need any adjustment to the spec

Kindly re-open the above linked Chromium bug

File a new feature request entry at crbug.com.

and answer this question: Why should that functionality not be available to users?

I don't think anyone has an obligation to explain why something doesn't exist, in particular something probably no one had requested until now.

Disregard the specification or not, the functionality is what matters, implementers do whatever they want anyway, irrespective of any specification, whether by omission, deliberate indifference to any spec, or by way of their arbitrary, undocumented "experiments".

My experience has been that implementers try to implement the spec and I would say they have succeeded for the most part since, although not perfect, there is large degree of interoperability across browsers. Things are sure to fail when you expect things that are not in the spec to work, such as having getUserMedia() capture from devices marked as "audiooutput" or having implementation specific details not covered by the spec to be the same in all implementations.

guest271314 commented 4 years ago

@guidou

No, it doesn't since it expects getUserMedia to capture from devices marked as "audiooutput". It works in Firefox for Linux due to implementation-specific characteristics such as how some exposed devices are named.

Why is

MediaDeviceKind Enumeration description audiooutput | Represents an audio output device; for example a pair of headphones.

in the specification?

Why would a reader of the specification not reach the conclusion that it is possible to capture audio output per this specification where the plain language states that an audiooutput represents "Represents an audio output device; for example a pair of headphones." where no language in the specification prohibits such an interpretation?

That was filed as a bug, which it is not. Therefore it cannot be "fixed". Feel free to file a feature request.

Stating that in the comment before you closed the issue https://bugs.chromium.org/p/chromium/issues/detail?id=1013881#c8

This is a feature request to eliminate steps, not a bug.

do you not have the ability to change the "Type" from "Bug" to "Feature request"?

If not, how to make it clear that the issue is a feature request, not a bug?

Yes, implementers do try to meet the spec. They might also do whatever they want, irrespective of any specification, without providing any documentation why https://bugs.chromium.org/p/chromium/issues/detail?id=1018580#c67

What is the formal documented description and expected result of the "Finch experiment" that produced this bug?

Have no issue filing the feature request, again, if you cannot change the "Type" to feature request on the issue you closed.

Do not gather we will agree on interpretation of the specification re the meaning of "audiooutput" and your interpretation of "headphones" (output) meaning "Microphone" (input). You can resolve that inconsistency by updating the specification to make it clear that capturing an "audiooutput" device really means capturing an input device, not audio output from the system, in spite of what actually occurs at Firefox.

guidou commented 4 years ago

@guidou

No, it doesn't since it expects getUserMedia to capture from devices marked as "audiooutput". It works in Firefox for Linux due to implementation-specific characteristics such as how some exposed devices are named.

Why is

MediaDeviceKind Enumeration description audiooutput | Represents an audio output device; for example a pair of headphones.

in the specification?

Because knowing the audio output devices is useful for some use cases. For example, you may want audio to be rendered on a particular output device, or you may want getUserMedia() to select the input device (e.g., a microphone) associated with a particular output device (e.g., headphones).

Why would a reader of the specification not reach the conclusion that it is possible to capture audio output per this specification where the plain language states that an audiooutput represents "Represents an audio output device; for example a pair of headphones." where no language in the specification prohibits such an interpretation?

The actual question I have is why would a reader conclude that it would be possible for getUserMedia() to capture from output devices when its definition mentions only input devices. Even the implementations that allow capturing audio output (i.e., Firefox on Linux) do it by exposing audio output as input devices (i.e., "audioinput" kind in enumerateDevices).

That was filed as a bug, which it is not. Therefore it cannot be "fixed". Feel free to file a feature request.

Stating that in the comment before you closed the issue https://bugs.chromium.org/p/chromium/issues/detail?id=1013881#c8

This is a feature request to eliminate steps, not a bug.

do you not have the ability to change the "Type" from "Bug" to "Feature request"?

If not, how to make it clear that the issue is a feature request, not a bug?

Yes, it is possible to change the type from Bug to Feature request, but the description of crbug.com/1013881 is very different from what you want. That bug contains repro steps for a bug consisting in the permission prompt being broken because "Monitor of

My recommendation is that you file a new issue where you say that you would like Chromium to:

Expose "Monitor of audio output" devices as audio input devices in enumerateDevices().
Allow getUserMedia() to capture from them.
Make any necessary UI adjustments.

Yes, implementers do try to meet the spec. They might also do whatever they want, irrespective of any specification, without providing any documentation why https://bugs.chromium.org/p/chromium/issues/detail?id=1018580#c67

It would be a mistake to document internal implementation details that are subject to change at any time. Of course, since Chromium's source code is available, you are free to inspect it to learn about such details if you are interested.

Have no issue filing the feature request, again, if you cannot change the "Type" to feature request on the issue you closed.

I already explained why it would be better to file a new one.

Do not gather we will agree on interpretation of the specification re the meaning of "audiooutput" and your interpretation of "headphones" (output) meaning "Microphone" (input). You can resolve that inconsistency by updating the specification to make it clear that capturing an "audiooutput" device really means capturing an input device, not audio output from the system, in spite of what actually occurs at Firefox.

I don't think anyone has interpreted "headphones" (output) to mean "Microphone" (input). It is pretty clear that headphones are a good example of output devices and as such they would be listed by enumerateDevices() as kind "audiooutput". getUserMedia() cannot capture from them since it captures from input devices, but you can use their groupId to select an input device to be used by getUserMedia(). I see no inconsistency about this in the spec.

Finally, I don't think there is anything else to discuss in this issue since it is clear that what you want is Chromium to allow capturing audio output using getUserMedia() by exposing special "Monitor of..." input devices the way Firefox for Linux does. This does not require any change to the spec.

guest271314 commented 4 years ago

The actual question I have is why would a reader conclude that it would be possible for getUserMedia() to capture from output devices when its definition mentions only input devices.

Because

MediaDeviceKind Enumeration description audiooutput | Represents an audio output device; for example a pair of headphones.

at least amends, if not repeals and substitutes for

Note that this document describes the use of microphone and camera type sources only

by implication.

guidou commented 4 years ago

The actual question I have is why would a reader conclude that it would be possible for getUserMedia() to capture from output devices when its definition mentions only input devices.

Because

MediaDeviceKind Enumeration description audiooutput | Represents an audio output device; for example a pair of headphones.

at least amends, if not repeals and substitutes for

Note that this document describes the use of microphone and camera type sources only

by implication.

There is no amendment or substitution at all, or anything that implies it.

The MediaDeviceKind part that you mention is in Section 9, which defines enumerateDevices() (not getUserMedia). The definition of enumerateDevices() explicitly states that it allows querying input and output devices.

getUserMedia() is defined in a different section that says it deals with input devices (not output). Results are returned as MediaStream/MediaStreamTrack. The only supported sources for those MediaStreamTracks are microphones or webcams (i.e., input devices only) and those are the only sources supported in this spec.

Note that enumerateDevices does not deal at all with MediaStreamTracks or sources, while getUserMedia does not deal at all with MediaDeviceKind. There is no way to interpret from the spec text that one substitutes the other.

In short, output devices are mentioned only for enumerateDevices() and are not mentioned anywhere as possible sources for MediaStreamTracks/getUserMedia.

guest271314 commented 4 years ago

@guidou Did not write the code that implemented the methods. Assigned self the requirement to capture and record the output of speechSynthesis.speak() while not recording microphone. Achieved the requirement using the code at this issue. At the front end the observable result is microphone is not recorded, output to "speakers or headphones" is recorded. Therefore, by some means attributable to or derived from this specification, whether labeled "input" or "output", it is technically possible for Chromium to produce the output that Firefox produces. This issue merely asks for clarification in the specification and implementations that that is possible.

How do you explain the output at Firefox for the code, unless it is possible to isolate capture of "speakers or headphones" per this specification?

If isolated selection and capture of "speakers and headphones" is not intended whatsoever, why are those terms in the specification? And why is that specific output observable? Am merely asking to take account of what is possible in the field, clarify that output is possible, whether the term of art used is "input" or "output", officially clarify that some combination of the methods defined in this specification does provide a means to capture exclusively "speakers or headphones", and provide the canonical procedure to do just that. Or, remove all references to "speakers or headphones" and "audiooutput | Represents an audio output device; for example a pair of headphones." from the specification, making it clear that this specification does not support that output. Why would that be the case?

guest271314 commented 4 years ago

@guidou At a relatively recent version of Chromium was able to achieve the same output as Firefox, that is, not recording microphone, tested by playing sounds into the microphone during the procedure described at https://github.com/guest271314/SpeechSynthesisRecorder/issues/14#issuecomment-527020198. However, it should be possible to achieve that directly at the browser, which was the restriction for the requirement assigned to self: use only APIs and methods shipped with the ostensibly FOSS browser, which, given the state of the art, should be specified and unequivocally possible, without ambiguity, by default.

jan-ivar commented 4 years ago

As explained in https://github.com/w3c/mediacapture-main/pull/651#issuecomment-565195323 as well as in comments here, the model in this spec is source->sink, and Firefox's "Monitor of" device is an "audioinput", a source.

In contrast, "audiooutput" is a sink and is enumerated for use with setSinkId, not getUserMedia. Mandating the latter would be a mistake, as it would effectively prevent a browser from exposing an output device it does not also support capture of. The existing spec model doesn't have that problem, nor does it prevent such cases.
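As an illustration of the source/sink split described above, here is a minimal sketch, not part of the spec or of any implementation. It separates an enumerateDevices()-style list into "audioinput" sources (usable with getUserMedia) and "audiooutput" sinks (usable with setSinkId). The helper names partitionDevices, findMonitorSource, captureMonitor and routeToSink are hypothetical, and matching on a "Monitor of" label prefix is an assumption: labels are not standardized and are typically empty until the page has been granted capture permission.

```javascript
// Split a device list (as returned by enumerateDevices()) by MediaDeviceKind.
function partitionDevices(devices) {
  const sources = devices.filter((d) => d.kind === "audioinput");
  const sinks = devices.filter((d) => d.kind === "audiooutput");
  return { sources, sinks };
}

// Assumption: a monitor source, where exposed, carries a label starting
// with "Monitor of". Returns null when no such source is listed.
function findMonitorSource(devices) {
  const { sources } = partitionDevices(devices);
  return sources.find((d) => /^Monitor of /i.test(d.label)) || null;
}

// Browser-only: capture from the monitor source, which is an "audioinput".
async function captureMonitor() {
  const devices = await navigator.mediaDevices.enumerateDevices();
  const monitor = findMonitorSource(devices);
  if (!monitor) throw new Error("no monitor source exposed by this browser");
  return navigator.mediaDevices.getUserMedia({
    audio: { deviceId: { exact: monitor.deviceId } },
  });
}

// Browser-only: an "audiooutput" deviceId is passed to setSinkId on a media
// element, not to getUserMedia.
function routeToSink(audioElement, sink) {
  return audioElement.setSinkId(sink.deviceId);
}
```

Under these assumptions, a browser that lists no monitor source simply yields null from findMonitorSource, which matches the point that exposing a sink does not imply the ability to capture from it.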

The use case is very simple: capture audio output of speechSynthesis.speak() which outputs audio directly from the system

If the desire is to get at the output of speechSynthesis, please take that up with the working group responsible for speechSynthesis directly. Solving that here would be a hack in my view.

guest271314 commented 4 years ago

@jan-ivar

If the desire is to get at the output of speechSynthesis, please take that up with the working group responsible for speechSynthesis directly.

The use case is not limited to capturing speechSynthesis.speak() output. The source (input device) can be any audio that the system outputs.

Currently the Web Speech API does not have any algorithm language. A socket connection is made by the client browser to speech-dispatcher which executes festival, flite, espeak, espeak-ng or other speech synthesis module. No speech synthesis occurs in the browser. The executable (speech synthesis module) must be installed on the system for synthesis to occur. Besides bringing the technology to the fore within the domain of the Web platform, that is essentially the sum of the Web Speech API in its present state. There is no option to pass a file, capture the media output, or write to a file. Perhaps progress will be made there.

Solving that here would be a hack in my view.

The change being asked for is merely to make it clear that the device listed as the monitor of the default audio device MAY be exposed by implementations, to at least recognize that the option is available, even if implementers decide not to expose the monitor device.

Since there is no hope for language to be specified that we can capture the monitor of the input audio device source directly via constraints passed to getUserMedia(), what we are left with in the field is to try to create one or more hacks. Will eventually find a way to pipe the output of

ffmpeg -f pulse -i alsa_output.pci-0000_00_1b.0.analog-stereo.monitor

and

espeak-ng --stdout -d 0 'speak' | ffmpeg -i - -f opus -

to a MediaStreamTrack using JavaScript instead of exclusively piping output to a file first

| tee $HOME/test.ogg | chromium-browser --user-data-dir=$HOME/test $HOME/test.ogg

then fetching the file.

Will dive into https://github.com/pettarin/espeakng.js-cdn/blob/master/js/demo.js to substitute AudioWorklet for createScriptProcessor where it should be possible to "hack" a MediaStreamTrack for output (after loading 2MB of data https://github.com/pettarin/espeakng.js-cdn/blob/master/js/espeakng.worker.data, though that is less than 279.75 MiB of data https://webrtc.googlesource.com/src just to create a MediaStreamTrack of monitor of audio input exposed to the browser). Not ideal. Though that is the state of the art.

guest271314 commented 4 years ago

@jan-ivar

Happy New Year!

Created several workarounds, or what you might refer to as "a hack".

It turns out that Ubuntu ships with localhost enabled by default for testing the Apache server; therefore all that is necessary to use that test server is to save a script in the /var/www/html/ directory, e.g., index.php

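<?php
  if (isset($_POST["text_or_ssml"])) {
    header("Content-Type: audio/ogg");
    // Flags are passed through as-is; escape or whitelist them before exposing this beyond localhost.
    $options = $_POST["options"] ?? "";
    // escapeshellarg() quotes the text so apostrophes in the input cannot break out of the shell command.
    $text = escapeshellarg($_POST["text_or_ssml"]);
    // \${LD_LIBRARY_PATH} is escaped so that the shell, not PHP, expands it.
    echo shell_exec("ESPEAK_DATA_PATH=/home/user/espeak-ng LD_LIBRARY_PATH=src:\${LD_LIBRARY_PATH} /home/user/espeak-ng/src/espeak-ng -m --stdout " . $options . " " . $text . " | ffmpeg -i - -f opus -");
  }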

With the appropriate flags set, or with a CORS header included, localhost can be requested from any origin.
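For illustration, a request to that script from the browser might look like the following sketch. It assumes the script is reachable at http://localhost/index.php (the ENDPOINT constant is an assumption), and the helper names buildSpeechRequest and fetchSpeech are hypothetical. The field names text_or_ssml and options are the ones read by the PHP script above.

```javascript
// Assumption: index.php is served by the local Apache test server at this URL.
const ENDPOINT = "http://localhost/index.php";

// Build the POST request. A URLSearchParams body is sent as
// application/x-www-form-urlencoded, which PHP exposes through $_POST.
function buildSpeechRequest(textOrSsml, options = "") {
  const body = new URLSearchParams({ text_or_ssml: textOrSsml, options });
  return { url: ENDPOINT, init: { method: "POST", body } };
}

// Send the request and resolve with the encoded audio bytes.
async function fetchSpeech(textOrSsml, options = "") {
  const { url, init } = buildSpeechRequest(textOrSsml, options);
  const response = await fetch(url, init);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.arrayBuffer();
}
```

The returned ArrayBuffer could then be handed to something like AudioContext.decodeAudioData() or wrapped in a Blob for an audio element, depending on how the result is meant to be played or recorded.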

For a more elaborate solution, which wound up being a proof-of-concept for https://github.com/whatwg/html/issues/3443 and https://github.com/WICG/native-file-system/issues/97, created a pattern that provides a means to execute local arbitrary shell scripts and set the wav file to be used for --use-file-for-fake-audio-capture from the browser.

While testing the code it became obvious that there is no way to determine precisely when audio output of speech synthesis actually ends when the output mechanism is a MediaStreamTrack - at least not when using the approach of setting the local wav file to be played, as Chromium does not fire ended, mute, or unmute events for the MediaStreamTrack and the input, since we are potentially expecting SSML input text, can include

<break time="5000ms">

where if we test for silence https://stackoverflow.com/a/46781986 in order to determine when the expected audio output is complete we could prematurely call stop() during an intended <break time="5000ms">, and since a MediaStream is infinite there is no default end to the MediaStreamTrack in this case. However, Support SpeechSynthesis to a MediaStreamTrack (https://github.com/WICG/speech-api/issues/69) was the requirement, thus leave it to the OP of that requirement to find that out for themselves.
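One way to picture the trade-off described above is a silence detector that only reports the end of output once silence has lasted longer than the longest pause expected in the input. This is a rough sketch, not the approach used in the linked answer: createEndDetector, maxBreakMs and the 0.01 RMS threshold are all hypothetical, and the longest pause would have to be known in advance (for example by inspecting the SSML), which the discussion above does not provide.

```javascript
// Root-mean-square level of one block of PCM samples in the range [-1, 1].
function rms(samples) {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return samples.length ? Math.sqrt(sum / samples.length) : 0;
}

// Returns a function to call once per audio block. It returns true only after
// continuous silence has exceeded maxBreakMs, so a shorter intended pause
// does not trigger an early stop.
function createEndDetector({ sampleRate, maxBreakMs, threshold = 0.01 }) {
  let silentMs = 0;
  return function onBlock(samples) {
    const blockMs = (samples.length * 1000) / sampleRate;
    // Accumulate silence; any block above the threshold resets the counter.
    silentMs = rms(samples) < threshold ? silentMs + blockMs : 0;
    return silentMs > maxBreakMs;
  };
}
```

With maxBreakMs set above 5000, a <break time="5000ms"> would not end the recording, at the cost of waiting at least that long after the final audible block before stop() is called.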

It also turns out that Chrome OS is already using espeak-ng and AudioWorklet to output the result (https://chromium.googlesource.com/chromiumos/third_party/espeak-ng/+/refs/heads/chrome). Still, -m flag does not appear to be set, so SSML parsing (which, from perspective here, alleviates the need to define speech synthesis events, etc.) is not possible using that code.

In any event, your closure of this issue/feature request ironically led to revisiting prior interest in executing arbitrary shell commands using the browser as a medium https://gist.github.com/guest271314/59406ad47a622d19b26f8a8c1e1bdfd5.

guest271314 commented 4 years ago

@jan-ivar FWIW Initial implementation of proof-of-concept https://github.com/guest271314/native-messaging-espeak-ng.