w3c / mst-content-hint

This is the specification for the content-hint attribute. This optional hint permits MediaStreamTrack consumers such as PeerConnection or MediaRecorder to encode or process track media with methods more appropriate to the type of content that is being consumed.
https://w3c.github.io/mst-content-hint/

Differentiate between speech for human and machine consumption #39

Closed. sjdallst closed this issue 4 years ago.

sjdallst commented 4 years ago

There are different requirements for speech that is meant to be listened to by a human and speech that is to be transcribed by a machine (signal preservation vs. listening experience). The current spec leans towards speech that is meant for human ears. It would be helpful for applications seeking to do speech recognition if there were a more specific contentHint for that purpose.

sjdallst commented 4 years ago

Here is a PR with what this might look like. https://github.com/w3c/mst-content-hint/pull/40
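
A rough usage sketch, assuming the hint string lands as "speech-recognition" per that PR (the exact value is up to the spec):

```js
// Sketch only: "speech-recognition" is the value proposed in PR #40.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const [track] = stream.getAudioTracks();

// Existing hint: process for human listeners (communications-style).
track.contentHint = 'speech';

// Proposed hint: preserve the signal for machine transcription.
track.contentHint = 'speech-recognition';

// Reading the attribute back shows whether the user agent accepted the value.
console.log(track.contentHint);
```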

guest271314 commented 4 years ago

Context defines

meant to be listened to by a human, and speech that is to be transcribed by a machine

The algorithm for the machine could include the requirement for no modification of the signal, for the purpose of analyzing the applicable context itself.

The machine does not care if there are artifacts included in the input or not. Those artifacts are part of the context, and can prove valuable, depending on the use case.

The machine can only do what the human tells the machine to do. Since humans write code that runs machines, speech input or output is always meant for humans.

The human must always check the work of the machine, particularly when the domain is speech processing.

The machine could output a recording of a human. A human could output audio through a machine. Would caution against attempting to differentiate between machine input and output and human input and output. It could lead to some group of humans sitting in a conference room deciding what counts as human or machine input and output for everyone outside of that room, based on their subjective perspectives.

The speech processing algorithm, and in fact the consumer of speech input, whether human or machine, should only need to know that the domain is speech input, without attempting to impose some taxonomy on the input. That, again, could lead to unintended consequences far beyond an API once a specification contains language of the form: "They already defined that N output must be a machine and X output must be a human in their specification; it must be a robot, a program, a machine, and not a human".

What are the observable differences between speech intended to be consumed by humans or machines?

Note, when lossy audio codecs are used for the signal, the output is always from the machine; the lost context makes the output always lacking, the ghost in the machine.

guest271314 commented 4 years ago

On the other hand, "speech" input can be in the form of code rather than audio output by a human or audio output to speakers or a device: a stream of bytes; markup, e.g., SSML https://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/; International Phonetic Alphabet notation; ASML https://www.w3.org/community/synthetic-media/wiki/Articulatory_Synthesis_Markup_Language; or other data structures.

From the gist of this issue, the consideration appears to be analysis of audio input to a MediaStreamTrack (e.g., from a microphone), rather than a "codec" for different forms of streaming (encoded) "speech" data?

guest271314 commented 4 years ago

Differentiating between speech for human and machine consumption could be considered moot if we still cannot capture speech output from window.speechSynthesis.speak().

Consider an individual with a speech impairment who uses Onboard http://manpages.ubuntu.com/manpages/bionic/man1/onboard.1.html to write and execute speak(); currently there is no specified way to capture that speech output.
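
For reference, a minimal sketch of the current API surface and the gap; the capture call in the trailing comment is purely hypothetical and not part of any spec:

```js
// Current Web Speech API: speak() hands text to the platform TTS engine
// and exposes no MediaStream or MediaStreamTrack to the page.
const utterance = new SpeechSynthesisUtterance('Hello from Onboard');
utterance.onend = () => console.log('Spoken, but never capturable as a track.');
window.speechSynthesis.speak(utterance);

// What the issues linked below ask for, roughly (hypothetical, unspecified):
//   const track = speechSynthesis.createMediaStreamTrack(utterance);
```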

https://stackoverflow.com/questions/45003548/how-to-capture-generated-audio-from-window-speechsynthesis-speak-call#comment82141734_45003549

Hi @guest271314, isn't this recording the user's mic - and not the actual synthesized speech? Is that what you intended? – Ronen Rabinovici Dec 1 '17 at 11:26

https://stackoverflow.com/questions/45003548/how-to-capture-generated-audio-from-window-speechsynthesis-speak-call#comment90686857_45003549

@guest271314, I used the code at plnkr.co/edit/PmpCSJ9GtVCXDhnOqn3D?p=preview but it still recorded from my microphone. – Jeff Baker Aug 15 '18 at 22:54

https://stackoverflow.com/questions/45003548/how-to-capture-generated-audio-from-window-speechsynthesis-speak-call#comment96709886_45003549

This doesn't record speaker output. I tried capturing tab audio using chrome extension but still failed. It seems speechSynthesis is not using HTMLmediaElement for audio hence we shall not be able to capture at tab/browser level. The audiooutput mentioned above returns "default " for both mic and speaker since there is no way to set "kind" field while setting constraints in getUsermedia, it always captures "mic". Let me know in case more details required. – Gaurav Srivastava Mar 4 '19 at 1:13

Support SpeechSynthesis to a MediaStreamTrack #69 https://github.com/WICG/speech-api/issues/69

Extending Media Capture and Streams with MediaStreamTrack kind TTS #654 https://github.com/w3c/mediacapture-main/issues/654

Support capturing audio output from sound card https://github.com/w3c/mediacapture-main/issues/629

Will continue to request this feature, though for reasons beyond this user's control: the fraudulent rationale and actions of WICG/W3C, as evidenced by the prima facie absurdity and hypocrisy (after following their suggestions and instructions for joining the organizations) of the 1,000 year ban imposed as a reprisal by the organization, barring contributions to projects under that organization's umbrella for not fitting into some predefined neat box (screenshot attached: Screenshot_2020-04-15_14-26-11).

Perhaps your advocacy for TTS/STT technologies being implemented via MediaStreamTrack can help nudge that gap in coverage along to specification and implementation.

sjdallst commented 4 years ago

@guest271314 I agree with your point that there is a need to add a stream form of output for the Speech-API. I think there are some limitations with the current implementations that make this difficult (the OS speech generator is the thing speaking words to you instead of the Web Platform, so the browser currently has no access to anything like an audio stream). To fix this would require interest from the web community (and there seems to be some) as well as interest from implementers (not sure about this), and a solid path forward in the technical sense. Windows provides an audio stream in some of its newer APIs which might be usable for this scenario... but I'm not sure about other platforms. Maybe it would be possible cross-platform with Chromium's chrome.ttsEngine API, which allows developers to create their own TTS engines.
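
For context, a minimal sketch of how an extension hooks into Chromium's chrome.ttsEngine API (assuming the manifest declares the "ttsEngine" permission and a "tts_engine" voices section):

```js
// Extension background script; the manifest also needs the "ttsEngine"
// permission and a "tts_engine" section listing the voices it provides.
chrome.ttsEngine.onSpeak.addListener((utterance, options, sendTtsEvent) => {
  sendTtsEvent({ type: 'start', charIndex: 0 });
  // Synthesize `utterance` here (e.g. with Web Audio); because the extension
  // generates the samples itself, exposing them as a stream becomes conceivable.
  sendTtsEvent({ type: 'end', charIndex: utterance.length });
});

chrome.ttsEngine.onStop.addListener(() => {
  // Abort any synthesis in progress.
});
```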

However, I think that this is off topic for the issue that this thread is based on. The issue above points out a problem developers are facing: they can currently mark a stream as "speech" in the sense that it is being used for communications, but they would also like to mark a stream as being used for speech recognition, so that consumers, or even the platform, can make appropriate adjustments for that use case.

Appropriate adjustments for speech recognition include anything that will increase the precision and accuracy of speech recognition machines/services. There has been some effort to standardize which scenarios should be optimized with regard to speech recognition, and there are well-established standards for communications. If you look through the requirements of both you will find that they are at odds in places (one example is that communications combine adding pleasant background noise with noise suppression, which is at odds with the goal of signal preservation for most speech recognition engines).

Here are links to standards documents that illustrate some of the differences between the two use cases: Communication, Speech Recognition
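
To make the divergence concrete, here is a sketch using existing getUserMedia() constraints; nothing here is mandated by the linked documents, and the "speech-recognition" hint value is the one proposed in #40:

```js
// Communications-style capture: processing on, pleasant for human listeners.
const callStream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true }
});
callStream.getAudioTracks()[0].contentHint = 'speech';

// Recognition-style capture: keep processing off to preserve the signal.
const asrStream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: false, noiseSuppression: false, autoGainControl: false }
});
asrStream.getAudioTracks()[0].contentHint = 'speech-recognition'; // proposed in #40
```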

guest271314 commented 4 years ago

@guest271314 I agree with your point that there is a need to add a stream form of output for the Speech-API. I think there are some limitations with the current implementations that make this difficult (the OS speech generator is the thing speaking words to you instead of the Web Platform, so the browser currently has no access to anything like an audio stream). To fix this would require interest from the web community (and there seems to be some) as well as interest from implementers (not sure about this), and a solid path forward in the technical sense.

Actually, all that should be required is for Chromium/Chrome to implement capturing "Monitor of <audio_device>" as Firefox does. Then users could capture audio output from speakers using getUserMedia(); see the linked Media Capture and Streams issues.
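
A sketch of what that would enable, assuming the user agent exposed a loopback device the way Firefox does with PulseAudio (the "Monitor of ..." label is platform-dependent):

```js
// Device labels are only populated after a prior getUserMedia() grant.
const devices = await navigator.mediaDevices.enumerateDevices();
const monitor = devices.find(
  d => d.kind === 'audioinput' && d.label.startsWith('Monitor of')
);
if (monitor) {
  // Captures what the system plays out, including speechSynthesis.speak() audio.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { deviceId: { exact: monitor.deviceId } }
  });
}
```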

sjdallst commented 4 years ago

@alvestrand What do you think of adding another hint?

alvestrand commented 4 years ago

Closing this issue per #40. New proposals should be added as new issues.