guest271314 / SpeechSynthesisRecorder

Get audio output from window.speechSynthesis.speak() call as ArrayBuffer, AudioBuffer, Blob, MediaSource, MediaStream, ReadableStream, other object or data types

Microsoft "Natural" voices are not captured #20

Closed Enteleform closed 11 months ago

Enteleform commented 1 year ago

When setting utteranceOptions.voice to a "Natural" voice, the resulting audio contains only silence.

For example, these are the default voices that exist on an unconfigured installation of Microsoft Edge:

Microsoft Edge Voices
Microsoft David - English (United States)
Microsoft Mark - English (United States)
Microsoft Zira - English (United States)
Microsoft Natasha Online (Natural) - English (Australia)
Microsoft William Online (Natural) - English (Australia)
Microsoft Clara Online (Natural) - English (Canada)
Microsoft Liam Online (Natural) - English (Canada)
Microsoft Sam Online (Natural) - English (Hongkong)
Microsoft Yan Online (Natural) - English (Hongkong)
Microsoft Neerja Online (Natural) - English (India) (Preview)
Microsoft Neerja Online (Natural) - English (India)
Microsoft Prabhat Online (Natural) - English (India)
Microsoft Connor Online (Natural) - English (Ireland)
Microsoft Emily Online (Natural) - English (Ireland)
Microsoft Asilia Online (Natural) - English (Kenya)
Microsoft Chilemba Online (Natural) - English (Kenya)
Microsoft Mitchell Online (Natural) - English (New Zealand)
Microsoft Molly Online (Natural) - English (New Zealand)
Microsoft Abeo Online (Natural) - English (Nigeria)
Microsoft Ezinne Online (Natural) - English (Nigeria)
Microsoft James Online (Natural) - English (Philippines)
Microsoft Rosa Online (Natural) - English (Philippines)
Microsoft Luna Online (Natural) - English (Singapore)
Microsoft Wayne Online (Natural) - English (Singapore)
Microsoft Leah Online (Natural) - English (South Africa)
Microsoft Luke Online (Natural) - English (South Africa)
Microsoft Elimu Online (Natural) - English (Tanzania)
Microsoft Imani Online (Natural) - English (Tanzania)
Microsoft Libby Online (Natural) - English (United Kingdom)
Microsoft Maisie Online (Natural) - English (United Kingdom)
Microsoft Ryan Online (Natural) - English (United Kingdom)
Microsoft Sonia Online (Natural) - English (United Kingdom)
Microsoft Thomas Online (Natural) - English (United Kingdom)
Microsoft Aria Online (Natural) - English (United States)
Microsoft Ana Online (Natural) - English (United States)
Microsoft Christopher Online (Natural) - English (United States)
Microsoft Eric Online (Natural) - English (United States)
Microsoft Guy Online (Natural) - English (United States)
Microsoft Jenny Online (Natural) - English (United States)
Microsoft Michelle Online (Natural) - English (United States)
Microsoft Roger Online (Natural) - English (United States)
Microsoft Steffan Online (Natural) - English (United States)

 
The first 3 voices record as expected, but none of the subsequent "Natural" voices are captured.

Is there an additional step that must be taken in order for these voices to be captured?

guest271314 commented 11 months ago

Chrome sends a remote request for Google voices. Looks like Microsoft Edge is doing that, too. Notice the "Online" in the voice name.
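
One way to check this from script is to split the voice list into local and remote voices. The following is a minimal sketch, not part of SpeechSynthesisRecorder: partitionVoices is a hypothetical helper, localService is the standard SpeechSynthesisVoice attribute, and the "Online" name match is only a heuristic based on the Edge voice names listed above. Whether Edge actually reports localService as false for these voices is an assumption that should be verified in the browser.

```javascript
// Hypothetical helper: split SpeechSynthesisVoice-like objects into
// local and remote groups.
// Assumption: remote voices either report localService === false or
// carry "Online" in their name (as in the Edge list in this issue).
function partitionVoices(voices) {
  const local = [];
  const remote = [];
  for (const voice of voices) {
    const looksRemote =
      voice.localService === false || /\bOnline\b/.test(voice.name);
    (looksRemote ? remote : local).push(voice);
  }
  return { local, remote };
}

// Browser usage (the voice list may be empty until "voiceschanged" fires):
// const { local, remote } = partitionVoices(speechSynthesis.getVoices());
// console.log(remote.map((v) => v.name));
```

If the silent voices all land in the remote group, that would be consistent with the audio being fetched and played outside the path being recorded.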

I would suggest looking into the URL that is being requested; then you can make the request yourself. See https://github.com/guest271314/GoogleNetworkSpeechSynthesis.

Unless you are doing something like what is described in https://github.com/guest271314/SpeechSynthesisRecorder/issues/17, https://github.com/edisionnano/Screenshare-with-audio-on-Discord-with-Linux, or https://github.com/guest271314/captureSystemAudio#pulseaudio-module-remap-source, you are probably recording the microphone when using SpeechSynthesisRecorder. See the pinned issues.

Enteleform commented 11 months ago

Thanks for the info! For the project I was working on when I submitted the issue, I ended up using this:
https://github.com/Microsoft/cognitive-services-speech-sdk-js

guest271314 commented 11 months ago

"This project collects data and sends it to Microsoft to help monitor our service performance and improve our products and services."

Doesn't sound appealing to me.

Are external requests being made for speech synthesis?

Enteleform commented 11 months ago

Yes. It requires an Azure account and API key to initiate requests via the JavaScript API.

It worked out pretty well for my use case. I ended up requiring precise pronunciation, which can be controlled via phoneme usage.
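
For readers unfamiliar with phoneme control: SSML-capable services generally accept the standard SSML phoneme element, which attaches an explicit pronunciation to a word. The following is a minimal sketch of building such markup as a string. escapeXml, phoneme and buildSsml are hypothetical helper names, the IPA transcription in the usage comment is only an example, and how the resulting SSML is submitted to a given service is not shown here.

```javascript
// Escape text for safe use inside XML content and attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Wrap a word in an SSML <phoneme> element using the IPA alphabet.
function phoneme(word, ipa) {
  return `<phoneme alphabet="ipa" ph="${escapeXml(ipa)}">${escapeXml(word)}</phoneme>`;
}

// Build a complete SSML document from plain strings and { word, ipa } parts.
// `voice` is optional; any voice name passed in is the caller's assumption.
function buildSsml(parts, { lang = "en-US", voice } = {}) {
  const body = parts
    .map((part) =>
      typeof part === "string" ? escapeXml(part) : phoneme(part.word, part.ipa)
    )
    .join("");
  const inner = voice
    ? `<voice name="${escapeXml(voice)}">${body}</voice>`
    : body;
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xml:lang="${escapeXml(lang)}">${inner}</speak>`
  );
}

// Usage:
// buildSsml(["I say ", { word: "tomato", ipa: "təˈmɑːtoʊ" }]);
```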

guest271314 commented 11 months ago

I'm interested in local speech synthesis processing.

guest271314 commented 11 months ago

Have you asked Microsoft to release their speech synthesis engine to the public as FOSS?

Enteleform commented 11 months ago

"Have you asked Microsoft to release their speech synthesis engine to the public as FOSS?"

This seems very unlikely since Azure is a significant source of revenue for Microsoft. They have a decent free tier though, so it works fine for personal projects.

Feel free to close this issue if you feel that it's out of scope for the project.

guest271314 commented 11 months ago

Since Chromium authors refuse to capture monitor devices with navigator.mediaDevices.getUserMedia(), we have to create an audio device that maps to the speakers or other device output and set that as an input device, so we can capture the output with navigator.mediaDevices.getUserMedia().
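
Once such a remapped device exists, selecting it from script might look like the sketch below. This is a minimal illustration under stated assumptions, not code from this repository: findInputByLabel and captureRemappedOutput are hypothetical helpers, and the label hint (for example "Virtual_Microphone") depends entirely on how the virtual device was named when it was created. enumerateDevices() and the deviceId constraint are standard Media Capture APIs.

```javascript
// Hypothetical helper: find the first audio input whose label contains the hint.
// Note: browsers leave device labels empty until the page has been granted
// media permission, so this may need a prior getUserMedia() call.
function findInputByLabel(devices, labelHint) {
  const hint = labelHint.toLowerCase();
  const match = devices.find(
    (device) =>
      device.kind === "audioinput" &&
      (device.label || "").toLowerCase().includes(hint)
  );
  return match || null;
}

// Request a MediaStream from the remapped device.
// `mediaDevices` is passed in so the function can be exercised with a stub;
// in a page it would be navigator.mediaDevices.
async function captureRemappedOutput(mediaDevices, labelHint) {
  const devices = await mediaDevices.enumerateDevices();
  const device = findInputByLabel(devices, labelHint);
  if (!device) {
    throw new Error(`No audio input matching "${labelHint}"`);
  }
  return mediaDevices.getUserMedia({
    audio: { deviceId: { exact: device.deviceId } },
  });
}

// Browser usage (label hint is an assumption about the local setup):
// const stream = await captureRemappedOutput(navigator.mediaDevices, "Virtual_Microphone");
```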

If you are requesting remote speech synthesis, you might as well bypass the middle-man and request the speech synthesis directly from the remote servers.

I would have archived this repository by now; however, it is possible to remap to a virtual device as detailed above.

I suspect Microsoft Edge is using an extension with a background HTML <audio> element to play the sounds, so your microphone is not catching the output.

I'll leave it up to you to close the issue.