Identifying the use of synthesised voices for pre-recorded audio

HadrienGardeur commented 2 months ago

ONIX supports the concept of unnamed persons in order to identify that a contributor is actually:

- Unknown, anonymous or a group of various contributors
- a ~~TTS~~ synthesised voice (male, female, unspecified or based on a real voice actor)
- or an AI

In our techniques for full audio, displaying that info to the user seems extremely important. This is useful across audiobooks, EPUB and DAISY files, where knowing whether pre-recorded audio is human narrated or uses a synthesised voice could impact the user's decision to select a publication.

Since we lack the ability to express this information in EPUB, we should also explore how it could be represented there (probably by refining media:narrator).

wareid commented 2 months ago

TTS is on the reading system side though, not the EPUB/file side, so I don't think it's valid to push this requirement or metadata into the EPUB. I'd separate that out, but knowing if the synchronized audio or audiobook is AI-narrated would be beneficial, I agree with that.

HadrienGardeur commented 2 months ago

TTS is often used to automate the production of reflowable EPUB with media overlays or audiobooks. I'm not talking about TTS by the reading system here, but pre-recorded audio produced with a TTS engine.

madeleinerothberg commented 2 months ago

Long, long ago we had attributes of recorded and synthesized that could be applied to access features. The use case wasn’t really there so it was not brought forward to the more recent versions of the standard. Maybe the time is now to decide the best way to express that in our current metadata and the necessary terms.

-Madeleine

wareid commented 2 months ago

Instead of confusing terminology with a re-use of TTS, maybe it's better/clearer to specify "computer generated", "AI generated", etc.?

HadrienGardeur commented 2 months ago

Synthesised voices then? I wouldn't use AI for that, as only a subset of these voices use ML/AI at all.

wareid commented 2 months ago

I do think there is a distinction to be made, since generally the TTS-generated audio is a bit more "mechanical" sounding vs AI-generated/more advanced language models that sound more "natural". We could differentiate on that even, since the goal is setting user expectation around what they will be purchasing/borrowing.

mattgarrish commented 2 months ago

> Maybe the time is now to decide the best way to express that in our current metadata and the necessary terms.

Right, we added synchronizedAudioText as a feature, but that doesn't tell you anything about what kind of narration. I'm just not sure if this is a fit for a feature -- it probably belongs in the media overlays metadata.

I don't see that there's anything stopping us from recommending that values (or a pattern) similar to the ones @HadrienGardeur has already pointed out be expressed in the media:narrator property. Presumably, if you're okay with listing the name of the voice you used, you could swap that in for the generic gender descriptors. I expect users are going to assume a human narrator absent a synthesized label.
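
A minimal sketch of that pattern, assuming generic descriptors modelled on the ONIX unnamed-person labels. The exact strings and the voice name are hypothetical; nothing here is defined by the EPUB specification beyond the media:narrator property itself.

```xml
<!-- Option 1 (assumed wording): generic descriptor when the voice name is not published -->
<meta property="media:narrator">Synthesised voice (female)</meta>

<!-- Option 2: the publisher is willing to name the voice; "Example Voice" is a placeholder -->
<meta property="media:narrator">Example Voice (synthesised)</meta>
```

The two options are alternatives for the same narrator, not entries meant to appear together.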

The problem with a refinement property off the narrator is I'm not sure everyone will want to list the name of the voice they're using, so then what are you refining in that case?

HadrienGardeur commented 2 months ago

> I do think there is a distinction to be made, since generally the TTS-generated audio is a bit more "mechanical" sounding vs AI-generated/more advanced language models that sound more "natural". We could differentiate on that even, since the goal is setting user expectation around what they will be purchasing/borrowing.

Over the last few months, I've documented hundreds of voices available on various browsers/platforms and I can say that this is a tricky thing to do.

In my case, I used a quality property inspired by the values and descriptions returned by the Android API for voices.

We need to keep in mind that this is a fast-moving target and quality is constantly going up. A few years from now, knowing that this is based on an ML/AI based voice won't mean much as the quality profiles will have changed a lot. Since metadata are rarely updated (if ever), I wouldn't recommend listing such subjective information in an EPUB.

> The problem with a refinement property off the narrator is I'm not sure everyone will want to list the name of the voice they're using, so then what are you refining in that case?

Yeah, that's definitely an issue with metadata in EPUB. If we had an object model (let's say JSON), we would simply add this information under narrator alongside a name. Some of them would include a name, others wouldn't.

clapierre commented 2 months ago

@HadrienGardeur I get where you are coming from, and the quality of these voices is really becoming impressive. But I think knowing at the very least whether this was a human recording or generated will be an important distinction. Folks may want to seek out one over the other for a variety of reasons.

I also think that certain types of voices can be sped up, and knowing whether a book enables that could also be a benefit when deciding whether to purchase it.

HadrienGardeur commented 2 months ago

> […] this was a human recording or generated will be an important distinction. Folks may want to seek out one over the other for a variety of reasons.

Do you mean if the voice was fully generated or based on human recording? ONIX does a fairly good job with a specific code for that: "Synthesised voice – based on real voice actor".

I agree that this is useful, and it's a good example of where providing media:narrator with the name of the voice actor, then refining it with a code indicating a synthesised voice, would work quite well.

Voice cloning is becoming increasingly common.

It's a key feature of the new ElevenLabs Reader; Amazon announced earlier this week that they're rolling this out in beta for creators; it's at the core of Storytel's experiment; and Apple-silicon based devices even allow users to do that with support for what they call "Personal voices".

Aside from the usual companies in this field, there are also a number of open source voice cloning models.

mattgarrish commented 2 months ago

> If we had an object model (let's say JSON), we would simply add this information under narrator alongside a name.

Dare to dream! 😉

I'm sure we could always hack something together for epub with an eye to creating proper metadata for richer formats. Maybe the hack here would be to use "null" as a placeholder, so you might get something like:

<meta property="media:narrator" id="nar01">null</meta>
<meta property="media:voiceType" refines="#nar01">synthesized</meta>
<meta property="schema:gender" refines="#nar01">female</meta>

I don't particularly like it, but if this falls into the display guidance maybe that's the place to handle the presentation to users.

Alternatively, you could bump a property like gender up to be the default name when one isn't provided, as that can be more important to some people than whether the voice is synthesized (e.g., if you have high frequency hearing loss, male voices are usually easier to comprehend). That would avoid "null" getting displayed by any reading system/vendor that just picks out the narrator property, at least.
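
As a rough sketch of that alternative, the gender becomes the narrator value and the refinement carries the voice type. media:voiceType is the same hypothetical property as in the example above, not an existing term in the media overlays vocabulary.

```xml
<!-- Gender stands in as the displayed narrator name -->
<meta property="media:narrator" id="nar01">female</meta>
<!-- Hypothetical refinement flagging the voice as synthesised -->
<meta property="media:voiceType" refines="#nar01">synthesized</meta>
```

A reading system that only reads media:narrator would then show "female" rather than "null".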

At any rate, there are always ways to work around epub's metadata.

HadrienGardeur commented 2 months ago

> That would avoid "null" getting displayed by any reading system/vendor that just picks out the narrator property, at least.

I was checking ONIX descriptions for these codes again and they're all limited to "read by", which got me thinking.

Currently, media:narrator indicates the presence of a narrator AND contains the name of the narrator at the same time.

What if instead of a media:voiceType we had a property that indicated:

- that there's a synthesised narrator
- AND contained the gender as a value

With a synthesised voice based on a real voice, we could use that property to refine media:narrator:

<meta property="media:narrator" id="nar01">John Doe</meta>
<meta property="media:synthetisedNarrator" refines="#nar01">male</meta>

Whereas with a synthesised voice that's not based on a real voice, we could simply omit media:narrator and use this property directly:

<meta property="media:synthetisedNarrator">male</meta>

chrisONIX commented 2 months ago

Hello, I just wanted to clarify something regarding ONIX for those less familiar with it. This does not take anything away from this interesting discussion.

> I was checking ONIX descriptions for these codes again and they're all limited to "read by", which got me thinking.

In ONIX list 17, the role code E07 is labelled “Read By”, as this was the standard labelling on audiobooks when ONIX 2 was first released back in 2001; it is an original code. There is also an original code E03, labelled “Narrator”, used in the sense of a person who advances the action in a dramatised production. Both codes were added in the very first release of ONIX, before the widespread use of “narrator” for an actor reading an audiobook. It is always important to look at the concept behind any ONIX code, and this one works for artificial voices as well as real actors.

Also, the notes that say to use the synthesised voice codes with E07 are advice, pointing people to a potential use. They do not limit or restrict use with other role codes, and we can add other cases as advice when they are suggested. For the moment, E07 is the most common usage for these codes.

https://ns.editeur.org/onix/en/19

Thanks

Christopher Saynor, EDItEUR

w3c / publ-a11y

Identifying the use of synthesised voices for pre-recorded audio #400