Word level text/audio synchronization granularity

burrage commented 7 years ago

This issue is a Bug(?)/Question

Related issue(s) and/or pull request(s)

Expected Behaviour

In the ReadiumJS Viewer UI (Chrome App as well), there is a section for text/audio synchronization granularity that doesn't appear to be operational:

[Screenshot: media_sync_granularity]

At word-level granularity, I would expect Readium to play the audio for a single word as defined in the SMIL and then stop (e.g. clicking on the word "tub" narrates "tub" and then stops).

Could I modify the JS code to do this now? I noticed these UI items would probably relate to calls like this in EpubReaderMediaOverlays.js:

readium.reader.updateSettings({ doNotUpdateView: true, mediaOverlaysSynchronizationGranularity: "word" });

Observed behaviour

Readium plays the clicked word and all subsequent words on the page.

Steps to reproduce

1. Load an EPUB with Media Overlay definitions.
2. Click on a word mid-sentence.

Test file(s)

Can share demo EPUB if needed-- let me know!

Product

Readium cloud reader app
- latest development build uploaded at https://readium.firebaseapp.com
- Chrome 56.0.2924.87

Additional information

danielweck commented 7 years ago

The SMIL (XML) files in your EPUB3 Media Overlays need to contain the necessary word / sentence / paragraph semantics, using the epub:type attribute on nested seq elements. If you already have word-level narration (i.e. the finest level of text-audio synchronisation), simply ensure that both your HTML markup and your SMIL documents include element wrappers for sentence and paragraph levels (nested structures).

Note that this is a Readium-specific feature which is not part of the official EPUB3.x standards. There was a discussion in the IDPF working group quite a while ago about this, but the concept never took off. Although this functionality is implemented in Readium (and works pretty well, including multi-level highlighting and page-boundary tracking), I am not aware of any available commercial / real-world publications that make use of this feature.

I am closing the issue, but please feel free to continue the discussion. I will try to find some demo content.

danielweck commented 7 years ago

Additional note: in terms of user experience, what I like about multiple synchronisation granularities (in this case: word, sentence, paragraph) is that this affects not only text highlighting (including the possibility of underscoring single words within a larger highlighted section, which aids reading / language comprehension), but this also affects user interaction: mouse clicks / finger touches (which trigger playback at the pointed location) are mapped with the chosen granularity, so that the user can interact directly with either single words, or larger blocks of text. Users can also invoke the previous / next playback commands which are mapped with the chosen word / sentence / paragraph level too (even when individual words are visually underscored within highlighted portions, using regular CSS styling). And finally, long form reflowable text that breaks across page boundaries can be tracked word-by-word instead of losing text/audio visual synchronization halfway through a split paragraph.

burrage commented 7 years ago

@danielweck Oooo, this is making a lot more sense now! And totally agree on the superb UX this feature adds-- great for language comprehension as you mentioned!

Adding the epub:type="word/sentence" definition to each nested sequence worked like a charm for allowing the UI buttons in the Viewer to be selectable.

But when I select "Word" level, it still plays everything so I think I'm still missing something from my understanding in how I would need to write the SMIL/XHTML.

I've uploaded a very short example with 2 sentences/4 words if you wouldn't mind having a quick look?

SMIL: Example SMIL
XHTML: Example XHTML

Is it because there are no wrapper IDs at the sentence level in the XHTML that are referenced from the SMIL side? I understand that, for individual words, the text element inside each par links to the corresponding ID in the XHTML, but do I need to do something similar for the sentence and word seq elements too?

And a quick follow-up question: if all EPUBs within a ReadiumJS cloud environment were authored down to the word level, is it possible to make "word" the default text/audio synchronization granularity in the Viewer?

Thanks so much, Daniel! Much appreciated.

danielweck commented 7 years ago

I forgot to mention seq@epub:textref="chapter.html#fragment_id" (which is part of the regular EPUB3 Media Overlays specification)

danielweck commented 7 years ago

In your OPF package file, you can have:

<meta property="media:active-class">-epub-media-overlay-active</meta>
<meta property="media:playback-active-class">-epub-media-overlay-playback-active</meta>

...and then add the corresponding CSS classes to create a highlighting style.

danielweck commented 7 years ago

At playback time, the CSS class specified for media:active-class is "attached" to the spoken HTML element. The CSS class specified for media:playback-active-class is "attached" to the root 'html' element.

danielweck commented 7 years ago

Now, here's where Readium's implementation is kind of "hacky": a special hard-coded CSS class called mo-sub-sync is attached to the lowest granularity level during playback, typically at the word level.

danielweck commented 7 years ago

So, to conclude, you can create some clever styling using CSS selector combinations such as:

html:not(.-epub-media-overlay-playback-active) body
{
    background-color: white;
    color: black;
}
html.-epub-media-overlay-playback-active body
{
    background-color: silver;
    color: red;
}

and:

html.-epub-media-overlay-playback-active *.-epub-media-overlay-active
{
    color: magenta;
}

html.-epub-media-overlay-playback-active *.-epub-media-overlay-active.mo-sub-sync,
html.-epub-media-overlay-playback-active *.-epub-media-overlay-active *.mo-sub-sync
{
    color: blue;
    border-bottom: 2px solid orange;
}

burrage commented 7 years ago

@danielweck Ah OK so it seems like everything was set up A-OK in the SMIL/XHTML and I see what's happening with Readium when you switch the granularity level in terms of highlighting with the addition of the epub:type="" definitions.

It seems like it is still not doing what I want it to do though in terms of stopping narration on a word by word basis which is what I thought the text/audio synchronization granularity was for.

With Media Overlays, is it not possible to only read a single word when clicking on that word? From your earlier reply, it sounded like it is possible?

For a concrete example, let's use the above screenshot as the use case.

If I clicked on the word "Go", is it possible to only narrate the word "go", highlight the word "go" as the active playback word, and then stop?

The SMIL would be as follows:

... 
      <seq epub:type="sentence" epub:textref="p1.xhtml#page1_sentence1">
          <par epub:type="word">
            <text src="p1.xhtml#smil_p1w1"/>
            <audio clipBegin="0.125s" clipEnd="1.095s" src="audio/scene1_child_stanza.mp3"/>
          </par>
          <par epub:type="word">
            <text src="p1.xhtml#smil_p1w2"/>
            <audio clipBegin="1.095s" clipEnd="1.995s" src="audio/scene1_child_stanza.mp3"/>
          </par>
...
      </seq>

And the XHTML:

...
    <p id="page1_sentence1">
        <span id="smil_p1w1">Go</span> <span id="smil_p1w2">to</span> ... 
    </p>
...

From what's happening on my end, when clicking on a word, it uses that word as the starting place to begin narration and highlighting and will run through all the narration and highlighting on a word by word basis until it reaches the end of the SMIL definitions.

Thanks again for the thorough responses, Daniel!

danielweck commented 7 years ago

The functionality you describe (i.e. play word when clicked/touched, stop immediately after word has played ... replace word with sentence, paragraph, or any other level of granularity) is not a standard Media Overlay feature per-se, as EPUB3's SMIL-synchronized text/audio is primarily designed for linear continuous narration. I can see how this reading/listening experience would be useful in some cases, but I am not sure how exactly this would be implemented (pause, resume, skippability, escapability, etc.), and standardised in all reading systems. I would say: better use some kind of JavaScript to trigger single-fragment playback.
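A minimal sketch of what such single-fragment playback could look like, assuming the publication's own script has access to a plain HTML5 audio element. playClip is a hypothetical helper, not a Readium or EPUB API, and the clip times would have to be copied from the SMIL by the author. Because timeupdate typically fires only a few times per second, the stop point is approximate.

```javascript
// Hypothetical helper (not part of Readium): plays one clip of an
// HTML5 audio element and pauses once playback reaches clipEnd.
function playClip(audio, clipBegin, clipEnd, onDone) {
    var finished = false;

    function onTimeUpdate() {
        // Compare against the element's own clock, not wall-clock time.
        if (audio.currentTime >= clipEnd) {
            stop();
        }
    }

    function stop() {
        if (finished) {
            return; // already stopped, nothing to do
        }
        finished = true;
        audio.removeEventListener("timeupdate", onTimeUpdate);
        audio.pause();
        if (typeof onDone === "function") {
            onDone();
        }
    }

    audio.addEventListener("timeupdate", onTimeUpdate);
    audio.currentTime = clipBegin;

    // In browsers play() may return a promise that rejects when playback
    // is blocked; this sketch simply ignores that rejection.
    var result = audio.play();
    if (result && typeof result.catch === "function") {
        result.catch(function () {});
    }

    // Returning stop lets a new tap interrupt the clip in progress.
    return stop;
}
```

With the SMIL quoted earlier in this thread, a click handler on the span with id smil_p1w1 could call playClip(audio, 0.125, 1.095), keeping the returned function so that the next tap can cancel a clip that is still playing.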

burrage commented 7 years ago

Ah, OK. Makes sense that the primary design is linear continuous narration.

As you probably gathered, the use case here is for early literacy/children's books. Basically trying to mimic "Read To Me" (which Media Overlays handle really well!) and "Read To Myself" functionality with the optional ability to have the word spoken/highlighted on demand if it is an unfamiliar word.

JavaScript handling of the single-fragment playback seems to be the path forward!

Thanks Daniel!

burrage commented 7 years ago

@danielweck Actually, a question for you, so that I start off on the right track and don't go down the wrong rabbit hole here...

As Readium moves from one par audio node to the next, is there some sort of callback I can use within an event listener to call readium.reader.pauseMediaOverlay(); once the node has finished playing (i.e. reaches clipEnd)?

I'm trying to leave as much of the ReadiumJS code untouched and do everything from within JS files associated with the EPUB if I can. And I think I can preserve both functionalities described in my last comment by creating a UI button within the EPUB that will read the page in a linear continuous fashion if desired and just toggle the Reader Viewer player controls through an eventListener when tapping individual words... just need to figure out when I could toggle the controls!

MarionBerthaut commented 6 years ago

Hi @danielweck

We need exactly the same option. We call it "Tap to say".

Tap to say: single-fragment playback. Play when clicked/touched, and stop immediately after.
Tap to play: linear continuous narration. Begin the narration from the clicked/touched location, and stop at the end of the audio file.

We followed your advice and used some JavaScript:

    function suiviAudioActive() {
        // Click on a rhesis
        $('.rhese').on(eventend, function(e) {
            if (window[nameSuiviAudioON] == "true") {
                e.preventDefault();
                stopAudioSuivi();
                $('.rhese').removeClass('suiviAudio');
                // Config overlay + audio
                $(this).addClass('suiviAudio');
                var numRhese = $('.rhese').index(this);

                var audio = document.getElementsByClassName('audioNarration')[0];

                audio.currentTime = tableauCurrentTime[numRhese];
                var firtCurrentTime = tableauCurrentTime[numRhese];
                audio.play();
                var timeInMsBegin = Date.now();
                $(".audioNarration").bind("timeupdate canplay", function() {
                    // Estimate the playback position from wall-clock time
                    var timeInMsRefresh = Date.now();
                    var Delta = (timeInMsRefresh - timeInMsBegin)/1000;
                    var ActualTime = firtCurrentTime + Delta;

                    // Stop once the start time of the next rhesis is reached
                    if (ActualTime > tableauCurrentTime[numRhese + 1]) {
                        stopAudioSuivi();
                        $('.rhese').removeClass('suiviAudio');
                    }
                });
            }
        });
    }

As you can see, our granularity is the rhesis rather than the word. In tableauCurrentTime we store the SMIL start time of each rhesis.
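For illustration, a table like this could be derived from the clipBegin / clipEnd attributes instead of being typed in by hand. The sketch below assumes the clock value forms used in EPUB Media Overlays (timecounts such as "1.095s", and "hh:mm:ss.fff" or "mm:ss.fff" clock values). parseSmilClock and buildClipTable are hypothetical names, and the input shape (objects with textId, clipBegin, clipEnd) is an assumption made for this example.

```javascript
// Hypothetical helper: converts a SMIL clock value string to seconds.
// Returns NaN when the value is not recognised.
function parseSmilClock(value) {
    var v = String(value).trim();

    // Timecount values: "1.095s", "250ms", "2min", "1h", or a bare number.
    var m = /^(\d+(?:\.\d+)?)(h|min|s|ms)?$/.exec(v);
    if (m) {
        var n = parseFloat(m[1]);
        switch (m[2]) {
            case "h": return n * 3600;
            case "min": return n * 60;
            case "ms": return n / 1000;
            default: return n; // "s" or no unit: already in seconds
        }
    }

    // Clock values: "hh:mm:ss.fff" or "mm:ss.fff".
    var parts = v.split(":");
    if (parts.length === 2 || parts.length === 3) {
        var seconds = 0;
        for (var i = 0; i < parts.length; i++) {
            var p = parseFloat(parts[i]);
            if (isNaN(p)) {
                return NaN;
            }
            seconds = seconds * 60 + p;
        }
        return seconds;
    }
    return NaN;
}

// Hypothetical helper: builds one { id, begin, end } entry per fragment.
function buildClipTable(pars) {
    return pars.map(function (par) {
        return {
            id: par.textId,
            begin: parseSmilClock(par.clipBegin),
            end: parseSmilClock(par.clipEnd)
        };
    });
}
```

Storing an explicit end time for each fragment, rather than relying on the start of the next entry, also covers the last fragment on the page, which has no following entry to compare against.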

The problem is that this solution is not reliable or precise enough. Depending on both the device and the reading system, the audio doesn't stop at the right moment. Checking Date.now() seems to be part of the problem. Is there any way to play audio for a fixed, predetermined duration?

Thank you in advance

danielweck commented 6 years ago

Yes, playing short audio clips as a "range" of bytes from a much larger file (MP3, MP4, etc.) is actually more tricky than it sounds, mostly because of cross-platform / cross-browser issues.

That's why the Readium "audio player" logic is a bit more complicated than it really needs to be: https://github.com/readium/readium-shared-js/blob/develop/js/views/audio_player.js#L32

The "audio" HTML5 element / API is used to buffer audio samples, and to seek to accurate time stamps. Maybe the WebAudio API is a better choice nowadays.
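As a rough sketch of the Web Audio route (not what Readium does): once the file has been decoded into an AudioBuffer, AudioBufferSourceNode.start(when, offset, duration) plays a fixed-length range without any polling. The helper names below are hypothetical, and the promise form of decodeAudioData is assumed to be available in the target browsers.

```javascript
// Hypothetical helper: plays the range [clipBegin, clipEnd) of an
// already-decoded AudioBuffer through the given AudioContext.
function playBufferRange(audioContext, audioBuffer, clipBegin, clipEnd) {
    var source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(audioContext.destination);
    // start(when, offset, duration): begin now, at clipBegin seconds into
    // the buffer, and play for exactly (clipEnd - clipBegin) seconds.
    source.start(0, clipBegin, clipEnd - clipBegin);
    // The caller can invoke source.stop() to interrupt the clip early.
    return source;
}

// Hypothetical helper: fetches and decodes the whole audio file once.
// Assumes fetch and promise-based decodeAudioData are supported.
function loadAudioBuffer(audioContext, url) {
    return fetch(url)
        .then(function (response) { return response.arrayBuffer(); })
        .then(function (data) { return audioContext.decodeAudioData(data); });
}
```

The trade-off is that the entire file is decoded into memory up front, which may be acceptable for short children's books but costly for long narration tracks.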

readium / readium-js-viewer