Implement MediaSession API to support voice interactions for video/media playback

mcomella commented 6 years ago

User benefit

On the new Fire TV Cube, users can control video/media playback by voice through the MediaSession API. Supporting it would be convenient for our users.

Requirements

Investigate what we'd have to do to integrate
Integrate if possible

mcomella commented 6 years ago

An additional link: https://developer.amazon.com/docs/fire-tv/mediasession-api-integration.html

mcomella commented 6 years ago

Looking briefly into this, it looks like we need to instantiate a MediaSessionCompat instance and add some callbacks. However, WebView doesn’t appear to have APIs to query/modify audio/video playback so we might need to do this through JavaScript.

mcomella commented 6 years ago

fwiw, there is a MediaSession JS API which makes this a little harder to google. :)

mcomella commented 6 years ago

The code we write could be usable by others trying to integrate MediaSession with WebView: we should consider making it a library.

mcomella commented 6 years ago

I'd estimate this is a size M though it could turn into a size L if the JS turns out to be difficult/fragile.

mcomella commented 6 years ago

I think this is as straightforward as creating a MediaSession and registering callbacks. However, one thing is unclear: when do we need to call release? The docs seem to assume that media Activities will play back for their whole lifecycle (and thus release in onDestroy), but videos can be created on each page: should we create a new session for each page? Here's the code I wrote in onCreate to test without a device:

        // The second argument is an arbitrary debug tag for the session.
        mediaSession = MediaSessionCompat(this, "lol")
        // Declare which transport actions this session handles.
        val pb = PlaybackStateCompat.Builder()
                .setActions(PlaybackStateCompat.ACTION_PLAY or
                        PlaybackStateCompat.ACTION_PAUSE)
                .build()

        mediaSession.setPlaybackState(pb)
        mediaSession.setCallback(object : MediaSessionCompat.Callback() {
            val browserFragment get() = supportFragmentManager.findFragmentByTag(BrowserFragment.FRAGMENT_TAG) as BrowserFragment?
            override fun onPlay() {
                Log.d("lol", "play called")
                (browserFragment?.webView as FirefoxAmazonWebView?)?.evalJS("var vid = document.getElementsByTagName('video')[0]; vid.play();")
            }

            // Pause commands arrive via onPause (onStop is a separate callback).
            override fun onPause() {
                Log.d("lol", "pause called")
                (browserFragment?.webView as FirefoxAmazonWebView?)?.evalJS("var vid = document.getElementsByTagName('video')[0]; vid.pause();")
            }
        })

        // Create a controller for the session and attach it to this Activity.
        val contr = MediaControllerCompat(this, mediaSession)
        MediaControllerCompat.setMediaController(this, contr)

        // Test code to call callbacks
        contr.transportControls.play()
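
One caveat with the JS strings above: on a page without a video element, `getElementsByTagName('video')[0]` is undefined and the call throws. A minimal sketch of a guarded version follows, assuming we build the string in Kotlin before passing it to evalJS. `PlaybackAction` and `buildVideoJs` are hypothetical names for illustration, not existing code in the app.

```kotlin
/** Playback actions a voice command can map to (hypothetical enum for illustration). */
enum class PlaybackAction(val jsCall: String) {
    PLAY("play()"),
    PAUSE("pause()")
}

/**
 * Builds a JavaScript snippet that applies [action] to the first <video> on the page.
 * The snippet is wrapped in an IIFE so `vid` does not leak into the page's global scope,
 * and it does nothing when the page has no <video> element.
 */
fun buildVideoJs(action: PlaybackAction): String =
    "(function() { " +
        "var vid = document.getElementsByTagName('video')[0]; " +
        "if (vid) { vid.${action.jsCall}; } " +
        "})();"

fun main() {
    // Print the generated snippet; in the app it would be passed to evalJS instead.
    println(buildVideoJs(PlaybackAction.PAUSE))
}
```

Like the original strings, this only touches the first video on the page. Handling multiple videos, or audio elements, would need more logic.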

mcomella commented 6 years ago

Related notes:

The remote media buttons work for free on youtube.com/tv but not on other sites: I assume youtube handles the button events themselves.
Before this implementation, "Alexa, play" doesn't seem to work consistently on youtube.com/tv (which handles media buttons) and doesn't work at all on other sites. I'm guessing Alexa is, by default, sending media button events when these commands are spoken.

"Other sites" is being tested with this site: https://www.w3.org/2010/05/video/mediaevents.html

mcomella commented 6 years ago

"Alexa, play" doesn't seem to work consistently on youtube.com/tv (which handles media buttons)

I filed #936 for this issue. Since "Alexa pause" seems to send media button events, it seems like we'd only need to implement the MediaSession APIs (this bug) to control media playback on pages that don't support media button events themselves. #935 is to implement this for the hardware remote media buttons (which would use the same code as MediaSession to find videos on the page and stop them).

mcomella commented 6 years ago

To summarize:

If we fix #936, we'll fix "Alexa pause/play" functionality on youtube.com/tv (but not other MediaSession functionality like seeking, next/previous, restart).
This bug will create voice support for seek, next/previous, and restart. We can decide if we want to do this only for youtube (it's simpler to code specifically to its interface instead of the general case) or for all sites.
#935 is media button support for non-youtube sites (i.e. ones that don't support media buttons themselves): it will share code with this bug, if we support non-youtube sites.

@Sdaswani, what are we trying to accomplish with this bug? For voice, do we want to support: 1) Just play/pause on youtube 2) Play/pause/seek/prev-next/restart on youtube 3) ^ on all sites

mcomella commented 6 years ago

@Sdaswani Also, do we care about audio or just video for now?

mcomella commented 6 years ago

Alexa, without MediaSession, is sending media key events: https://github.com/mozilla-mobile/firefox-tv/issues/936#issuecomment-398216178

mcomella commented 6 years ago

I managed to get MediaSession callbacks to start working: we had to call requestAudioFocus. This means we'll have to figure out all the details of what calling this method, and thus being a "media app", entails.

mcomella commented 6 years ago

I managed to get MediaSession callbacks to start working: we had to call requestAudioFocus

And we have to call:

        val mediaController = MediaControllerCompat(this, mediaSessionCompat)
        MediaControllerCompat.setMediaController(this, mediaController)

mcomella commented 6 years ago

Next steps:

Questions (from https://github.com/mozilla-mobile/firefox-tv/issues/930#issuecomment-398212242 and below):
- Do we want to focus on both audio and video or just video?
- How thorough do we want this integration to be?
- My current assumption is: do video only, full integration
Figure out requirements to comply with being a media app; here are the docs:
- Amazon: intro to audio focus
- Amazon: specific requirements for media apps
- Amazon: testing audio focus
- Google: building a video player
- Google: building a video player callbacks
- Google: handling media buttons
- Google: managing audio focus
- Google: example media service
Add MediaSession init code
requestAudioFocus at appropriate times and handle other media app requirements
Add JS code to modify playback state
(optional) Optimize voice: https://developer.amazon.com/docs/fire-tv/mediasession-api-integration.html

mcomella commented 6 years ago

Notes on being a good media app:

"For video apps, your app should pause playback when voice capabilities are invoked (in your OnPause() method). When your app resumes, playback should remain paused until the user presses the PLAY button on the remote." (via here) This isn't what the Media Session Sample App does.
"The time period between your requestAudioFocus() and abandonAudioFocus() audio focus calls must exactly match the time you handle the focus change callback." (via requirements)
We should use the requirements doc as a checklist: it's quite thorough.

Some current behavior:

Video:

On youtube, we don't stop on voice commands (but it's janky: see #936).
On HTML5 demo video, we stop playback on voice commands
On youtube, we ? when the app is backgrounded
On HTML5 demo video, we stop playback when the app is backgrounded
We don't stop audio playback on voice commands

Audio:

On soundcloud, we stop playback on voice commands. I'm unable to resume playback (their interface is strange).
On soundcloud, we don't stop playback when the app is backgrounded (then we should really respond to media buttons here).

mcomella commented 6 years ago

@Sdaswani This is going to take longer than expected: I can't get the MediaSession API to work consistently. It's hard to make accurate estimates because I don't fully understand what's required to implement the API.

My current code is here. When I restart the device (i.e. get a clean slate), my code does not work (the onPlay/Pause methods are not called). However, if I open the Media Session Sample app first and then open my app, it works correctly – I'm guessing I'm not correctly managing the audio focus state and it's making my app interact with the system strangely.

I plan to dig into the Google sample code next to see if I can figure out what I'm missing.

mcomella commented 6 years ago

I can't get the MediaSession API to work consistently.

The Google sample app, Universal Media Player, fails in the same ways that my code fails. Also, Amazon's Media Session Sample App will not work on the first "Alexa pause" after a device restart (saying it again works).

This makes me wonder if I should just go full steam ahead with my code, that sort of works, and figure out this issue later.

mcomella commented 6 years ago

Another issue I found: when I rebooted the device, opened the sample app (pausing twice to get my app to work), opened my app to test, and said "Alexa pause", my app didn't appear to receive the media events but the Google sample app did (it started to play music). This doesn't really make sense to me (maybe it can register as a system provider that takes media events before the app is even opened?). It's also strange because I was granted audio focus (but "Alexa pause" causes me to lose it).

mcomella commented 6 years ago

Another weird experience: if I start the app, then go to a website with HTML video, and say, "Alexa play", nothing happens. However, if I start on the homepage, say, "Alexa play", the media command will be received. If I then go to the page with HTML video and say "Alexa play", the media command will be received.

~~I wonder if the WebView is managing audio focus somehow (in the former case, if I play the video, I will lose audio focus).~~

To summarize, to get my code to work, I need to (after a restart):

Open the Media Session Sample App, say "Alexa pause" twice (first time won't work)
~~Open Firefox, say "Alexa play" before web content~~

Media commands should now work.

edit: I fixed this by setting mediaSession.isActive = true before voice commands are issued.

mcomella commented 6 years ago

Actually, with mediaSession.isActive = true, I no longer need to launch the Media Session Sample App before voice commands start working (though I think the first one might be delayed).

Also, it doesn't appear that I need to requestAudioFocus, which is good because the WebView seems to steal audio focus from the app anyway (which means it's handling it for itself and we don't need to do any extra work there). :) This should simplify the implementation.

mcomella commented 6 years ago

Okay, I seem to have figured out why I've been getting inconsistent behavior. I needed to set mediaSession.isActive = true and this part:

        val pb = PlaybackStateCompat.Builder()
                .setActions(SUPPORTED_ACTIONS)
                .setState(PlaybackStateCompat.STATE_PLAYING, PlaybackStateCompat.PLAYBACK_POSITION_UNKNOWN, 1.0f)
                .build()

On a fresh run, the voice commands work correctly. However, if I set the state to PAUSED instead, Alexa says "What would you like to hear?" instead of executing the play command, which was unexpected to me. This raises the question: should Alexa be able to start, through voice, a video that was not playing?

mcomella commented 6 years ago

Status update: I've got a close-to-done WIP for play/pause state. The only problem is that the web site receives the media button events in addition to our code running, which undoes the playback state change. This WIP hacks around the MediaSession API, for implementation speed, but it appears to function correctly.

mcomella commented 6 years ago

Status update: play/pause, with the hack around the MediaSession API (state is always PLAYING), is up for review.

Next steps:

Investigate how much work it'd be to support the other commands: seek, next/previous, restart
See if we want to remove the brief pause after speaking a voice command
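
For the seek and restart part of that investigation, the page-side work could look much like play/pause. Below is a rough sketch under the same assumption as the existing code: JS strings are built in Kotlin and handed to evalJS, presumably from session callbacks such as onFastForward and onRewind. `buildSeekJs` and `buildRestartJs` are hypothetical helper names, and the 10-second step is an arbitrary choice.

```kotlin
/**
 * Builds JS that moves the first <video> on the page by [deltaSeconds]
 * (negative values rewind). Does nothing if the page has no <video>.
 * Hypothetical helper for illustration.
 */
fun buildSeekJs(deltaSeconds: Int): String =
    "(function() { " +
        "var vid = document.getElementsByTagName('video')[0]; " +
        "if (vid) { vid.currentTime = Math.max(0, vid.currentTime + ($deltaSeconds)); } " +
        "})();"

/** Builds JS that restarts the first <video> from the beginning. Hypothetical helper. */
fun buildRestartJs(): String =
    "(function() { " +
        "var vid = document.getElementsByTagName('video')[0]; " +
        "if (vid) { vid.currentTime = 0; } " +
        "})();"

fun main() {
    // Print the generated snippets; in the app they would be passed to evalJS instead.
    println(buildSeekJs(10))   // fast-forward by an arbitrary 10-second step
    println(buildSeekJs(-10))  // rewind by the same step
    println(buildRestartJs())
}
```

This covers only sites that expose a plain HTML video element. Next/previous are left out because what counts as the "next" video depends on the site.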

mcomella commented 6 years ago

@aminalhazwani "Alexa next/previous" will dispatch a "Media next/previous" keyboard button event. Some websites may handle this, e.g. youtube will advance to the next video, while others may not. When websites handle it, it's a good experience. However, when they don't, the user is left waiting for an action that will never happen. We can't tell when a website will handle the event but should we display a toast like, "Received next command" (with better copy), to notify the user their action has been received?

mcomella commented 6 years ago

Status update: I can get the device to acknowledge I want to include FF/rewind/next/previous commands but they're not getting delivered to my MediaSession: I wonder if it's because I'm not updating the playback position regularly.

edit: I restarted the device and now can fast-forward.

mcomella commented 6 years ago

Status update: I've opened a PR for a basic implementation that does not conform to the MediaSession APIs. I filed issues for follow-ups to better conform, which we can prioritize.

HiralSModi commented 6 years ago

We tested the 2.2 APK on FireTV Cube. Steps:

"Alexa, open Youtube" - This will launch the YouTube.com app.
Select Firefox. (version 2.2)
Then, select a video to play and click on it to start playback.
"Alexa, pause" - video paused
Again say "Alexa pause"

Expected behavior: Nothing happens
Observed behavior: Video resumes
Summary: It doesn't interpret utterances accurately and toggles between pause and play. Sometimes FF crashed while pausing and playing.

HiralSModi commented 6 years ago

crash_log_cube.txt Crash Log

Sdaswani commented 6 years ago

Thanks @HiralSModi I filed some new tickets: https://github.com/mozilla-mobile/firefox-tv/issues/965 https://github.com/mozilla-mobile/firefox-tv/issues/966

aminalhazwani commented 6 years ago

We can't tell when a website will handle the event but should we display a toast like, "Received next command" (with better copy), to notify the user their action has been received?

@mcomella sorry for the late reply, I missed the notification. Rather than visual feedback we could opt for instant voice feedback, something like "Ok" or "Got it". If possible we can then set a small timeout and then let users know that the website is not responding to the voice commands.

aminalhazwani commented 6 years ago

"Alexa, pause" - video paused

Again say "Alexa pause". Expected behavior: Nothing happens

@HiralSModi rather than "Nothing happens" what if Alexa says something like "Video/Media is paused", or "Video is already paused"? So on first interaction we have a visual/audio feedback coming from the video/audio itself. On second interaction we have Alexa jumping into the conversation for better clarity.

mozilla-mobile / firefox-tv

Implement MediaSession API to support voice interactions for video/media playback #930
