shaka-project / shaka-player

JavaScript player library / DASH & HLS client / MSE-EME player
Apache License 2.0
7.1k stars 1.33k forks source link

[ViacomCBS] Samsung fatal stall on preroll to first content period. #2620

Closed dsparacio closed 4 years ago

dsparacio commented 4 years ago

Have you read the FAQ and checked for duplicate open issues? Yes completely.

What version of Shaka Player are you using? Tested in Shaka 2.5.12, 2.5.11, 2.5.5 and 2.5.1 - Same results on all

Can you reproduce the issue with our latest release version? Yes

Can you reproduce the issue with the latest code from master? Yes

Are you using the demo app or your own custom app? Demo raw simple setup. Page urls included in support email with content and la urls.

If custom app, can you reproduce the issue using our demo app? yes

What browser and OS are you using? This issue is only being reported and reproduced by our team for SOME Samsung Tizen based Smart TV mainly years 2017 and 2018. 2019 and 2020 year seems unaffected. Some 2017 and 2018 are also working as expected!!

For embedded devices (smart TVs, etc.), what model and firmware version are you using?

* 2017 UN50MU6070 Fails 100% * 2018 UN55NU710D Fails 100% * 2018 UN32N5300AF PASSES * 2018 UN49NU8000 PASSES **What are the manifest and license server URIs?** Sent over on shaka-player-issues@google.com **What did you do?** * Play from 0 start time and let preroll finish, on the first transition to a content period player stalls. * When a samsung model year has the issue, it happens 100% * Does not happen with the exact same encode of the stream that is not passed though DAI. * Does not happen with the exact same encode pass to DAI but no DRM. * In logs. DRM keysystem, when we fail, both seem to init fine. I tested WV and PR here. * Non zero test are interesting. * If you skip the pre-roll and start past that second in stream... all mid-roll DAI period transition thereafter works as expected. * Stream is fine on all TVs we tested. * Dash.js does play the same DAI/DRM steam without issue on all the tested devices. * Shaka is much better overall quality of experience so it is preferred on Samsung I am also including logs in the support email with content. These logs will be from both fail and pass captures in a few different versions of Shaka mentioned above. **What did you expect to happen?** The first content period initializes and plays out. **What actually happened?** The content period fails to init and play properly.
dsparacio commented 4 years ago

One update here 2017 UN50MU6070 Fails 100% This fails differently than the 2018 tv. Same point in time. Logs have different recovery and loop to them. And sometimes another ad period starts to play. We never get into content periods.

2018 UN55NU710D Fails 100% Fails always at first period break... and stays there until page app is closed on TV. never picks up.

chrisfillmore commented 4 years ago

There are several issues related to ad playback on Tizen platforms, you can take a look at #1094 and follow the thread.

Ultimately, even with fixes in the JS/app layer, we still found we had playback problems on Tizen devices if we started playing clear content before unencrypted content. We attributed this to bug(s) in the Tizen media layer. So, our solution was to force playback to start during protected content. Are you able to skip your prerolls? I'll bet this would solve the problem.

dsparacio commented 4 years ago

@chrisfillmore Thanks for the added info and point me to that ticket. Same experience.

if you skip the pre-roll and start past that second in stream... all mid-roll DAI period transition thereafter works as expected.

Totally get that this is a Tizen EME spec/implementation issue. Side note, all the Issues we are having with Dash.js are different but equally annoying. I was a main contributor for years on Dash.js and understand the logic in the lib Most likely not something wrong in that application either. Tizen has so many variations in model years and the shipped firmware and other factors that make this so hard to debug.

Again, thanks so much for the feedback. Anything helps right now.

dsparacio commented 4 years ago

@chrisfillmore... forgot to mention..... Dash.js does not have an issue with DAI unexcrypted pre-roll period into ContentProtected CBS period. The lib initilizes DRM without issue. Same TV, same content. Dash.js then fails to stream or be stable thereafter where Shaka is 100% stable post pre-roll.

My point to bring this up is not to say one is better....clearly shaka is! Rather to ask, with 1094, although I have not read fully yet, seems to make it sound like it is not possible without adding Content Protectionin the DAI period?

TheModMaker commented 4 years ago

Dash.js resets the entire media pipeline on Period transitions. This allows it to switch between audio+video and video-only content. We keep the media pipeline initialized and only transition between compatible Periods. The browser may have trouble transitioning between clear and encrypted content and Dash.js is resetting the pipeline so it plays.

IIRC, there was a similar issue before where we couldn't change to fully encrypted content later. Do you add Shaka configuration for encryption (i.e. license server) all the time? After playback starts, do you see video.mediaKeys as null when starting in clear content? When you start with clear content, is a protected Period available in that manifest or does a protected Period get added in a manifest update? If we don't see a protected Period at all, we may not initialize EME and you sometimes can't add EME once playback has started.

dsparacio commented 4 years ago

@TheModMaker -

Do you add Shaka configuration for encryption (i.e. license server) all the time?

Yes, we config Shaka with DAI media url with LA urls for both Widevine and Playready.

After playback starts, do you see video.mediaKeys as null when starting in clear content?

Ill have to run this again and test that. Here is the top part of the log and Ill attach on log when working vs one when failing. Same content, both Tizen, different models. Not even years just models

indexed_db.js:36 IndexedDB.install
input_event.js:35 InputEvent.install
mathround.js:41 mathRound.install
mediasource.js:36 MediaSource.install
video_play_promise.js:35 VideoPlayPromise.install
vttcue.js:36 Using native VTTCue.
player.js:811 Starting attach...
player.js:1009 Starting load of https://dai.google.com/ondemand/dash/content/2497752/vid/19A7D434-8AC8-3835…54E32CBA735B/DLS/streams/330caf08-26eb-4bf5-aaa4-1beec7dc52bb/manifest.mpd...
player.js:1993 Found variant with audio and video content, so filtering out audio-only content in all periods.
drm_engine.js:929 Created MediaKeys object for key system com.microsoft.playready
player.js:2056 codecs avc1-mp4a avg bandwidth 2205296.1666666665
player.js:4138 onChooseStreams_ Object
player.js:4177 Choosing new streams after period changed
streaming_engine.js:458 init: completed initial Stream setup
drm_engine.js:628 Ignoring duplicate init data.
drm_engine.js:628 Ignoring duplicate init data.
drm_engine.js:628 Ignoring duplicate init data.
drm_engine.js:628 Ignoring duplicate init data.

When you start with clear content, is a protected Period available in that manifest or does a protected Period get added in a manifest update?

All of our content periods have both PlayReady and WV ContentProtection. We do not xlink update anything in the manifest nor are we doing any updates to manifest post first request. So all periods have all the content info already at start.

Dash.js resets the entire media pipeline on Period transitions.

Correct and when you tear down and reinit source buffers like Dash.js does, currerntTime sets back to 0 on video element for an instance. This is a split second time change causes much trouble with our internal AD rules.

dsparacio commented 4 years ago

@TheModMaker

Was thinking more about this and just had this come across my mind.

  1. We know Dash.js successfully transitions from “clear to protected” periods on all Samsung models
  2. We know Dash.js forces teardown and recreation of source buffers (all periods),
  3. We know Shaka does not do this for good reason, to have cleaner Multiperiod transitions.
  4. And it seems based on #1094 that there is a clear known issue on some models of Samsung that will not init DRM properly when a clear period is played first in the SAME source buffer.

So I am wondering, could the Shaka team add a config to set the player in a mode to tear down only the first source buffer when when the first clear to protection period transition happens? then keep it solid the rest of the presentation?

joeyparrish commented 4 years ago

I'm not sure how we could tear that down cleanly, or how we would make a special case for it. It would be a big change from what we do now.

I'll be trying out your content shortly in Shaka Player v3.0, in which the way we parse and represent multi-period DASH has completely changed. It's possible that this will make a difference.

joeyparrish commented 4 years ago

I get 404 errors for the mp4 segments in v3.0, but not in v2.5.12. I'm testing on ChromeOS as a baseline.

So that means we have a unique issue on v3.0 that may be unrelated.

dsparacio commented 4 years ago

@joeyparrish Thank you so much for your help. Let me know how I can help. If you need more device testing on a 3.0 version etc, we have many models available.

joeyparrish commented 4 years ago

We have fixed a bug in our DASH parser that was present in v3.0.0. Would you be willing/able to test the latest master branch code in your environment to see if this Tizen+DRM+ads issue is improved in master?

dsparacio commented 4 years ago

Not at all, @joeyparrish I will run these test tomorrow and get back to you asap.

dsparacio commented 4 years ago

@joeyparrish Tested both branches listed below both with no change, or same result, freezing, same spot, same log results.

v3.0.x branch version reported in console shaka.Player.version "v3.0.0-25-gdaa625d8-debug"

Master branch version reported in console shaka.Player.version "v2.5.13-master-15-g34337ca2-debug"

I wanted to make sure I have the right version, master's current version is 2.5.13.

Models Tested 2017 UN50MU6070 2018 UN55NU710D

dsparacio commented 4 years ago

All - we've run into the same issue on the XBOX platform with the Shaka player. Identical behavior as samsung with inability to transition from unencrypted ad content to encrypted main content at the preroll only.

Reproduced on these devices:

Sandbox ID: XDKS.1
OS version: 10.0.19041.3068 (rs_xbox_release_2006.200610-0000)
OS edition: June 2020
Xbox Live device ID: FD00A956242E1869
Console ID: 74f233ae.f9265a03.c4fee763.851eb008.01
Serial number: 210633693016
Console type: Xbox One S
Model: 1681
Devkit type: Universal Windows App Devkit
Sandbox ID: XDKS.1
OS version: 10.0.19041.3068 (rs_xbox_release_2006.200610-0000)
OS edition: June 2020
Console ID: A7DCC4D8.6B218CCF.694B68FC.0D3DB308.01
Model: 1681
Console type: Xbox One S
Devkit type: Universal Windows App Devkit
Sandbox ID: XDKS.1
OS version: 10.0.19041.3068 (rs_xbox_release_2006.200610-0000)
OS edition: June 2020
Xbox Live device ID: FD00812AA7534FE0
Console ID: 4c279407.7e461c98.d56bb52f.dfb64007.01
Serial number: 051235192517
Console type: Xbox One X
Devkit type: Universal Windows App Devkit
joeyparrish commented 4 years ago

Thank you for the extra info. We also have an Xbox One in the lab, but we've never had any automation or remote access for that platform, and we're all still working from home.

The Tizen device in our lab, however, will be accessible again soon. I've been working on repairing the infrastructure that will allow us to access it remotely in automated tests. It won't be as good as interactive debugging, but hopefully, with that and some concentrated effort on reproducing the issue through automation, it might be enough for us to make progress.

We apologize for the inconvenience and the time it has taken us to help with this, and we thank you again for your patience.

dsparacio commented 4 years ago

@joeyparrish Thank you for the quick response. Totally get it, we are suffering from the same issue here due to access to device labs and remote working. Hopefully this sheds a bit more light on this issue since not confined to Samsung implementation. As always, let me know how our team can help. Thank you for even looking into this for us. Very much appreciated and I know how this goes....

joeyparrish commented 4 years ago

We now have access to our Tizen device again, at least as far as automated tests in Karma/Jasmine are concerned. I'll try to focus on reproducing this situation in an automated test. After that, we may be able to use that as a platform to debug and make progress on a workaround.

joeyparrish commented 4 years ago

I'm now able to attempt to play your content using our automated tests. Since this is a long and obscure process, I'm going to try to record all the steps I'm taking in as much detail as I can.

We have a generic, automated playback test that we use on our demo content. For VOD, it plays the first and last 30 seconds the first 30 seconds and last 15 seconds of the content and then checks that there were no errors and that video.ended is true.

build/test.py has arguments for running custom assets through that playback test. It looks roughly like this:

./build/test.py \
  --test-custom-asset MANIFEST_URL \
  --test-custom-license-server com.widevine.alpha=WIDEVINE_SERVER_URL \
  --test-custom-license-server com.microsoft.playready=PLAYREADY_SERVER_URL

There was a bug in the test which prevented --test-custom-license-server from working, but a fix for that is now in review.

Since the MPD URL was not constant, I took the instructions you sent privately for generating the MPD URL, and I built a shell script that outputs just the URL, so I could easily run the test repeatedly.

./build/test.py \
  --test-custom-asset "$(./get-2620-url.sh)" \
  --test-custom-license-server "com.widevine.alpha=$(./get-2620-widevine-server.sh)" \
  --test-custom-license-server "com.microsoft.playready=$(./get-2620-playready-server.sh")

I got an EME failure on IE11 (MSPR_E_INVALID_CDMSESSION, same as #1689), and the test was skipped on Safari (no PlayReady or Widevine), but the content was playable on every other desktop platform, plus Chromecast and Android.

On Tizen, when the test seeks to 30 15 seconds from the end, playback stalls until the test times out. So I'll call that a repro!

joeyparrish commented 4 years ago

Test output:

Safari 3.0.0 (Tizen 3.0.0) Player plays DEMO_CUSTOM / custom : .../manifest.mpd FAILED
        Error: Timeout waiting for movement from 1490.662 to 1491.662
            at _class.waitUntilGeneric_ (test/test/util/waiter.js:166:19 <- test/test/util/waiter.js:203:19)
            at _class.waitUntilPlayheadReaches (test/test/util/waiter.js:90:17 <- test/test/util/waiter.js:111:19)
            at _class.waitForMovement (test/test/util/waiter.js:59:17 <- test/test/util/waiter.js:76:19)
            at Function.waitForMovementOrFailOnTimeout (test/test/util/util.js:423:19 <- test/test/util/util.js:682:21)
            at _callee3$ (test/player_external.js:149:24 <- test/player_external.js:213:29)
            at tryCatch (node_modules/babel-polyfill/dist/polyfill.js:6900:40)
            at GeneratorFunctionPrototype.invoke [as _invoke] (node_modules/babel-polyfill/dist/polyfill.js:7138:22)
            at <Jasmine>
            at step (test/player_external.js:3:191)
            at test/player_external.js:3:361
Safari 3.0.0 (Tizen 3.0.0): Executed 1 of 1 (1 FAILED) (57.734 secs / 54.374 secs)
joeyparrish commented 4 years ago

I added --uncompiled to run tests using the uncompiled sources and --enable-logging=debug to get log output from the device during the test run.

After seeking to close to the end of the stream, we wait for the playhead to advance by 1 second before checking additional expectations. When this happens, the logs show a time of 1490.67. Then the stall detector starts firing, each time with the playhead within a few ms of time 1489.104. Somehow, it went backward between those two events.

When the test times out, the logs show that the ready state is 4 (HAVE_ENOUGH_DATA), we're not in a paused or buffering state, and that the buffered range (1483.114 - 1501.103) includes the current time. The buffered range makes sense in combination with the stall detector firing, because a "stall" is defined there as any time the playhead is inside a buffered range, not paused, and not moving.

The streaming.stallEnabled config defaults to true, except on WebOS. If I expand that so that it also defaults to false on Tizen (lib/util/player_configuration.js), there is no noticeable change except that the stall detector logs go away.

Logs:

LOG LOG: 'Waiting for movement from 1490.67 to 1491.67'
LOG LOG: '(audio:1)', 'clear: handling right now'
LOG LOG: '(audio:1)', 'clearing buffer'
LOG LOG: '(video:7)', 'clear: handling right now'
LOG LOG: '(video:7)', 'clearing buffer'
LOG LOG: '(audio:1)', 'cleared buffer'
LOG LOG: '(video:7)', 'cleared buffer'
WARN LOG: 'No variants met the ABR restrictions. Choosing a variant by lowest bandwidth.'
LOG LOG: 'Calling switch_(), bandwidth=3958 kbps'
LOG LOG: 'switch_'
LOG LOG: 'Stall detected at 1489.104 for 1.002000093460083 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
LOG LOG: 'Stall detected at 1489.106 for 1.001000165939331 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.206'
LOG LOG: 'Stall detected at 1489.104 for 1.001999855041504 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
LOG LOG: 'Stall detected at 1489.104 for 1.001000165939331 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
LOG LOG: 'Stall detected at 1489.106 for 1.000999927520752 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.206'
LOG LOG: 'Stall detected at 1489.107 for 1.002000093460083 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.206999'
LOG LOG: 'Stall detected at 1489.106 for 1.001999855041504 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.206'
LOG LOG: 'Stall detected at 1489.105 for 1.001000165939331 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.205'
LOG LOG: 'Stall detected at 1489.104 for 1.000999927520752 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
LOG LOG: 'Stall detected at 1489.104 for 1.002000093460083 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
LOG LOG: 'Stall detected at 1489.104 for 1.002000093460083 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
LOG LOG: 'Stall detected at 1489.104 for 1.000999927520752 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
LOG LOG: 'Stall detected at 1489.104 for 1.002000093460083 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
LOG LOG: 'Stall detected at 1489.105 for 1.000999927520752 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.205'
LOG LOG: 'Stall detected at 1489.104 for 1.002000093460083 seconds. Seeking forward 0.1 seconds.'
LOG LOG: '(all): seeked: buffered seek: presentationTime=1489.204'
ERROR LOG: 'Timeout waiting for movement from 1490.67 to 1491.67', 'current time', 1489.104, 'duration', 1505.67, 'ready state', 4, 'playback rate', 1, 'paused', false, 'buffered', [Object{start: 1483.114, end: 1501.103}]
joeyparrish commented 4 years ago

Interestingly, none of the seeks triggered by the stall detector had any effect on the playhead position. That may be an indication of the media pipeline being hung in some fundamental way under the hood of the browser.

The content has a ~30-second ad at the beginning, followed by a 30 second encrypted content period. The final period is encrypted main content, so seeking to duration - 15 definitely puts us in an encrypted segment.

I modified the test to pass a start time to load(), to start playback in encrypted main content instead of the first ad. The first main content period has 6-second segments, so I'm starting playback at time 40, which should be in the second segment of the second period, even if there's a rounding error somewhere. This should ensure that the first init segment fed to MediaSource is one that indicates encryption. Setting up EME should be enough, but Tizen TVs are full of bugs, so who knows.

Starting playback at time 40 didn't change the results.

I modified the test again to start at time 1490, close to the end. This way, we will only seek once, only play one period, and never play an ad period.

Playback continued successfully until the end of the content at 1505.

To determine if the difference is the ad period or the extra seek, I started playback toward the beginning of the final period, which is just over 7 minutes long. The start position was 1145.

The content played successfully for 30 seconds from 1145, then the test seeked to 1490 (15 seconds from the end). From there, it stalled again, without ever touching an ad period or any second content period.

(Of course, I'm assuming it didn't touch an ad period, based on the timestamps I fed it, but I haven't verified by watching the URLs it fetched.)

I tried this all with and without the stall detector. I also tried making the stall detector less "spammy" by having it wait for 5 seconds of stall instead of 1, and I tried making it more forceful by seeking 1.0 seconds instead of 0.1 when it detects a stall. None of that made a difference. Seeking by the stall detector was effectively ignored by the platform, and the playhead stayed where it was.

joeyparrish commented 4 years ago

If I play the content from 1425, and let it play until the end instead of seeking after 30s, it plays successfully, right through the region where it stalls in other test runs. Is it seeking that's the issue? Or the way we clear the buffers when we seek?

If I put return; at the top of seeked() in StreamingEngine, this prevents us from clearing the buffer when we seek. (Side-note: this triggered a null pointer exception in the compiler. No time to put together a report to google/closure-compiler right now, so I threw in --no-build on the test.py command-line.)

Playing again from 1145 again for 30s, then seeking to 1490 without clearing the buffers, the results are the same. But the buffered ranges show that the buffers were cleared anyway. So I had to comment out the code that calls mediaSourceEngine.remove to clear the buffer behind us as we play. This results in nothing new being buffered at all, which makes no sense. The old buffered range is there, but no new buffered range has been added.

Logs without clearing the buffer:

ERROR LOG: 'Timeout waiting for event ended', 'current time', 1171.044, 'duration', 1505.644, 'ready state', 4, 'playback rate', 1, 'paused', false, 'buffered', [Object{start: 1141.094, end: 1183.077}]
LOG LOG: 'Waiting for movement from 1490.644 to 1491.644'
WARN LOG: 'No variants met the ABR restrictions. Choosing a variant by lowest bandwidth.'
LOG LOG: 'Calling switch_(), bandwidth=14841 kbps'
LOG LOG: 'switch_'
WARN LOG: 'No variants met the ABR restrictions. Choosing a variant by lowest bandwidth.'
LOG LOG: 'Calling switch_(), bandwidth=4842 kbps'
LOG LOG: 'switch_'
WARN LOG: 'No variants met the ABR restrictions. Choosing a variant by lowest bandwidth.'
LOG LOG: 'Calling switch_(), bandwidth=5252 kbps'
LOG LOG: 'switch_'
ERROR LOG: 'Timeout waiting for movement from 1490.644 to 1491.644', 'current time', 1490.644, 'duration', 1505.644, 'ready state', 1, 'playback rate', 0, 'paused', false, 'buffered', [Object{start: 1141.094, end: 1393.077}]
INFO LOG: 'Preparing to destroy walker...'
joeyparrish commented 4 years ago

I'll resume this on Monday.

joeyparrish commented 4 years ago

I've verified that when we start playback at 1145, in the final period, the only init segments we fetch are from CBS, not from the ads. This confirms that when we seek to 1490 in the test, we haven't done anything that would count as a "switch", and we haven't transitioned quietly in or out of this period.

This leads me to believe that this problem might still be reproducible without ads. Another possibility is that some IMA-driven segment-splicing is a key component of the issue, and some aspect of the re-encoding process triggers a latent bug in Tizen.

I tested the theory that the Tizen bug might be that clearing the buffer somehow loses important info from the init segment. I modified StreamingEngine to make sure that any time the buffer is cleared, we mark the media state as needing another init segment. This makes Shaka re-fetch and re-append the init segment after seek, but it doesn't change the results at all.

I then tested the theory that the spot we seek to in the test is somehow special, but it doesn't seem to be. Instead of seeking to duration - 15 (about 1490), I tried seeking ahead by 5s or 15s. We only buffer ahead 10s in this test, so that means a buffered seek and an unbuffered seek. The buffered seek was successful, and seems to take about 1 second to complete. The unbuffered seek hung playback until the test timed out (20s).

joeyparrish commented 4 years ago

We maybe can't count IMA segment splicing among the triggers, since that would (I think) only affect the segments on the edge of an ad, and I have been testing playback starting after that point.

I tried modifying the test so that after playback hangs, we seek to time 5 in the first ad, which is unencrypted. This actually recovers playback somehow, even though we started playback in encrypted main content. After that, playback continues happily into the first content period, which is encrypted again. I made sure it goes well past any potential clear lead-in, and plays all the way into the next ad period at time 60.

I don't know what to make of this. I haven't been able to intuit from all of this information any underlying model of what's happening in the device or how to work around it.

joeyparrish commented 4 years ago

The behavior is the same with both key systems. I tried restricting it to PlayReady, and restricting it to Widevine. No difference.

Since seeking to clear seems to un-hang the system, I tried patching in a clear ad's init segment after seeking. Then it would append the correct init segment, then media segments. No change in behavior.

dsparacio commented 4 years ago

I tested the theory that the Tizen bug might be that clearing the buffer somehow loses important info from the init segment. ....... This makes Shaka re-fetch and re-append the init segment after seek, but it doesn't change the results at all.

I was wondering about this, in relation to what Dash.js does...Bummer that it does not solve.

I tried modifying the test so that after playback hangs, we seek to time 5 in the first ad, which is unencrypted. This actually recovers playback somehow, even though we started playback in encrypted main content.

We saw same type of thing, bit different. If you start at non zero in first or any encrytped content period , then seek back into pre-roll, the stream exits the first unencrypted period with no issue.

piyush2dev commented 4 years ago

@joeyparrish I tried playing DASH+WIDEVINE Content using shaka 3.0.1, 2.5.4 but I am getting this Error mentioned below . The Video is stalled. Can you please help me how to. resolve. it.

I am using Tizen 4 TV. (UA43NU7100)

Please help.

(audio:9) startup complete streaming_engine.js:1146 (all) setting up Period 0 streaming_engine.js:1140 (all) Period 0 is being or has been set up streaming_engine.js:1214 (all) Stream 9 is being or has been set up streamingengine.js:1214 (all) Stream 5 is being or has been set up player.js:4580 canSwitch simple_abrmanager.js:254 Calling switch(), bandwidth=9742 kbps player.js:4644 switch_ streaming_engine.js:803 switch: switching to Stream (video:1) streaming_engine.js:786 switch: Stream (audio:9) already active streaming_engine.js:1611 (video:1) looking up segment: presentationTime=3.7916666666666665 currentPeriod.startTime=0 playhead.js:504 Stall detected at 0 for 1.002000093460083 seconds. Seeking forward 0.1 seconds. streaming_engine.js:960 (all): seeked: buffered seek: presentationTime=0.1

I am attaching Assets details:

mpd URL -> https://ahatvhlsns.akamaized.net/aha-tv/Renditions/20200103/Pencil/dash/Pencil.mpd?hdnts=exp=1597944905~acl=/aha-tv/Renditions/20200103/Pencil/dash/*~hmac=df4ca0b211046cdb4701935f24cc8cb33ed8d5fa784a7e40d684741bc26b4a3d

License Token (CustomData)(X-AxDRM-Message) -> "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ2ZXJzaW9uIjoxLCJjb21fa2V5X2lkIjoiZDQ5YTEwZjktZmZjMy00ODhjLWI2NGEtYWIwNjAwYTE1NmJkIiwibWVzc2FnZSI6eyJ0eXBlIjoiZW50aXRsZW1lbnRfbWVzc2FnZSIsInZlcnNpb24iOjIsImxpY2Vuc2UiOnsic3RhcnRfZGF0ZXRpbWUiOiIyMDIwLTA3LTIwVDE3OjM1OjA1LjE2N1oiLCJleHBpcmF0aW9uX2RhdGV0aW1lIjoiMjAyMC0wNy0yN1QxNzozNTowNS4xNjdaIn0sImNvbnRlbnRfa2V5c19zb3VyY2UiOnsiaW5saW5lIjpbeyJpZCI6IjA2MDk2YjIyLTE1YTYtNDYzZi05NTFiLTZlNjg3Yjg3YzBjZiIsInVzYWdlX3BvbGljeSI6IlBvbGljeSBBIn1dfSwiY29udGVudF9rZXlfdXNhZ2VfcG9saWNpZXMiOlt7Im5hbWUiOiJQb2xpY3kgQSIsInBsYXlyZWFkeSI6eyJtaW5fZGV2aWNlX3NlY3VyaXR5X2xldmVsIjoyMDAwLCJwbGF5X2VuYWJsZXJzIjpbIjc4NjYyN0Q4LUMyQTYtNDRCRS04Rjg4LTA4QUUyNTVCMDFBNyJdfSwid2lkZXZpbmUiOnt9fV19fQ.weMvPHsaNHkC6iU2hgjKlZTaRPmRFDsoVKtw0v_P5ug"

License Url --> "https://0d45cc6b-drm-widevine-licensing.axprod.net/AcquireLicense"

@joeyparrish Above. asset has expiry for 1 week.

joeyparrish commented 4 years ago

@piyush2dev, please follow along for updates as we work on this issue. I'm not using your content specifically at the moment, but you may be experiencing the same issue as @dsparacio. As soon as we have a fix or tentative workaround for some part of the problem, we will retest everything and include your content in that.

I've been focused until now on the content from @dsparacio, but I now have reproduced the Tizen stalling behavior on clear, single-period content created by us for the demo. This eliminates IMA and encryption from the picture, which means the problem space just got simpler. The exact behavior I'm looking at (stall on seek in our automated asset test) may or may not turn out to be the same as what's reported by @dsparacio and @piyush2dev, but I'm hopeful that a fix for this will be relevant for both of you.

The content I'm using now is https://storage.googleapis.com/shaka-demo-assets/sintel/dash.mpd

dsparacio commented 4 years ago

@joeyparrish will check this asset in our player on failing TVs just to get more eyes on it. That is good news that we are in a more simple issue space.

joeyparrish commented 4 years ago

After seeking, we get bumped back to a keyframe (or segment start, not sure yet) by the platform. The time it takes for a successful buffered seek to complete is about 1.5s + distance to the segment start, implying that the platform is playing out the intervening content at 1x before the seek can complete. This would suggest that the stall detector is a terrible idea on such a platform, since it tries to "fix" a stall by seeking. But this doesn't fix anything at the moment. Even after disabling the problematic stall detector, the stall in playback never resolves itself.

joeyparrish commented 4 years ago

More things that don't help:

  1. Abort the SourceBuffers and reset all parameters after a seek
  2. Reappend the init segment after a seek
  3. Start appending media again from 0 after a seek
piyush2dev commented 4 years ago

@joeyparrish @dsparacio @avelad Thanks for your quick response. I have been trying. all the possible to play above asset shared but I failed. Is there a temporary way if you can. suggest that could play my above asset shared on a temporary basis.

Any help would be appreciated.

Sharing Asset detail again:

1 WEEK EXPIRY

mpd URL -> https://ahatvhlsns.akamaized.net/aha-tv/Renditions/20200103/Pencil/dash/Pencil.mpd?hdnts=exp=1597944905~acl=/aha-tv/Renditions/20200103/Pencil/dash/*~hmac=df4ca0b211046cdb4701935f24cc8cb33ed8d5fa784a7e40d684741bc26b4a3d

License Token (CustomData)(X-AxDRM-Message) -> "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ2ZXJzaW9uIjoxLCJjb21fa2V5X2lkIjoiZDQ5YTEwZjktZmZjMy00ODhjLWI2NGEtYWIwNjAwYTE1NmJkIiwibWVzc2FnZSI6eyJ0eXBlIjoiZW50aXRsZW1lbnRfbWVzc2FnZSIsInZlcnNpb24iOjIsImxpY2Vuc2UiOnsic3RhcnRfZGF0ZXRpbWUiOiIyMDIwLTA3LTIwVDE3OjM1OjA1LjE2N1oiLCJleHBpcmF0aW9uX2RhdGV0aW1lIjoiMjAyMC0wNy0yN1QxNzozNTowNS4xNjdaIn0sImNvbnRlbnRfa2V5c19zb3VyY2UiOnsiaW5saW5lIjpbeyJpZCI6IjA2MDk2YjIyLTE1YTYtNDYzZi05NTFiLTZlNjg3Yjg3YzBjZiIsInVzYWdlX3BvbGljeSI6IlBvbGljeSBBIn1dfSwiY29udGVudF9rZXlfdXNhZ2VfcG9saWNpZXMiOlt7Im5hbWUiOiJQb2xpY3kgQSIsInBsYXlyZWFkeSI6eyJtaW5fZGV2aWNlX3NlY3VyaXR5X2xldmVsIjoyMDAwLCJwbGF5X2VuYWJsZXJzIjpbIjc4NjYyN0Q4LUMyQTYtNDRCRS04Rjg4LTA4QUUyNTVCMDFBNyJdfSwid2lkZXZpbmUiOnt9fV19fQ.weMvPHsaNHkC6iU2hgjKlZTaRPmRFDsoVKtw0v_P5ug"

License Url --> "https://0d45cc6b-drm-widevine-licensing.axprod.net/AcquireLicense"

@vaage Is there a sample code you have how I can play it in maybe AVPlayer Tizen.

joeyparrish commented 4 years ago

@piyush2dev, we don't have a solution or workaround yet. There's nothing for you to do at this time. You can stop posting URLs and tagging people. Please be patient. As far as we can tell, this is caused by bugs in Tizen itself. We are trying to understand the bug and develop a workaround for it.

joeyparrish commented 4 years ago

Notes from the day's investigations. TL;DR: Good news, everyone! I may have a workaround!


I changed the test as follows:

  1. Play 30 seconds of unencrypted, single-period DASH content with no ads
  2. Seek forward to 15 seconds past the end of the buffered range (rather than close to the end, to keep the numbers simple) and round up to a segment boundary (since the platform seems to jump backward to a keyframe anyway)
  3. Don't clear the buffer on an unbuffered seek, and don't reset StreamingEngine internal state
  4. Make the stall detector sit around for 8 seconds before trying to seek (2 segment durations)
  5. Disable audio (to simplify buffering/streaming/decoding)

Playback still stalls out after the seek.

We buffer up to time 44 while playing to time 31. Then we seek to time 60.

Since we didn't clear the buffer or reset StreamingEngine state, it kept streaming linearly. Effectively, the playhead seeked, but not StreamingEngine. This means streaming and buffering continued without interruption from 0 to 72, while the stall occurred at time 60. It remained stalled out for 20 seconds before the test timed out.

Looking for possible recovery steps that could shed light on the hang, I added one more step to the test. After the stall, I seeked back to time 24, a segment boundary and a time that not only was buffered, but had already been played. It remained stalled out at 24 for 20 seconds before the test timed out.

I noticed that the timing of the "appended media segment" messages in the logs speeds up significantly after the seek. Not sure what that means. Maybe they aren't being decoded because they are behind the playhead?

Next I tried recovering by calling pause and play, with 1 second delays before each of them. This worked!

Next I tried making the test call pause & play immediately after seeking, rather than waiting for the stall/timeout, still with 1-second delays. This worked, too!

Next I tried calling play only, not pause, 1 second after seeking. This doesn't work, which makes some kind of sense, given that the state of the video element was paused: false at that time.

sequence result
seek, delay, pause, delay, play OK
seek, delay, play FAIL
seek, pause, delay, play OK
seek, pause, play FAIL

So the delay seems to be important. It's not yet clear how large a delay is necessary between pause and play. (It may be a thing that we have to poll state on before calling play.) I also have lots of other hacks in place, which now need to be unwound one by one to determine if any of them are helping, or if pause/play is doing the job all on its own. Finally, if this turns out to be the big winner, it's not clear when this workaround should be called. (On seek, on stall, just on Tizen, on all TVs, on all platforms, etc)

Calling video.pause(), then polling state on video.paused to call video.play()... sadly doesn't work. After calling video.pause(), video.paused is immediately true, which follows from the HTML video spec. But some hidden state deeper within Tizen does not seem to update immediately, so some hard-coded delay will be necessary. More work will be required to determine what that is.

delay result
1s OK
0s FAIL
500ms FAIL
800ms OK

After this, I noticed that one difference between the successful 800ms and the failing 500ms was the timing of appending the segments. At 500ms, the content at the target time was not yet buffered. At 800ms, it was buffered. So the workaround really only seems to help after content is appended to the position of the playhead. Perhaps instead of applying this workaround on seek, it should be applied by the stall detector, which only fires when the playhead is buffered. It could be a more effective stall recovery mechanism than seeking, particularly on TV platforms (which seem to need stall detection more than desktops anyway).


I reverted all of my other hacks, and I patched the stall detector so that when streaming.stallSkip is configured to 0, we call pause & play (with no delay) instead of seeking. Then I changed the default config to 0. The simple, unencrypted DASH content now passes the automated test!

Next steps:

  1. Test encrypted demo content on Tizen with the fix.
  2. Test content from @dsparacio and @piyush2dev on Tizen with the fix.
  3. Test all demo content on Tizen with and without the improved stall detector to prove that it is truly an improvement in all cases.
  4. Get the fix into code review.
  5. Ask @dsparacio and @piyush2dev to test the fix in their own applications.
joeyparrish commented 4 years ago

Next steps:

  1. Test encrypted demo content on Tizen with the fix.

One particular piece of our own encrypted demo content (sintel-widevine) fails with a decode error. Further digging reveals that this happens on some of our encrypted VP9 demo content on Tizen. Our device doesn't support opus, but does support VP9. For encrypted multicodec content, we end up selecting VP9 (lower bandwidth) + AAC (only supported audio codec in the content), and then it fails to decode.

Newer encrypted vp9 content works (such as angel-one-widevine), and h264-only content works (such as angel-one-widevine-hls), so I presume this is actually bad content and that a full re-encode of our demo content with current tools would fix this. Filed https://github.com/google/shaka-player/issues/2746

  1. Test content from @dsparacio and @piyush2dev on Tizen with the fix.

Content from @dsparacio is working with Widevine! Content from @dsparacio is working with PlayReady! Content from @piyush2dev is working!

  1. Test all demo content on Tizen with and without the improved stall detector to prove that it is truly an improvement in all cases.

I found some content which still does not play. The workaround is part of the stall detector, but the stall detector doesn't fire in the case of https://storage.googleapis.com/shaka-demo-assets/tos-ttml/dash.mpd . In that case, I see a bad interaction between the gap jumping controller and the platform's "jump back to key frame" behavior. We seek to 719.167, the platform jumps back to 716.002, but the buffered range starts 8 ms later. Since we're not in a buffered range (technically), the gap jumping controller inhibits the stall detector. (Long term, I'm starting to think we need to rewrite or maybe delete the gap jumping controller. It's a mess, and it's not clear that it's doing the right thing in many cases.)

Since this workaround is so critical for Tizen generally, a short-term solution might be to disentangle gap-jumping and stall detection. StallDetector already looks at buffered ranges (with fudge factors), so why should GapJumpingController be in charge of when to look for stalls? It seems like this was mainly to prevent the two from taking conflicting actions. But this could be easily solved by doing stall detection first, then allowing gap-jumping logic to proceed only when no action was taken by StallDetector.

This additional change (stall detector first, which suppresses gap-jumping, rather than vice-versa) allows the workaround to take effect on the "tos-ttml" content, which can then be played through the automated test.

dsparacio commented 4 years ago

@joeyparrish all of this is amazing we thank you for you extremely detailed information and hard work! We will test when you are ready. One note, we are on 2.5.11 now. We will upgrade to the version you need us to for this fix.


Just a side note... I just put in a patch in 2.5.11 for DRM Endianness in drm system for comcast Flex devices. LIke other tvs and platforms you filter out.... this is also needed.

@@ -1487,7 +1487,8 @@ shaka.media.DrmEngine.prototype.onKeyStatusesChange_ = function(event) {
         !shaka.util.Platform.isTizen() &&
         !shaka.util.Platform.isVideoFutur() &&
         !shaka.util.Platform.isWebOS() &&
-        !shaka.util.Platform.isTiVo()) {
+        !shaka.util.Platform.isTiVo() &&
+        !shaka.util.Platform.isWPE()) {
       // Read out some fields in little-endian:
       let dataView = new DataView(keyId);

We added simple check for comcast platform and then !shaka.util.Platform.isWPE(). in this statment. Seems to resolve our issue. We were getting manifest errors for keyID not matching what is in content due to the convert from big to little - Testing this now and will confirm with you in a new ticket etc. Maybe we can PR the fix?

joeyparrish commented 4 years ago

@dsparacio, sounds good to me. I'm cleaning up additional Tizen-related issues in our tests (test coverage for this platform was down until recently) to make sure my fix for the stall doesn't have any unintended side-effects, and then I should have everything in code review soon. So far, it looks like the right fix.

joeyparrish commented 4 years ago

Next steps:

  1. Test encrypted demo content on Tizen with the fix.

Done

  1. Test content from @dsparacio and @piyush2dev on Tizen with the fix.

Done

  1. Test all demo content on Tizen with and without the improved stall detector to prove that it is truly an improvement in all cases.

Done

  1. Get the fix into code review.

In review now, along with a few other Tizen-related fixes which will reference this issue

  1. Ask @dsparacio and @piyush2dev to test the fix in their own applications.

Coming soon!

joeyparrish commented 4 years ago
  1. Ask @dsparacio and @piyush2dev to test the fix in their own applications.

The fix is out in the master branch. Please test in your applications!

These fixes can be backported to v3.0.x and potentially to v2.5.x as well.

dsparacio commented 4 years ago

@joeyparrish !! Very nice, we will test this out first thing in morning then figure out which version etc. I have a fork for the other issue going so Ill review the PR and see if I can do a port etc. We are looking to upgrade to 3.0 as well. Keep you informed on this.

dsparacio commented 4 years ago

At first round of testing resolved on my system 100%. New issue on other model mentioned in ticket creation, but we are debugging this today to see what it really is. 500 errors on irdeto side so not a shaka issue. We just threw master at 3.1 into our player that was on 2.5.11 with no attention to details so maybe that is the issue or something else. We are slammed with another device issue so we will get info back to you EOD, but this seems to resolve it @joeyparrish thank you so much!

joeyparrish commented 4 years ago

Happy to help!

dsparacio commented 4 years ago

@joeyparrish :( So It is fixed on our models regarding Samsung except 2017 UN50MU6070 Fails 100% still.

However, XBOX it is failing 100% on all boxes. We tested both master version and 2.5.10, both fail exactly the same way as samsung did before you made changes.

Platform Info XBOX Win/10 Spartan (2014–2019 Edge) non chromium build of edge which is what XBOX dev offers at this time.

I am starting to gather logs for XBOX. I am not sure the one Samsung failure is our main focus at this point. Get back to here soon.

I tried to apply some of your Tizen fixs I saw in master for Edge as well to just test on XBOX. For instance this block

 // on one of these platforms.  Instead, default stallSkip to 0 to force the
    // stall detector to pause and play instead.
    if (shaka.util.Platform.isWebOS() ||
        shaka.util.Platform.isTizen() ||
        shaka.util.Platform.isEdge() ||
        shaka.util.Platform.isChromecast()) {
      streaming.stallSkip = 0;
    }

and the Endianness changes applied.... All of which made no difference for XBOX.

joeyparrish commented 4 years ago

So It is fixed on our models regarding Samsung except 2017 UN50MU6070 Fails 100% still. ... We tested both master version and 2.5.10 ...

I'm not sure if you made a typo in that version number, but please make sure to test with v2.5.14, or v3.0.2, which have the fix. v2.5.10 does not have the fix. Thanks!

Also, I just checked, and our Tizen device is model UN40MU6300. Hopefully this is representative enough of your UN50MU6070, which appears to be the same family and year.

As for XBox, we have an XBox One in our test lab, but we have no remote access to it at this time. Getting remote access to Tizen took some time, but I was essentially just revamping and repairing existing remote testing infrastructure for Tizen.

We have never had remote access to XB1, unfortunately. I have an idea for how it might work, but I'll have to be physically in the office again for an extended period to get started on that, and I don't think I will get to work on it any time soon.

If v2.5.14 doesn't work on XBox (please double-check!), is there an older version of Shaka Player that plays this content on XB1? Perhaps the previous release branch, such as v2.4.7? If so, we may be able to narrow down the cause somewhat without having to build the testing infrastructure first.

dsparacio commented 4 years ago

@joeyparrish sorry not clear, what I mean to say is we tested XB1 with that same master branch built locally as instructed, with the Tizen changes, to see if it helped with XBOX.

We also tested on XBOX with our unaltered production version where we first found this issue 20 days ago.
https://github.com/google/shaka-player/issues/2620#issuecomment-655829566

The reason I tested our current prod version again, is I added an Endianness change (mentioned the other day) that I added for comcast to see if it helped.

If v2.5.14 doesn't work on XBox (please double-check!), is there an older version of Shaka Player that plays this content on XB1? Perhaps the previous release branch, such as v2.4.7? If so, we may be able to narrow down the cause somewhat without having to build the testing infrastructure first.

We will certainly run more test with 2.5.14 version on XBOX and we can certainly do more testing with older version to help find if it works ther. I will get back to you on all of that.

In the meanwhile, should we re-open this ticket for XBOX or should I create a new one?

joeyparrish commented 4 years ago

Because most of the comments on this issue are focused on Tizen, and because we may not be able to work on XBox at this time, I think opening a separate ticket for XBox would make sense. It may help us keep track of things that way.

joeyparrish commented 4 years ago

But please let us know if v2.5.14 works on your 2017 Tizen device. It was playing in our tests of our own 2017 device.

dsparacio commented 4 years ago

Testing 2.5.14 on Samsung 2017 model and on XBOX for fun as well.

Then we will do XBOX on versions back to 2.4.7, 2.3.9, 2.2.10 and maybe 2.0.1.... all make sense. Let me know if going lower than 2.4.7 does not make sense in your mind for XB1?

I will now great a new ticket with all the XBOX info and from this ticket and add the results for XB1 in that ticket. I will add the result to Samsung 2.5.14 testing in this ticket.

Thanks for your help.

joeyparrish commented 4 years ago

If 2.3 or under is working on XB1, that's good to know. Finding the relevant change that far back will be much more challenging, but I'd rather know than not know. More information can't hurt, especially when gathering it firsthand is not immediately possible. Thanks!