w3c / encrypted-media

Encrypted Media Extensions
https://w3c.github.io/encrypted-media/
Other
180 stars 80 forks source link

Handling hardware context reset #473

Closed xhwang-chromium closed 3 years ago

xhwang-chromium commented 4 years ago

Problem

During a playback with hardware-backed media pipeline and CDM, if the device enters sleep or hibernate mode and then resumes, or in some cases when user switches monitors, the hardware context could be completely lost. In this case the playback cannot continue.

On different systems with different media pipeline and CDM implementations, the observable behavior to the JavaScript player could be very different. It could be a MediaError on the media element with some platform-specific MediaError.message, a zombie MediaKeySession on which all subsequent EME calls rejected, a closed MediaKeySession, or some combination of them.

This makes it hard for the JavaScript player to react in an interoperable way to provide a good user experience.

Proposal

Notes

The current EME spec already mentioned that the CDM can close a session at any point:

The CDM may close a session at any point, such as when the session is no longer needed or when system resources are lost. In that case, the Monitor for CDM Changes algorithm detects the change and runs this algorithm.

However, this is only in a NOTE which is often overlooked by CDM or JavaScript developers. We should explicitly document the logic around spontaneous session close and hardware context reset, e.g., via a dedicated section and/or with an example.

gregwfreedman commented 4 years ago

the current description for the closed attribute is:

Signals when the object becomes closed as a result of the Session Closed algorithm being run. This promise can only be fulfilled and is never rejected.

not only is there no mention here that the session can be closed by the CDM (that note is in the Session Closed section), but the name closed doesn't even suggest that this attribute is the only way to detect a fatal error (i.e. that the session was closed by CDM because "system resources are lost").

at a minimum, i think we should update the description, but i also support changing the return type to Promise<Reason>. or even adding an error event at the session level. if we are going to add a Reason, can CDM implementors help enumerate other reasons a session would be closed unexpectedly?

mounirlamouri commented 4 years ago

Adding to agenda to get opinions from the group.

joeyparrish commented 4 years ago

I'm in favor of this proposal, as well as clarifying the note in the spec and making it more clear that the CDM can close its own sessions without action being taken by the application.

Shaka Player would adopt the new Reason enum if implemented.

In addition to HARDWARE_CONTEXT_RESET, I believe a status like CLOSED_BY_APPLICATION would make sense if the application called close() on the session.

joeyparrish commented 3 years ago

Discussed again in the working group. We are ready to proceed with a PR to implement this proposal.

xhwang-chromium commented 3 years ago

To fill more details before we upload the PR:

enum MediaKeysSessionClosedReason {
    "unknown",
    "closed-by-application",
    "cdm-unavailable",
    "hardware-context-reset",
    "resource-evicted",
};

Notes:

Let me know if you have any thoughts or naming suggestions!

joeyparrish commented 3 years ago

Just discussed this with @xhwang-chromium. We decided that the "CDM unvailable" algorithm referred to above is not itself a reason, so we will strike cdm-unavailable from the list of reasons in the enum. The "CDM unavailable" algorithm may be updated to refer to the list of reasons and specify that one of them be associated with the sessions closed by the algorithm. The unknown enum value can stand in for anything not currently covered in the enum.

gregwfreedman commented 3 years ago

If the CDM crashes then perhaps cdm-error might be a little more informative than unknown. If there are any other reasons not covered, then unknown could still be used.

joeyparrish commented 3 years ago

@gregwfreedman, do you have any suggestion for how to describe the conditions for cdm-error in the spec?

xhwang-chromium commented 3 years ago

The cdm-error case @gregwfreedman described seems to be similar to cdm-unavailable proposed above. The current spec has texts like this:

If cdm is no longer usable for any reason then run the CDM Unavailable algorithm on media keys.

I would assume this covers the CDM crash case @gregwfreedman described. So "5.1.2 CDM Unavailable" could be updated to something like:

For each MediaKeySession created by the media keys that is not closed, queue a task to run the Session Closed algorithm on the session with reason cdm-unavailable.

Note that I am using cdm-unavailable for the sake of consistency with the existing spec. If we choose to use a different term, e.g. "CDM error", then we probably should update the algorithm described in 5.1.2 as well.

Thoughts?

joeyparrish commented 3 years ago

I think I agree with @xhwang-chromium's reasoning on this. A reason of cdm-unavailable when sessions are closed from the "CDM Unavailable" algorithm makes sense to me. If we want to change the name to cdm-error as suggested by @gregwfreedman, and use it in the same way, that's also fine with me. Both seem like reasonable names.

joeyparrish commented 3 years ago

As I worked on the PR, I found the I needed one more enum value for the case where a session is closed by acknowledgement of a record of license destruction (AKA license release).

gregwfreedman commented 3 years ago

I would be happy with either cdm-unavailable or cdm-error, although I think the latter is more intuitive. I also agree that a reason is needed for license release.

MediaKeyStatus enum has existing values to express similar concepts: internal-error & released. For consistency does it make sense to use the same names, or would that create more confusion?

gregwfreedman commented 3 years ago

As it stands, hardware-context-reset covers two distinctly different scenarios.

  1. Device hibernation
  2. Display configuration change

I think that an application would potentially handle those two cases differently. For device hibernation, an application might wait for user action to restart playback. For display configuration change (i.e. user dragging video to another display), an application might want to restart playback without. For this reason, I think we should consider two distinct values. Perhaps hardware-context-lost & hardware-context-changed.

@joeyparrish @xhwang-chromium what do you think?

xhwang-chromium commented 3 years ago

I see your point. But at least on Windows, we only get one HRESULT from Widows OS for both cases, and it'll be very complicated (if possible) if we try to differentiate them in Chromium. Given the assumption that display configuration change won't be a common use case, I'd suggest we stay with one enum value.

joeyparrish commented 3 years ago

I think we should consider two distinct values. Perhaps hardware-context-lost & hardware-context-changed

If all usage of hardware-context-reset requires new sessions to be created to continue playback, then I don't think it is necessarily the case that we should split it into -lost and -changed. And since there is not a reasonable way to differentiate in implementations, I suggest leave it as a single value.

If you want to rename reset to lost or something, I don't have a strong feeling about the difference between them.

MediaKeyStatus enum has existing values to express similar concepts: internal-error & released. For consistency does it make sense to use the same names, or would that create more confusion?

I'm not sure. Let's work out some details on cdm-unavailable / cdm-error first and come back to this question.

I would be happy with either cdm-unavailable or cdm-error, although I think the latter is more intuitive. I also agree that a reason is needed for license release.

cdm-error feels very much like the internal-error key status to me, which is to say that it's extremely vague. I think the proposal of the value unknown covers that same conceptual ground.

Looking again at the reasoning for the "CDM unavailable" algorithm and the corresponding cdm-unavailable enum value, it seems that it might not be the most sensible. For example, what are the practical cases in which a CDM becomes unavailable? Is a hardware context reset one of them? If so, specifying the use of cdm-unavailable would be wrong, and would leave no room for hardware-context-reset. Should the "CDM unavailable" algorithm allow implementations to decide what enum value to use based on unspecified internals?

@xhwang-chromium, I would appreciate your perspective on these questions as both an implementer and the person who requested this feature. :-)

joeyparrish commented 3 years ago

One more question: Should the "CDM unavailable" algorithm be used only for unrecoverable cases, and therefore be separate from hardware-context-reset? Should the corresponding cdm-unavailable / cdm-error reason be specified as such, where no further sessions can be created by the application?

xhwang-chromium commented 3 years ago

Here's my understanding:

One more question: Should the "CDM unavailable" algorithm be used only for unrecoverable cases, and therefore be separate from hardware-context-reset? Should the corresponding cdm-unavailable / cdm-error reason be specified as such, where no further sessions can be created by the application?

This sounds correct to me.

xhwang-chromium commented 3 years ago

One note on backward compatibility. This change should work in most cases. However, there are corner cases this could cause issue. For example, the follow code (usually seen in tests) would fail:

session.closed.then(function(result) {
  assert_equals(result, undefined);
})

I am not sure whether this is a concern in practice. I'll consult Blink owners on this topic. Please let me know if you have any thoughts or comments.

joeyparrish commented 3 years ago

Clearly the tests would need to be updated to match the new implementation/spec. It seems silly to verify that closed resolved with undefined in the first place, though. A better test would have been proving that it was resolved at all. (There is still an open bug in Chrome where that sometimes doesn't happen.) In general, any meaningful assertion that is checked on a callback needs to be backed up by a check that the callback itself actually occurred.

xhwang-chromium commented 3 years ago

Yeah, I am not worrying about the test. I just use it as an example how we can "break" sites. I double checked with Blink owners and this change is safe enough to proceed.

@joeyparrish @gregwfreedman Does it make sense for us to schedule a meeting to resolve the remaining issues? I am almost done with the implementation, and I'd like to have the right enums in the code.

gregwfreedman commented 3 years ago

I see two reasons for changing the closed attribute from Promise<void> to Promise<MediaKeySessionClosedReason>:

  1. The MediaKeySessionClosedReason is actionable by the application
  2. Telemetry for applications and implementations

@joeyparrish @xhwang-chromium With respect to hardware-context-resent, I don't think Microsoft's choice to use the same HRESULT for both cases should be the driving this decision. What is your expectation when a device resumes from hibernation/sleep? If video was playing, we teardown our player and just put play button on the last frame rendered and wait for user to resume. For moving a video to another display, this is a terrible user experience. While the latter is probably very unlikely, we have no telemetry to confirm this.

cdm-unavailable: This is to match the current spec around "CDM unavailable" algorithm which only runs when "the CDM is no longer usable". So I expect this to cover unrecoverable (or hard to recover, or better not try to recover) errors, e.g. CDM crash. When this is returned, the CDM is in a non-usable state, e.g. generateRequest() on a new session would just fail.

I don't think we should choose cdm-unavailable because it happens to be the name of the algorithm in the spec. I propose we just use error and not use cdm-unavailable or unknown.

@xhwang-chromium Based on you comment, if the CDM crashes and all MediaKeySessions are closed, an application cannot restart/recover the CDM by calling createSession on an existing MediaKeys instance? Can an application recover by creating a new MediaKeys instance?

xhwang-chromium commented 3 years ago

I see two reasons for changing the closed attribute from Promise<void> to Promise<MediaKeySessionClosedReason>:

  1. The MediaKeySessionClosedReason is actionable by the application
  2. Telemetry for applications and implementations

@joeyparrish @xhwang-chromium With respect to hardware-context-resent, I don't think Microsoft's choice to use the same HRESULT for both cases should be the driving this decision. What is your expectation when a device resumes from hibernation/sleep? If video was playing, we teardown our player and just put play button on the last frame rendered and wait for user to resume.

Nit: You could put the play button there without tearing down the player. Then when user clicks the play button, you can just create a new session and do a license exchange to resume playback, which hopefully would be a faster. But of course that's totally up to you.

For moving a video to another display, this is a terrible user experience. While the latter is probably very unlikely, we have no telemetry to confirm this.

I agreed with your reasoning. FWIW, when user drags the video to another monitor the "hardware-context" is also reset/lost. So probably we should have display-configuration-changed to be specific, such that on platforms (e.g. Windows) that don't differentiate these two we can just return hardware-context-reset; on platforms that do we can return hardware-context-reset and display-configuration-changed in corresponding cases.

cdm-unavailable: This is to match the current spec around "CDM unavailable" algorithm which only runs when "the CDM is no longer usable". So I expect this to cover unrecoverable (or hard to recover, or better not try to recover) errors, e.g. CDM crash. When this is returned, the CDM is in a non-usable state, e.g. generateRequest() on a new session would just fail.

I don't think we should choose cdm-unavailable because it happens to be the name of the algorithm in the spec. I propose we just use error and not use cdm-unavailable or unknown.

I am open to other naming.

@xhwang-chromium Based on you comment, if the CDM crashes and all MediaKeySessions are closed, an application cannot restart/recover the CDM by calling createSession on an existing MediaKeys instance? Can an application recover by creating a new MediaKeys instance?

In the current implementation, setMediaKey() on the existing media element will likely fail. We can look into improving that. But at the least, the application can always recover by restarting the playback (including creating a new media element and the mediaKeys object).

joeyparrish commented 3 years ago

Notes from our meeting:

  1. Differentiating between reasons for hardware context changes (display config change vs hibernation)

    • We are not aware of a platform today on which we would be able to easily implement a display-configuration-change enum value. For now, we will just use the more general hardware-context-reset value to cover any cases where the hardware decrypt or decode context has been reset and the application can continue by creating new sessions.
  2. Naming: cdm-unavailable vs cdm-error for unrecoverable errors

    • There's no need to name the application-visible enum value after the spec algorithm, and "unavailable" might be confusing. The key thing to communicate to the app developer is that this is an unrecoverable error. So we want error in the name. For consistency with the similar key status, we will use internal-error.
  3. Do we need the unknown value?

    • Since an application couldn't assume that an "unknown" case was recoverable, we may as well lump these unknown cases into the internal-error value, which applications should treat as unrecoverable.
  4. Naming: persistent-license-release

    • released is the key status that occurs right after remove(). The session closure occurs after the release message has been acknowledged by the server. After some debate between released and release-acknowledged, we settled on the name release-acknowledged, so that the name implies the order of events.
joeyparrish commented 3 years ago

I was able to update PR #487 already with the changes we discussed. Please take a look!

xhwang-chromium commented 3 years ago

For completeness, I am putting the "Security and Privacy self-review" here:

2.1 What information might this feature expose to Web sites or other parties, and for what purposes is that exposure necessary?

Information about certain device state changes will be exposed indirectly to Web sites, e.g. session closed due to "hardware context reset", which could be caused by using setting the device to sleep/resume, or switching monitors. Sites will not be able to know the exact reason. This exposure is necessary for sites to provide the best user experience.

2.2 Do features in your specification expose the minimum amount of information necessary to enable their intended uses?

Yes. It only expose an enum summarizing the reason.

2.3 How do the features in your specification deal with personal information, personally-identifiable information (PII), or information derived from them?

No such info is exposed.

2.4 How do the features in your specification deal with sensitive information?

No sensitive information.

2.5 Do the features in your specification introduce new state for an origin that persists across browsing sessions?

No

2.6 Do the features in your specification expose information about the underlying platform to origins?

Currently "hardware context reset" only happens on Windows. So the site could guess it's an Windows OS if it happens.

2.7 Does this specification allow an origin to send data to the underlying platform?

No

2.8 Do features in this specification allow an origin access to sensors on a user’s device

No

2.9 What data do the features in this specification expose to an origin? Please also document what data is identical to data exposed by other features, in the same or different contexts.

It exposes an enum indicating the reason for EME session closures.

2.10 Do feautres in this specification enable new script execution/loading mechanisms?

No

2.11 Do features in this specification allow an origin to access other devices?

No

2.12 Do features in this specification allow an origin some measure of control over a user agent’s native UI?

No

2.13 What temporary identifiers do the feautures in this specification create or expose to the web?

No temporary identifiers

2.14 How does this specification distinguish between behavior in first-party and third-party contexts?

Not distinguished. But EME usage in general is controlled by permission policy.

2.15 How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode?

No difference.

2.16 Does this specification have both "Security Considerations" and "Privacy Considerations" sections?

EME spec does have these sections. This particular feature doesn't.

2.17 Do features in your specification enable origins to downgrade default security protections?

No

2.18 What should this questionnaire have asked?

N/A