w3c / sync-media-pub

Repository of the Synchronized Multimedia for Publications Community Group
http://w3c.github.io/sync-media-pub

Choose one: media object attributes or media fragments URL #30

Closed marisademeglio closed 3 years ago

marisademeglio commented 3 years ago

This applies to the new draft.

Right now, there are two ways of referring to a fragment of media:

  1. <audio src="file.mp3" clipBegin="0" clipEnd="30"/> vs
  2. <audio src="file.mp3#t=0,30"/>

Another example:

  1. <video src="vid.mp4" panZoom="0,0,300,200" clipBegin="0" clipEnd="30"/> vs
  2. <video src="vid.mp4#xywh=0,0,300,200&t=0,30"/>

Aside: in that last example, to specify a pan-and-zoom transition to the spatial clip, you could add a parameter, since the fragment syntax itself doesn't convey this:

<video src="vid.mp4#xywh=0,0,300,200&t=0,30">
    <param name="transition" value="panZoom"/>
</video>

It would be better to have one way of doing things, so we should choose either media object attributes or media fragments.

Let's list the pros and cons of each approach here.

marisademeglio commented 3 years ago

PRO Media fragments:

CON Media fragments:

marisademeglio commented 3 years ago

PRO Media object attributes:

Easier parsing (https://lists.w3.org/Archives/Public/public-sync-media-pub/2020Oct/0001.html)

murata2makoto commented 3 years ago

I prefer media fragment URLs. Yes, application programs have to parse media fragments.

marisademeglio commented 3 years ago

@murata2makoto why do you think media fragments would be better for us? I want to keep track of the pros and cons.

murata2makoto commented 3 years ago

My two cents.

Media Fragments are defined by a W3C recommendation. It is a building block for handling audio and video resources. It appears to provide enough features for SyncMedia. If more and more applications start to use media fragments, users will have to learn only one syntax.

If we invent our own syntax, parsing might become slightly easier. But other applications that handle audio and video resources are unlikely to use our syntax.

marisademeglio commented 3 years ago

I like the idea of using a standardized syntax, and as far as I can tell, it covers our use cases. I don't mind making applications parse the values themselves; it doesn't look that difficult.

Here's a media fragment parser for JavaScript.
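
For a sense of what that parsing involves, here's a minimal sketch in TypeScript (simplified; a real parser like the one linked above also handles npt:, SMPTE, and clock formats, and open-ended ranges):

// Parse the temporal (t=) dimension of a media fragment, assuming plain
// seconds (e.g. "file.mp3#t=15,27"). The base URL is only needed to let
// the URL constructor accept relative references.
function parseTemporalFragment(url: string): { begin: number; end?: number } | null {
  const hash = new URL(url, "https://example.com/").hash;
  const match = hash.match(/[#&]t=([\d.]*)(?:,([\d.]+))?/);
  if (!match) return null;
  const begin = match[1] === "" ? 0 : Number(match[1]);
  const end = match[2] !== undefined ? Number(match[2]) : undefined;
  return { begin, end };
}

// parseTemporalFragment("file.mp3#t=15,27") => { begin: 15, end: 27 }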

I would like it to be possible to validate SyncMedia documents declaratively. Does anyone know of a media fragment validation schema for JSON or XML? Or any remotely similar use cases?

murata2makoto commented 3 years ago

@marisademeglio

I would like it to be possible to validate SyncMedia documents declaratively. Does anyone know of a media fragment validation schema for JSON or XML? Or any remotely similar use cases?

You might want to create a custom datatype library for RELAX NG. See https://relaxng.org/jclark/pluggable-datatypes.html
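
A datatype library along those lines plugs into the validator through a Java interface (per the page linked above); as a rough illustration of the check such a library would perform, here's a sketch restricted to the t= and xywh= dimensions:

// Hypothetical validity check for a "media fragment URL" datatype.
// Simplified: the real Media Fragments grammar also allows npt:/smpte
// prefixes, clock times, and track/id dimensions.
function isValidMediaFragment(value: string): boolean {
  const hashIndex = value.indexOf("#");
  if (hashIndex === -1) return true; // no fragment: nothing to check
  return value
    .slice(hashIndex + 1)
    .split("&")
    .every(
      (part) =>
        /^t=(?:[\d.]+)?(?:,[\d.]+)?$/.test(part) || // temporal dimension
        /^xywh=\d+,\d+,\d+,\d+$/.test(part)         // spatial dimension
    );
}

// isValidMediaFragment("file.mp3#t=0,30")    => true
// isValidMediaFragment("file.mp3#t=30,oops") => false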

nigelmegitt commented 3 years ago

Big +1 from me to option 1. See also the TTML2 <audio> element, which uses the same approach. Then use API calls to play the correct section as needed: for example, create the media element and set mediaElement.currentTime (MDN) to implement clipBegin.
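
A minimal sketch of that approach (assuming clipBegin/clipEnd are already available as seconds; note that timeupdate only fires a few times per second, so the stop point is approximate):

// Play one clip of a larger audio resource using only standard media APIs.
function playClip(src: string, clipBegin: number, clipEnd: number): HTMLAudioElement {
  const audio = new Audio(src); // one resource, fetched and cached once
  audio.addEventListener("loadedmetadata", () => {
    audio.currentTime = clipBegin; // seek once the media is ready
    audio.play();
  });
  audio.addEventListener("timeupdate", () => {
    if (audio.currentTime >= clipEnd) audio.pause(); // approximate stop
  });
  return audio;
}

// playClip("file.mp3", 15, 27); // plays roughly seconds 15-27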

murata2makoto commented 3 years ago

@nigelmegitt

If relevant WGs of W3C agree to use the XML syntax, I have no objections. I'm just trying to make sure that we are not alone.

marisademeglio commented 3 years ago

Thanks @nigelmegitt. Yes, that's how I've been doing it, even in my media fragments experiments: set src and currentTime, then check the status.

Do you like the clipBegin/clipEnd approach better because it lightens the parsing burden? Or are there additional reasons?

Here is a situation where I really like media fragments -- imagine dual serializations (XML and JSON):

<par>
  <audio src="file.mp3#t=15,27"/>
  <text src="file.html#p01"/>
</par>

becomes

{
  "audio": "file.mp3#t=15,27",
  "text": "file.html#p01"
}

instead of

{
  "audio": {
    "src": "file.mp3",
    "clipBegin": "15",
    "clipEnd": "27"
  },
  "text": {
    "src": "file.html#p01"
  }
}

nigelmegitt commented 3 years ago

@marisademeglio I'm not overly worried by the parsing burden here. I'm more concerned about the right level of expressiveness of the data format.

By the way, the clipBegin and clipEnd approach can be applied to a src URI that includes media fragments: they are not mutually exclusive. The media fragment URI generates a resource like any other URI, and clipBegin and clipEnd can then be used to manage playback within that resource.
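
For example, under one possible reading of the combination (an assumption, not spec text), a clip selects a window within the fragment-defined resource:

// Hypothetical composition: the fragment defines the resource, and
// clipBegin/clipEnd are relative to that resource's own timeline.
function effectiveWindow(fragmentBegin: number, clipBegin: number, clipEnd: number) {
  // Result expressed against the whole media file's timeline:
  return { start: fragmentBegin + clipBegin, end: fragmentBegin + clipEnd };
}

// <audio src="file.mp3#t=10,60" clipBegin="5" clipEnd="20"/>
// => under this reading, plays seconds 15-30 of file.mp3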

It's worth noting that servers may not be able to honour fragment URIs precisely, and in some cases the user agent could be able to do a better job. See the advice at Media Fragments URI 1.0 §7.4 for more details.

As an extra reference, SMIL 3.0 also uses the same syntax for media elements.

Optimising the spec at this stage for a smaller and less expressive JSON representation does not seem like the right order of priorities. It may well be that in the future other attributes of audio elements are desirable, either within the spec or outside it as extensions for specific implementations. In that case forcing the value of the "audio" key to be only a URI would be too limiting.

marisademeglio commented 3 years ago

@nigelmegitt so is it that media fragments can't express as many different types of values as clipBegin/clipEnd? E.g., SMIL clipping allows SMPTE, npt, frames, subframes, etc.?

I am not saying we should force all attributes into a single string. In fact, we have already introduced a new attribute on media objects, sync:track, that would not fit inside a media fragment. However, in the most basic and most common use case, where you just need to reference a file plus a time range, a simple representation is appealing to users. A lightweight, compact syntax that offers the same features as EPUB Media Overlays was one of the initial motivations of this CG, so while we've gone beyond that, I also don't want to forget about those expectations (whether we'll meet them or not, we'll see).

Thanks for the SMIL reference, it's good to mention it explicitly. I was an editor on SMIL3 and then later on EPUB Media Overlays, and this work is heavily based on both. SMIL's last revision was in a pre-HTML5 (and pre-media fragments) world, and it's interesting to think about how it would be different if it were revised today.

nigelmegitt commented 3 years ago

is it that media fragments can't express as many different types of values as clipBegin/clipEnd? E.g., SMIL clipping allows SMPTE, npt, frames, subframes, etc.?

That's an interesting point, that we could potentially take it further by using those, but it isn't what I was thinking of. Rather, the semantic intent is clearer to the processing code if clipBegin and clipEnd are explicit, and control is moved to the client.

If the only mechanism for partial playback of a media resource is media fragment URIs, then the client code has no knowledge of whether the server honoured the fragment request accurately: it just has a sample.

Possibly either approach, or both together, could work. It may help to have a concrete use case. Let's say that there's a single audio file and we want to play different sections of it at different times.

Option 1: the client code has a single URI for the audio resource being referenced in different places, and different clipBegin and clipEnd values for each time it is played. The resource can be downloaded with one HTTP request and cached locally for playback of the different sections using API calls. Parsing is simple in that the begin and end times are already in their own attribute values. If content providers also want to use media fragment URIs, they can.

Option 2: the client code has a different URI for each section of the media, where the differences are the media fragment parts. The client code will most likely consider each one a distinct audio resource, and should make a separate HTTP request for each one. There's no client-side caching, and if the server doesn't apply the media fragments correctly then the wrong parts of the media will play. It might be possible to do client-side pre-processing of the URLs, parsing out the fragment times, and getting the full audio resource so it is processed like option 1, but I think this would be surprising and unwelcome behaviour from the content provider's perspective.
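
For concreteness, that pre-processing could look something like the sketch below (whether a client should do this is exactly what's in question; it assumes every URL carries a #t= range in plain seconds):

// Strip the fragments, keep one URL for the whole resource, and turn each
// "#t=" range into a local section to be played via currentTime seeks.
function rewriteFragments(fragmentUrls: string[]): { src: string; sections: [number, number][] } {
  const src = fragmentUrls[0].split("#")[0]; // one resource for all sections
  const sections = fragmentUrls.map((url): [number, number] => {
    const [begin, end] = url.split("#t=")[1].split(",").map(Number);
    return [begin, end];
  });
  return { src, sections };
}

// rewriteFragments(["file.mp3#t=0,10", "file.mp3#t=30,40"])
// => { src: "file.mp3", sections: [[0, 10], [30, 40]] }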

In summary, the semantics of the two options are not actually equivalent, though the result from a user perspective might appear to be the same, if all goes well.

By the way, has anyone looked into CDN behaviour for media fragment URIs? It might be relevant if they only cache against a complete URI including the fragment parts, since that will occupy multiple slots in the cache, one for each fragment of the same complete audio resource.

in the most basic and most common use case, where you just need to reference a file plus a time range, having a simple representation is appealing to users.

@marisademeglio sorry, I really think this is the wrong prioritisation here. My internal translation of this is "a lot of the time humans reading or typing a JSON file would like to avoid a few extra characters so we should do that even though it makes impossible some less basic or common use cases".

marisademeglio commented 3 years ago

@nigelmegitt thanks for the concrete use case! So the pros of using separate src/clipBegin/clipEnd you listed are - and let me know if I have this right:

It might be possible to do client side pre-processing of the URLs, parsing out the fragment times, and getting the full audio resource so it is processed like option 1, but I think this would be surprising and unwelcome behaviour from the content provider's perspective.

I would expect a client to pre-process the URLs and request the resource once. Loading each media segment separately would probably result in a horrible user experience when applied to spoken narration. So, what you are saying is that a content provider that supports media fragment URIs is likely to require them and not work unless they were included? Or am I missing something - what then would be considered surprising and unwelcome?

That's a good question about CDN caching, unfortunately I don't have the answer myself.

A general question - what do you think of when you think of clients? I mostly think of special purpose user agents, such as today's EPUB reading systems.

murata2makoto commented 3 years ago

Audiobooks is now a Proposed Recommendation. It uses media fragments. Publishing@W3C should be consistent.

nigelmegitt commented 3 years ago

@murata2makoto I don't think there's anything inconsistent about allowing media fragments and also allowing clipBegin and clipEnd. Indeed it would be straightforward to add those to the Audiobooks spec if it turns out to be useful there.

@marisademeglio apologies for the delay. I would be extremely surprised if implementers were to pre-process URLs, unless the spec requires them to. We're talking about an optimisation, but if that optimisation is actually necessary for correct functioning, then it needs to be normatively specified. However that would mean that the spec is forcing the client implementation to get in the way between the content author and the server, and would mean that the author has a harder task to generate predictable URLs that will be requested from the server. It seems like a recipe for trouble in the future.

When I talk about clients I mean code running on a user agent that processes the document(s) and renders the output for the user. I don't think it matters whether it is a special purpose user agent or a polyfill on a web browser, for this debate.

murata2makoto commented 3 years ago

@nigelmegitt

I would argue that having two mechanisms for doing the same thing imposes unnecessary burdens on implementors and endangers interoperability.

nigelmegitt commented 3 years ago

@murata2makoto

  1. They're not the same thing, as per https://github.com/w3c/sync-media-pub/issues/30#issuecomment-717120111 and
  2. If it's true that implementers have to process audio resource URLs to avoid a bad user experience as per https://github.com/w3c/sync-media-pub/issues/30#issuecomment-717518246 then the choice is between implementing URL parsing and caching logic (relatively complex) and implementing clipBegin and clipEnd parsing and using existing API calls to play subsections of media (relatively simple).

marisademeglio commented 3 years ago

@nigelmegitt @murata2makoto If having a URI with media fragments implies fetching from a server a portion of a resource, then it does not seem suitable for fine-grained clipping, because of the caching complexity mentioned in this comment that would then be imposed on the client.

Using media fragments at a coarse level, such as to mark the beginning and end points of a whole chapter (which is what I see in the Audiobooks example) could very well make sense.

As mentioned above, media fragments can be combined with something like clipBegin/clipEnd; see this section.

dwsinger commented 3 years ago

Media-fragment syntax is formally owned by the media type, so it can (and probably does) vary; what you can say on any given media type probably varies in both syntax and capability.