padenot opened this issue 3 years ago
Are you aware of any containers here that are not ISOBMFF? If not, AFAIK images are always stored muxed, only multi-frame content based on a multi-frame codec like HEVC, AV1, or H264 is stored demuxed -- but those are served by VideoDecoder.
So my take is that we probably don't need any changes to the ImageDecoder API, just a demuxing API; clients will use VideoDecoder for demuxed codec data and ImageDecoder for things that are still complete image files.
@baumanj, can you explain more precisely what you had in mind here?
WebP uses RIFF, which is not ISOBMFF. Also, since there's lots of useful information that can be derived from containers without ever doing any decoding, it seems useful to make a separation to allow for greater forward compatibility. Add to that the fact that IP restrictions make formats like HEIC undecodable on many platforms, even though the same container-interpretation code could give access to useful metadata. Finally, I don't assume that ISOBMFF, which is fairly complex, not free, and heavyweight, will be the only container any new formats ever use, so separating out the ability to decode the image data one may store within it today would increase flexibility and innovation in the future.
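To make the "metadata without decoding" point concrete, here is a minimal sketch of walking RIFF chunks (the container WebP uses) with no decoder involved. It assumes the RIFF layout from the spec (4-byte ASCII chunk ID, little-endian 32-bit size, payloads padded to even offsets); the function name is illustrative, not part of any proposed API.

```javascript
// Sketch: list the chunks of a RIFF file (e.g. a WebP) without decoding
// any image data. Chunk IDs alone distinguish lossy ("VP8 "), lossless
// ("VP8L"), and extended/animated ("VP8X") WebP files.
function listRiffChunks(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const fourCC = (off) =>
    String.fromCharCode(bytes[off], bytes[off + 1], bytes[off + 2], bytes[off + 3]);
  if (fourCC(0) !== "RIFF") throw new Error("not a RIFF file");
  const chunks = [];
  let off = 12; // skip "RIFF", the file size, and the form type (e.g. "WEBP")
  while (off + 8 <= bytes.length) {
    const id = fourCC(off);
    const size = view.getUint32(off + 4, true); // chunk sizes are little-endian
    chunks.push({ id, size });
    off += 8 + size + (size & 1); // payloads are padded to even offsets
  }
  return chunks;
}
```

Crucially, nothing here depends on being able to decode the payloads, which is exactly the HEIC-on-unlicensed-platforms situation described above.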
To me that sounds like you're just advocating for a containers API, which I agree is a nice value add. However, I can't think of any concrete changes that we'd make to the ImageDecoder API which would help this use case. Can you give some examples of what you're thinking about?
I.e., in a world with a browser provided RIFF, ISOBMFF, $future_format demuxer that vends demuxed samples, what would we add to the ImageDecoder API that's not there already? Since we take a raw array buffer + mime type, I think we can decode whatever may come if it's appropriate to home it on ImageDecoder instead of say VideoDecoder.
I'm not suggesting this needs to be part of the ImageDecoder API itself, but rather that implicitly including demuxing as a black box inside the decoder API makes it less useful and forward-looking. Wouldn't it be better to keep the decoder API smaller and simpler and break out the demuxing concerns to an API which is appropriately focused on the details relevant to containers?
For one, the ability to get metadata about the image(s) without doing the decode seems like a natural one. Given containers with the ability to have multiple images encoded at different sizes, color spaces, etc., there are many questions that could be asked of a container that I don't think it makes sense to add to a decoder interface which is more appropriately streamlined to f(coded image) → pixel data. Also, the same container could conceivably contain multiple coded image types only some of which the platform has decoder support for. How would the Image Decoding API handle that?
Images and their containers are highly coupled in the by-far most common cases (GIF, JPEG, etc), so anything that separates those processes seems bound to incur more complexity and not less. I don't think it makes sense in the common case to force authors to go through a separate demuxing and decoding phase for images. Aside from the complexity this feels like a performance issue in the single image case. I talk about this in the explainer.
I 100% agree that a containers API and metadata extraction is very useful and something we should consider going forward, but I think it's orthogonal to what we have in the ImageDecoder API. I'm absolutely on board with exposing what metadata makes sense through the tracks API that ImageDecoder has. In your example we'd only expose the tracks the platform can decode.
Images and their containers are highly coupled in the by-far most common cases
I definitely agree that is the case with the legacy formats, but since WebP, it looks like things are moving in a different direction. All the formats developed in the past decade likely to see broad use that I'm aware of (HEIC, AVIF, JPEG-XL, etc.) decouple the container from the codec. Also, it looks like images are tending to become more video-like in their development, so I would expect to see more of the mixing and matching of container and codec that's so common in that space.
I don't think it makes sense in the common case to force authors to go through a separate demuxing and decoding phase for images
I agree. There should definitely be a simple path for the common case of inputting the whole container and getting out something renderable. I'm just advocating for the inclusion of container-level awareness as a first class abstraction here since I believe it will smooth adoption of new formats.
It sounds like you're just in favor of issue #24 then. Do you have any concrete proposals for how we should change ImageDecoder? If not I think we should close this issue and move discussion to #24.
Per https://github.com/w3c/webcodecs/issues/205#issuecomment-829572333, I think ImageDecoder can already do everything we'd want to do in a post-containers-API world, and VideoDecoder covers any gaps in the demuxed packets case.
Do you have any concrete proposals for how we should change ImageDecoder?
I think the addition of a metadata query would be very useful. That wouldn't strictly be part of a container API since it's also germane to JPEG, PNG and other highly coupled formats like you mentioned, but going forward, that should almost certainly be a container-level operation to extract metadata like dimensions, bitdepth, colorspace, existence of alpha, exif data, etc.
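Even for a tightly coupled format like PNG, the metadata in question sits in a fixed spot and can be read without a decoder: dimensions, bit depth, and color type (hence alpha) all live in the leading IHDR chunk. A hypothetical probe, sketched here purely to illustrate the kind of query being requested (the helper name and return shape are made up, not a proposed interface):

```javascript
// Sketch: read PNG metadata straight from the IHDR chunk -- no decode.
// Layout per the PNG spec: 8-byte signature, 4-byte chunk length, "IHDR",
// then width(4, big-endian), height(4), bit depth(1), color type(1).
function probePngHeader(bytes) {
  const PNG_SIG = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
  if (!PNG_SIG.every((b, i) => bytes[i] === b)) throw new Error("not a PNG");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const colorType = bytes[25];
  return {
    width: view.getUint32(16),  // PNG integers are big-endian
    height: view.getUint32(20),
    bitDepth: bytes[24],
    colorType,                  // 4 = gray+alpha, 6 = truecolor+alpha
    hasAlpha: colorType === 4 || colorType === 6,
  };
}
```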
VideoDecoder covers any gaps in the demuxed packets case
Can you elaborate?
I think the addition of a metadata query would be very useful. That wouldn't strictly be part of a container API since it's also germane to JPEG, PNG and other highly coupled formats like you mentioned, but going forward, that should almost certainly be a container-level operation to extract metadata like dimensions, bitdepth, colorspace, existence of alpha, exif data, etc.
As mentioned above. I'm in total agreement on adding such metadata, I designed the ImageTrack interface for such things. Today it's just frame count and other simple data, but I envision it to hold all the things you're talking about and more. Metadata is decoded automatically in the current spec language. It is indeed a separate step from decoding.
So apologies, but I'm still confused on what your request is for the current API shape. Are you just suggesting those metadata fields? That seems a minor addition and less of a fundamental shape thing. Can you elaborate more?
VideoDecoder covers any gaps in the demuxed packets case
Can you elaborate?
I.e., in some future world where we have a WebContainers API (which we're all in favor of, just maybe not as part of WebCodecs or at least not in v1), you'd pass a bytestream to said container API and after some track selection, in addition to metadata, you'd get demuxed packets of codec XYZ. If that codec happens to be a video one, you can use the VideoDecoder API as it stands today. If it's an image one, there's no reason the current API shape can't accept those as a ReadableStream of bytes or typed chunks.
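A rough sketch of that flow, assuming a hypothetical demuxer that yields tracks of `{ kind, codec, packets }`. Only the WebCodecs names here (VideoDecoder, EncodedVideoChunk, configure, decode, flush) are real; the routing helper and track shape are invented for illustration.

```javascript
// Sketch: once a (hypothetical) containers API hands back demuxed tracks,
// routing is by track kind. Video-codec packets go to VideoDecoder today;
// still-image packets would need ImageDecoder extended to accept chunks.
function pickDecoderKind(trackKind) {
  return trackKind === "video" ? "VideoDecoder" : "ImageDecoder";
}

// Illustrative use of the real VideoDecoder API on one demuxed video track.
// Not invoked here: VideoDecoder only exists in the browser.
async function decodeDemuxedVideoTrack(track, onFrame) {
  const decoder = new VideoDecoder({ output: onFrame, error: console.error });
  decoder.configure({ codec: track.codec }); // e.g. "av01.0.04M.08"
  for (const pkt of track.packets) {
    decoder.decode(new EncodedVideoChunk({
      type: pkt.key ? "key" : "delta",
      timestamp: pkt.timestamp,
      data: pkt.data,
    }));
  }
  await decoder.flush();
}
```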
So apologies, but I'm still confused on what your request is for the current API shape. Are you just suggesting those metadata fields? That seems a minor addition and less of a fundamental shape thing. Can you elaborate more?
I agree the shape is reasonable (providing metadata via the ImageTrack interface), but I didn't realize that was the intention because it doesn't have most of the kind of metadata I'd assumed would be included (dimensions, color space, transforms, etc.). Is there a reason that can't be included now?
Sorry that wasn't clearer! The only reason is that we were trying to be conservative in what was exposed currently. We dropped exif rotation for now since we couldn't agree on the best way to expose it. Dimensions seems like an easy and non-controversial one to add now. Are there any others you'd prefer would be in WebCodecs v1?
Color space and transforms will need to wait until we figure out the right language for describing them, which may not be a part of v1 since we'll probably need to consider interplay between the new canvas color spaces and such.
Triage note: tentatively marking 'extension' as recent discussion proposes new attributes/metadata.
We dropped exif rotation for now since we couldn't agree on the best way to expose it.
Because <img> does it now, this needs to be figured out, otherwise we can't reimplement <img> with, say, ImageDecoder and canvas.
And even without considering that, it's not great to not be able to draw images in the right orientation...
Sorry to be clear, we just dropped a public accessor for the exif rotation code metadata -- orientation works correctly. It's all handled under the hood (there are extensive WPT for this) just like it is for img.
Another thing that occurs to me that maybe represents an actual difference in API shape: if ImageBufferSource can only be a containerized image (for formats which have containers), I worry that this will discourage innovation. We already have formats that can be used in various container contexts, and providing a decode-only interface would allow a consumer of this API to deal with the container details themselves instead of waiting for them to be implemented by the browser. I'm pretty sure CDNs are going to be interested in ways to slim down images into a format which is more minimalistic, given that the defaults tend to prioritize flexibility over minimizing the byte overhead.
How would people feel about having an interface that can take a raw coded frame w/o metadata (other than what the codec defines) in addition to the convenience interface that can be passed the entire containerized image?
That's what I was referring to with this above:
If that codec happens to be a video one, you can use the VideoDecoder API as it stands today. If it's an image one, there's no reason the current API shape can't accept those as a ReadableStream of bytes or typed chunks.
ReadableStreams are dynamically typed, so we can always allow a stream of chunks later on combined with a mime type requirement. I don't think we need this quite yet, but it wouldn't be breaking to add at any point in the future.
How would people feel about having an interface that can take a raw coded frame w/o metadata
We have an API for decoding raw frames: VideoDecoder. The problem is that advanced image formats don't have standardized raw formats, so we can't easily specify how you would ask VideoDecoder to do that work.
That's not in scope for WebCodecs V1, and I doubt that inventing bespoke formats is ever going to be in-scope for WebCodecs.
In cases where there is a standardized raw format, it would make sense for UAs to implement support for them. Whether those make more sense in ImageDecoder vs VideoDecoder would depend on the individual formats.
We have an API for decoding raw frames: VideoDecoder. The problem is that advanced image formats don't have standardized raw formats, so we can't easily specify how you would ask VideoDecoder to do that work.
AVIF is based on AV1, which has a standardized, free-to-use, publicly available format, right?
That's not in scope for WebCodecs V1, and I doubt that inventing bespoke formats is ever going to be in-scope for WebCodecs.
I'm not clear what you mean by "bespoke" in this context. If you mean whatever a random individual creates, I agree that's not an important use case, but if you mean something that is early in the standards development process which hasn't had time to be implemented by all browsers, allowing early adopters to provide container interpretation while still leaning on a standard API to handle the decoding seems like a boon to innovation.
In cases where there is a standardized raw format, it would make sense for UAs to implement support for them. Whether those make more sense in ImageDecoder vs VideoDecoder would depend on the individual formats.
I'm not really clear whether you support adding an interface to decode raw (that is, demuxed) coded data to the image decoding API here or not.
AVIF is based on AV1, which has a standardized, free-to-use, publicly available format, right?
Yes, and AV1 is directly supported by VideoDecoder.
I'm not clear what you mean by "bespoke" in this context.
An example here would be PNG or JPEG. These formats are tightly coupled to their containers, so it's not clear what a "raw", uncontainered version of these would be. We could invent our own, which would be "bespoke".
I'm not really clear whether you support adding an interface to decode raw (that is, demuxed) coded data to the image decoding API here or not.
To be clear: We believe such data should be processed by the VideoDecoder API, it's properly designed to handle all the intricacies of demuxed (implying configuration is separate -- a crucial detail) coded data. However we're not opposed to extending ImageDecoder to take a ReadableStream of EncodedVideoChunks for formats the user agent accepts in <img> if a strong enough use case is presented.
That said, I think our conversation is meandering quite a bit - to the point that I'm not really sure what we're discussing anymore. @baumanj can you please provide a concrete list of your requests? As far as I can tell, it seems you have two:
* Accepting demuxed data in ImageDecoder.
* Adding more metadata to the ImageTrack.
For the first I haven't heard any reasons why the VideoDecoder API is insufficient. For the second, we should split into individual issues for each piece of metadata you would like to add.
I'm not clear what you mean by "bespoke" in this context.
An example here would be PNG or JPEG. These formats are tightly coupled to their containers, so it's not clear what a "raw", uncontainered version of these would be. We could invent our own, which would be "bespoke".
For containerless formats like PNG or JPEG, I'd say the demuxing operation is a noop, and the same content should be accepted as inputs to a theoretical "raw" input mechanism for the ImageDecoder API. Is there a downside to that?
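The "demux is a noop" idea can be illustrated with plain magic-byte sniffing: the leading bytes already tell you whether there is a container layer to strip at all. The signatures used are the standard ones (PNG, JPEG/JFIF SOI, RIFF, ISOBMFF `ftyp` box); the helper itself is hypothetical.

```javascript
// Sketch: detect whether input bytes carry a container layer. For PNG and
// JPEG the demux step is a no-op and the bytes can feed a decoder directly;
// for RIFF or ISOBMFF input a demuxer would run first.
function containerKind(bytes) {
  const ascii = (off, len) =>
    String.fromCharCode(...bytes.slice(off, off + len));
  if (bytes[0] === 0x89 && ascii(1, 3) === "PNG") return "none (PNG)";
  if (bytes[0] === 0xff && bytes[1] === 0xd8) return "none (JPEG)";
  if (ascii(0, 4) === "RIFF") return "RIFF";
  if (ascii(4, 4) === "ftyp") return "ISOBMFF"; // box type at offset 4
  return "unknown";
}
```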
That said, I think our conversation is meandering quite a bit - to the point that I'm not really sure what we're discussing anymore. @baumanj can you please provide a concrete list of your requests? As far as I can tell, it seems you have two:
* Accepting demuxed data in ImageDecoder.
* Adding more metadata to the ImageTrack.
I think that's a fair summary. Thanks for refocusing.
For the first I haven't heard any reasons why the VideoDecoder API is insufficient.
Implementation-wise, I expect the same underlying decoder libraries to be used for both ImageDecoder and VideoDecoder where appropriate, but I don't think it's appropriate for VideoDecoder to have full responsibility for decoding raw image data. The fact that several major recent still image formats are based on video codecs is a coincidence, not a fundamental property that should drive the shape of the API. Would JPEG-XL be supported by VideoDecoder? Since we're talking about images, what's the downside of providing a facility within ImageDecoder for handling demuxed input?
For the second, we should split into individual issues for each piece of metadata you would like to add.
Sounds good
For containerless formats like PNG or JPEG, I'd say the demuxing operation is a noop, and the same content should be accepted as inputs to a theoretical "raw" input mechanism for the ImageDecoder API. Is there a downside to that?
I don't follow; ImageDecoder can already decode PNG and JPEG, so the existing API is already "raw" here.
The fact that several major recent still image formats are based on video codecs is a coincidence
I'd say we are seeing a bifurcation in image formats that is likely to continue in the future. These codec-based formats have features that align well with VideoDecoder, while non-codec formats align better with ImageDecoder.
What is a coincidence is that none of the non-codec formats can be meaningfully demuxed, but without an example of something different I don't think we're ready to propose an API for it.
Would JPEG-XL be supported by VideoDecoder?
I think JPEG XL falls squarely within the ImageDecoder feature set using the current API.
There are a number of container formats that can hold a variety of different codecs’ coded data as well as the possibility of the same coded data appearing in a variety of containers.
If a demuxing API is later considered (it's not infrequently brought up during calls and discussions with developers; https://github.com/w3c/webcodecs/issues/24 tracks it, though it's certainly less urgent than decoders!), perhaps this could be handled by desugaring the image decoding API?