Closed kodawah closed 7 years ago
The intent was to have a way to signal that the content is in fact stereo, but indicate that the actual information about which parts of the frame map to each eye is specified elsewhere. I explicitly referenced the mesh projection because that at least gives folks an idea of where to look for the pixel -> eye mapping. I could drop the last sentence if that would help any. Otherwise please provide some suggested text that would be generic enough to satisfy your objection.
I feel like we still need an enum value for this situation so that implementations can look at the stereo_mode in Matroska & MP4 to always determine if it is dealing with mono or stereo content. Hiding this in the projection private leads to the stereo_mode containing a lie in some form or another.
Hi Aaron, the intent is clear, but there could be some improvements on the semantics.
Namely I'd like to address these points:
Theoretically we could use a wording to make it more generic, such as "frame is stereoscopic but layout is implementation dependent" (even though it's not ideal), but first it should be determined whether adding a value to stereo-mode is appropriate or not.
Do you have an actual example for this value or a visual representation of this layout?
Hi, is there any progress on this issue? Since the change was introduced without any chance of commenting, a speedier approach and final agreement on this would be appreciated.
Sorry I haven't had time to get back to this.
Technically this isn't true even for the existing modes. Those modes are a subset of the H.264 frame packing arrangement SEI and must be consistent with that data. The point of this box is really to answer the question of whether the content is stereo and how UV coordinates should be modified on a per eye basis. I can drop the word "mesh" if that is a problem, but we still need a way to signal "This is stereo content AND UV coordinates should not be modified."
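To make the "how UV coordinates should be modified on a per eye basis" point concrete, here is a minimal sketch rather than spec text. It assumes the values 0/1/2 mean mono, top-bottom and left-right, that the left eye occupies the top or left half, that v = 0 is the top of the frame, and that the proposed new mode takes the value 3. The names `StereoMode` and `eye_uv` are hypothetical.

```python
from enum import IntEnum


class StereoMode(IntEnum):
    MONO = 0
    TOP_BOTTOM = 1
    LEFT_RIGHT = 2
    CUSTOM = 3  # assumed value for "stereo, but UVs must not be modified"


def eye_uv(mode, eye, u, v):
    """Map a normalized UV coordinate to frame UV for eye 'left' or 'right'."""
    if mode in (StereoMode.MONO, StereoMode.CUSTOM):
        # Mono has one view; CUSTOM leaves the per-eye mapping to other
        # metadata (for example the mesh), so UVs pass through untouched.
        return (u, v)
    offset = 0.0 if eye == "left" else 0.5
    if mode == StereoMode.TOP_BOTTOM:
        # Assumption: left eye in the top half, v = 0 at the top edge.
        return (u, v * 0.5 + offset)
    if mode == StereoMode.LEFT_RIGHT:
        # Assumption: left eye in the left half.
        return (u * 0.5 + offset, v)
    raise ValueError(f"unsupported stereo mode: {mode!r}")
```

The only difference between `MONO` and `CUSTOM` in this sketch is what the renderer does afterwards: with `CUSTOM` it still renders two views, but takes the per-eye coordinates from elsewhere.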
I disagree. This is just not historically true even for MPEG specs. This is a new use case and we do plan on using it. If you don't want to support it now, that is totally fine.
The layout is not fixed. It is completely dependent on the encoded mesh which is the whole point. That is mainly why I kept mesh in the name because it was intended to help implementations know where to look for the per-eye data.
I don't have an example to share at the moment. If it helps, you can imagine one use case where different meshes are used for each eye, but parts of those meshes use the same portion of the frame instead of disjoint sets. A very simple example of this is some of the 360 content out there where the poles contain a static logo image. You can reclaim some pixels if you share that image between eyes.
Again it is fine if you don't want to support cases like this, but we do plan on using this feature and we are just trying to disclose this fact publicly. The change was made because we realized in our own internal implementations that a separate mode needed to be signalled for this case.
So I think I see the problem and I understand that you don't want to do a full disclosure of your business plans, but this is a shared public specification, and so it should adhere to certain formal procedures: the fact that this is a draft does not prevent it from being discussed at least privately by the companies that are adopting it. Moreover you can't dismiss the problem I outlined just by saying "if you don't want to use it, feel free to not support it" -- this is not how agreeing on a specification works.
Anyway, I can see the need to add a stereo mode and I would even support it despite the fact that it differs from what MPEG/ISO specify, but what is really worrying is that it violates the principle that the two boxes (st3d and sv3d) are independent of each other. This wasn't the case when the box layout was different in the past, but the fact that it was agreed and accepted to keep the two separate should rule out any attempt to make them interdependent again. Theoretically you could have a st3d box applied to a video without any sv3d boxes, in order to mark a video as simply stereoscopic for example.
On that point, the same is true in the opposite direction: you can have a spherical video without a st3d box. Such a configuration would not mean that the video is monoscopic, but simply that the stereoscopic status is unknown. This would allow the stereo status to be left application-dependent, or for the case at hand projection-dependent, in a way that does not pollute other parts of the specification.
In other words, what I'm proposing is to revert stereo_mode values to [0,1,2] only, and instead simply add a projection specific stereo field to the projection specific properties for the only projection where it can be applied (mesh). This would allow a clean description of the specifications, it would respect the independence of the boxes that we agreed upon, and it would streamline implementations of the mesh projections both for users that plan to support it and for those who don't.
What is your solution for Matroska? Lack of stereo_mode means mono and I was trying to keep MP4 & Matroska consistent.
Matroska is a slightly more malleable specification than MP4 (at least for stereoscopic videos, as there is a note "The 3D support is still in infancy and may evolve to support more features"). So my solution for Matroska would be to add a new value to the set which loosely conveys the fact that the video is stereoscopic, but its layout is application dependent and not part of that document. Alternatively we could add a plain "unknown" value, like it was done for DisplayUnit.
In either case, when the projection is mesh, mp4 would simply not include st3d and mkv would set StereoMode to this new value. Applications will then react and set the stereoscopic mode by reading the rest of the projection specific details.
Why is it ok for mkv, but not MP4? The st3d box is essentially the StereoMode element and its presence in the spherical-video is just as malleable. It feels like you are suggesting adding more places to look for the truth about stereo-ness instead of just allowing one place to look.
I can understand the sensitivity to one box depending on another, but that happens on both specs and isn't unheard of. I'm also just trying to establish a common starting point for determining this information instead of implementations having to keep track of N places to look for this information. I know this is already a second location to look, but what you are proposing seems to advocate for continuing to allow this number to keep growing.
No, I think I didn't explain myself clearly enough.
First, MP4 had no stereo3d specified, so a new box was introduced. The behaviour when the box is missing is left to the application, which can (for example) parse the H.264 frame packing arrangement SEI and enable stereoscopic rendering. MKV, on the other hand, already had one way to signal this and it was reused, so we have to deal with backward compatibility (as expected), and we can't reuse the semantics that being unset (or 0) means 'unknown'. This is why it is ok for mkv, but not for mp4.
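The asymmetry described here can be sketched as follows. This is an illustration under stated assumptions, not a parser: `mp4_stereo`, `mkv_stereo` and the `Stereo` enum are hypothetical names, and each function receives the already-parsed mode value, or `None` when the box or element is absent.

```python
from enum import Enum
from typing import Optional


class Stereo(Enum):
    UNKNOWN = "unknown"
    MONO = "mono"
    STEREO = "stereo"


def mp4_stereo(st3d_mode: Optional[int]) -> Stereo:
    """MP4: a missing st3d box leaves the stereo status to the application."""
    if st3d_mode is None:
        return Stereo.UNKNOWN
    return Stereo.MONO if st3d_mode == 0 else Stereo.STEREO


def mkv_stereo(stereo_mode: Optional[int]) -> Stereo:
    """Matroska: an absent StereoMode falls back to the default 0, i.e. mono."""
    mode = 0 if stereo_mode is None else stereo_mode
    return Stereo.MONO if mode == 0 else Stereo.STEREO
```

Because `mkv_stereo(None)` can never report `UNKNOWN`, Matroska needs an explicit value to express "stereoscopic, layout defined elsewhere", while MP4 can express it by omitting the box.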
Second, I'm advocating to add one single value in mkv to an already quite large table, in order to cover all possible future cases. If in some time we want to add a new projection that needs a new stereoscopic layout for any given reason, then we would need to add a new value to both stereo_mode and mkv. This is unmaintainable in the long run, since this number is allowed to keep growing too.
Third, while I appreciate trying to cover two specifications (mp4 and mkv) with one, this should be done only if the integrity of the specifications is not compromised, and we should not sacrifice the independence between st3d and sv3d for something that can be conveyed more simply in another way.
Finally, no, I'm not suggesting that we should add multiple places to look for the stereo-ness of a video, but rather I'm advocating that there should be a clean, idiomatic, and unambiguous way to convey it. The current proposal breaks this idiom by adding a projection-specific stereo mode and this is in my opinion unacceptable. So I would suggest adopting something that has already been done for other EBML elements (https://github.com/Matroska-Org/foundation-source/pull/17) and that would allow us to keep expanding projections and stereo modes independently of one another.
You feel marking the stereo mode "unknown" and then having an element elsewhere that makes the stereo-ness known is better than having a new value in a known place that tells you where to look for more detailed information?
Is the problem with the mesh projection specifically, any spherical projection, or any other source of extra stereo info? Would it be any more acceptable if the mode essentially meant "go look in the spherical projection metadata for extra info about the stereo mode"? This would at least avoid needing to add new values for each new projection that had special stereo concerns.
I appreciate your patience with me. I apologize for being a little combative above.
> You feel marking the stereo mode "unknown" and then having an element elsewhere that makes the stereo-ness known is better than having a new value in a known place that tells you where to look for more detailed information?
Yes, I believe so. Remember that st3d and sv3d should be independent of each other and not mention one another. Having an "unknown" value, or even better an "application dependent" value, would leave the interpretation generic enough that other future users might safely adopt it. I'm not sure if you were involved in drafting back then, but I originally envisioned the case where you'd want to use this spec to tag MVC (or MVC-like) videos, in which you would leave st3d unset. Having an unknown/app-dependent value could work as well (if not ambiguously worded).
> Is the problem with the mesh projection specifically, any spherical projection, or any other source of extra stereo info? Would it be any more acceptable if the mode essentially meant "go look in the spherical projection metadata for extra info about the stereo mode"? This would at least avoid needing to add new values for each new projection that had special stereo concerns.
The problem is overloading a value in fields destined for something else. So yes, any spherical projection mentioned in st3d (or any codec-dependent metadata) would be problematic in my opinion. I think that any solution that keeps the new value generic enough would be good, so I wouldn't phrase it as "go look in the spherical projection metadata for extra info about the stereo mode" (since this would again mention a spherical rendering), but rather "This stream is stereoscopic, but its layout is unknown", and let the application figure it out (e.g. by reading an fpa, the mesh projection private fields, or some future technology).
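One way an application could "figure it out" is an ordered chain of probes that is consulted only when the container signals a stereoscopic stream with an unspecified layout. The sketch below is purely illustrative: the stream is modelled as a plain dict, and every key and function name (`layout_unspecified`, `fpa_layout`, `mesh_has_eye_data`, `resolve_layout`) is a hypothetical placeholder rather than anything defined by the spec.

```python
from typing import Callable, Iterable, Optional

# A probe inspects already-parsed stream metadata and returns a layout name,
# or None when it has nothing to say.
Probe = Callable[[dict], Optional[str]]


def probe_fpa_sei(stream: dict) -> Optional[str]:
    # Hypothetical key holding a layout derived from a frame packing SEI.
    return stream.get("fpa_layout")


def probe_mesh_private(stream: dict) -> Optional[str]:
    # Hypothetical flag set when the mesh projection carries per-eye data.
    return "per-eye-mesh" if stream.get("mesh_has_eye_data") else None


def resolve_layout(stream: dict, probes: Iterable[Probe]) -> str:
    """Return the declared layout, or ask each probe in order if unspecified."""
    if not stream.get("layout_unspecified"):
        return stream.get("declared_layout", "mono")
    for probe in probes:
        layout = probe(stream)
        if layout is not None:
            return layout
    return "undetermined"
```

The stereo signalling itself names none of the probes, which is the independence argued for above: new sources of layout information can be appended to the list without touching the generic value.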
> I appreciate your patience with me. I apologize for being a little combative above.
No problem, sorry for sounding a bit pedantic, I'm glad it's possible to talk about these issues.
@kodabb sorry I haven't been able to get back to this. I've had to focus on a bunch of other things. I'm coming around to your point of view. I'll try to send out a pull request in the next week or so.
Which would you prefer: Option A:
Option B:
Naming is hard.. so I'm also open to other suggestions on names. I'm slightly biased against "unknown" because we technically do have information about the layout so it is "known" it is just not one of the standard layouts.
Hi @acolwell.
I'm fine with either option. As long as the layout is always frame packed, I would go with option B, while if the stereo layer is different (like for MVC) I would opt for option A. This solution would be similar to the Matroska implementation, where two full frames of a stereo scene are signalled with a different unit than StereoMode, which is reserved for anything that is frame packed.
Regarding names, I think custom could work, I would describe it as "Indicates the video frame contains a stereoscopic view storing left and right eyes in the frame, but its layout is application dependent, and needs to be determined elsewhere" (or similar).
Possibly a tangential thing, but if MKV tags were used for metadata as a kind of secondary verification, there are some 'spacial' tags (sic) in here: https://matroska.org/technical/specs/tagging/index.html but as you can see, they're for location on a wider scale, and also it seems mainly for music. However, they could be co-opted anyway, (probably) without harm, by adding some specific agreed metadata to do this sort of thing. Maybe?
ping
Changes merged.
I just came across this old issue accidentally. MPEG-A specified the 'svmi' box in 23000-11 (Stereoscopic video application format) in 2009.
Unfortunately, not a free spec. https://www.sis.se/api/document/preview/913576/ - but GStreamer implements support for it (https://github.com/GStreamer/gst-plugins-good/blob/master/gst/isomp4/gstqtmux.c)
The stereo_mode box describes a generic way to pack views which bears no relationship to the fact that it is used to pack spherical videos. Additionally this is a property that only applies to a single projection type, which not only crosses boundaries, but also defeats the point of having st3d and sv3d separated. In my opinion this belongs in the projection private fields, but if there is absolutely no argument for modifying this, the wording should be improved to make it more generic.
cc @acolwell