Native GLTF - Githubissues

sushraja-msft commented 4 years ago

This proposal is about adding built-in GLTF support in the browser.

At the W3C workshop on web games, Yasushi Ando conducted a breakout session discussing native support for GLTF in the browser. His position statement can be read here https://www.w3.org/2018/12/games-workshop/papers/new-html-3D-element.txt, proposes a scene element that can render a gltf/glb model.

<scene controls vrenabled width="300">
      <source src="http://example.com/Monster_small.glb" type="model/gltf-binary" media="(min-width: 320px)">
      <source src="http://example.com/Monster.gltf" type="model/gltf+json" media="(min-width: 640px)">
      Message for unsupported browsers
</scene>

The GLTF would be exposed to JavaScript allowing manipulation of the scene graph, transform or triggering animations. I took a follow up from the workshop to post in this community group to discuss value proposition and shape of the API.

Adding native support for GLTF is appealing and can add value in the following ways

Performance A single GLTF model file can describe an entire scene apart from standalone 3d objects. Traversing, rendering such a scene graph in native code allows fulling utilizing the hardware capabilities (threads, graphics API) to have performant rendering. Performance improvements allows for higher fps or richer scenes.

Model Protection Protecting 3d assets was a developer concern during the workshop, native support for GLTF can allow for solutions like EME to protect assets.

Independent Rendering Describing the 3d scene to the browser instead of the current immediate mode WebGL style API, allows the browser to render the scene independent of the JavaScript frame rate. In a XR use case, a scene can be re-rendered to the new headset orientation independent of script performance.

Foreign Object Similar in concept to foreign objects in SVG, in the future we can allow embedding HTML content within 3D scenes as a texture source for 3d objects in the scene.

Hit Testing The browser can offer hit testing of the 3D scene in a performant manner. Combined with Foreign Object because the browser understands the scene, it can offer hit testing down to the HTML elements. Caveat - supporting hit test into HTML content will require further spec’ing input to prevent feature misuse for click-baiting.

Dev Tools Integration Like HTML, browser dev tools (F12) integration can provide value to developers. Ability to debug a scene, see properties on nodes, view performance metrics are examples of data surface through dev tools.

3D IFrames Far out in the future, when 3d content is ubiquitous - we could want a section of 3d space in AR/VR to be owned by a domain1.com and another section to be domain2.com. This is a similar model to Iframes in 2d, having the browser understand the scene emitted by each domain is a steppingstone to building such a metaverse.

Prior Art

Apple supports viewing usdz models through anchor tags https://webkit.org/blog/8421/viewing-augmented-reality-assets-in-safari-for-ios/

<a rel="ar" href="model.usdz">
    <img src="model-preview.jpg">
</a>

Announced during Google IO 2019, Google’s approach is through model viewer js library which adds a custom element the <model-viewer>. https://developers.google.com/web/updates/2019/02/model-viewer#basic_3d_models Later the src attribute can be set to a gltf file and the javascript renders the model for you. <model-viewer src="assets/Astronaut.gltf" alt="A 3D model of an astronaut">

API Shape

This proposal adds WebGL interop to enable a broader use case where WebGL can add post render effects, particles or text sourced from canvas elements. There are at least 3 concepts here when it comes to native 3d model support.

GltfModel This is the component / class responsible for loading a GLTF file and loading up resources to be consumed in a format agnostic way.

Setting GltfModel.Src = http://modelfile.gltf or glb would download , parse turn buffer views into a texture that WebGL understands and then will construct the Scene/SceneNodes. We would have a loaded event at the end of the whole process.

GltfModel.Scene then points to the loaded GLTF scene.

Scene/SceneNodes This is the way we represent the 3d content from the GLTF file in a retained fashion – it is the object model for GLTF/3d. JavaScript can act on these objects and change transform and properties on them.

Splitting the loader from the Scene will insulate us from any model format changes in the future.

Viewers

Now that we have a representation of the scene as Scene/SceneNodes there can be viewers for this scene. Three of these are possible

WebGL Scene.draw(WebGLRenderingContext context). This would take the vertex buffer and transforms represented in the scene graph and draw that with the WebGL context. Almost the same steps that a WebGL renderer would go through but in native code behind a single JavaScript call.
WebGPU Like WebGL in the future we should add some way to render a Scene with SceneNodes to webgpu workflow.
SceneElement <scene></scene> This is the final one that is an actual html element. This is the home for adding capabilities beyond what WebGL can offer. Here is where independent rendering will show up, here is where with a single attribute, the developer can ask the browser to render the same scene to an XR device.

Example use cases

Focusing on the WebGL view, here is how a developer would use it to render a spinning model. Modifying the spinning cube sample from https://www.tutorialspoint.com/webgl/webgl_cube_rotation.htm

<!doctype html>
<html>
   <body>
      <canvas width = "570" height = "570" id = "my_Canvas"></canvas>
      <script>
         var canvas = document.getElementById('my_Canvas');
         gl = canvas.getContext('webgl');
         function get_projection(angle, a, zMin, zMax) {
            var ang = Math.tan((angle*.5)*Math.PI/180);//angle*.5
            return [
               0.5/ang, 0 , 0, 0,
               0, 0.5*a/ang, 0, 0,
               0, 0, -(zMax+zMin)/(zMax-zMin), -1,
               0, 0, (-2*zMax*zMin)/(zMax-zMin), 0 
            ];
         }
        var proj_matrix = new DOMMatrix(get_projection(40, 
canvas.width/canvas.height, 1, 100));

         /*===============Instantiate GLTF Model =================*/
        const model = new GltfModel('https://github.com/KhronosGroup/
glTF-Sample-Models/blob/master/2.0/BarramundiFish/
glTF-Binary/BarramundiFish.glb');

         var time_old = 0;
         var animate = function(time) {
            var dt = time-time_old;
            /*===========Update model transform each frame ==========*/
            model.scene.children[0].localTransform.rotateSelf(
dt*0.003, dt*0.002, dt*0.005);
            time_old = time;

            gl.enable(gl.DEPTH_TEST);
            gl.depthFunc(gl.LEQUAL);
            gl.clearColor(0.5, 0.5, 0.5, 0.9);
            gl.clearDepth(1.0);
            gl.viewport(0.0, 0.0, canvas.width, canvas.height);
            gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
            /*================= Draw GLTF model ================*/
            model.scene.draw(gl);
            window.requestAnimationFrame(animate);
         }

         /*======= Wait for model download / parse to complete =========*/
         model.load().then(() => {
            model.scene.activeCameraNode.projectionMatrix = proj_matrix;
            animate(0);
        }).catch((UnknownError) => {
            alert("Failed to load Model !");
        })
      </script>
   </body>
</html>

Next Steps WebGL is a stateful API and there are details to work through in this proposal. The purpose of this post is to gather some consensus on Why and collect feedback on How we surface GLTF support in the browser. Hope to have a conversation around this.

technohippy commented 4 years ago

Thank you for registering the issue. In the issue, there are many interesting ideas I've never thought. Among them, the most favorite one is 3D IFrames. It reminds me of the croquet project.

By the way, introducing this GltfModel seems a bit over the top for me. IMHO, the scene element should focus only on a simple use case. For an advanced use case the WebGL (or WebGPU) may be suitable.

Of course, the boundary between simple and complex varies among individuals. I'd like to know others' thought.

blairmacintyre commented 4 years ago

When we'd chatted about things like this in the past, some questions have come up:

how can we guarantee this will actually perform better in any given browser than in-browser rendering, especially as webgpu, wasm, shared array buffers, etc., become common. Would someone be willing to create and completely open-source a highly optimized and performant renderer that could be included in any browser? Especially given differences across platforms
similarly, what about the visual fidelity. How do we standardize things so they look the same on all platforms, so user's get what they expect
just "rendering" via a single call is fine, especially if it's the only thing being rendered. But that seems unlikely, and the example source give seems too simple for almost any realistic use case. As soon as there is more than one model, how do we do other things: physics, shadows, etc. Without having the model in javascript, none of these are possible.

sushraja-msft commented 4 years ago

@technohippy - thank you for sharing your views. In the proposal I hope it was clear that <scene> will exist, the idea behind supporting webgl interop was to not lock this capability to render a model behind a HtmlElement and keep it more flexible. Similar to the image element that both acts as a viewer and lends its capability as a image format decoder (when used as a parameter for texImage2D to create webgl textures).

Like you, I hope to hear more developer feedback around preference for vs + webgl interop and reach consensus.

sushraja-msft commented 4 years ago

@blairmacintyre thanks for bringing up these concerns

I agree we should at least measure performance against existing libraries but it could be with a basic native render er with a suitable model rather than a full fledged implementation from the get go. I hope to do some prototyping in this space to learn more, stay tuned.

Visual fidelity would be achieved the same way other open standards have evolved. This would be in the domain of the GTLF specification, rigorous conformance tests and spec updates to ensure compatible rendering.

To your last question the model is exposed to javascript. You would implement physics using a JS/Web Assembly library that updates transforms on the model.scene.children[0].transform after running its simulation.

A single GLTF model file can represent entire scenes, and so shadows can either work through native GLTF supporting extensions for lighting https://github.com/KhronosGroup/glTF/tree/master/extensions/2.0/Khronos/KHR_lights_punctual or perhaps webgl can use the single call "rendering" to render a depth map from the lights point of view to generate a shadow map or perform deferred lighting - I concede I need to learn more about shadow rendering techniques to comment further. At the moment, I am trying to see if this idea of adding GLTF support to the browser (perhaps even just as a <scene> element) resonates with the web community before investigating further.

avaer commented 4 years ago

I kind of agree with Blair here; GLTF rendering is already done by WebGL frameworks in opinionated ways that make sense for them. It's not clear how having the browser handle the render would make performance better, and my gut says it wouldn't. The rest of the use cases can already be done today without a native GLTF feature.

An alternative I see is just having a native loader for GLTF that deals with laying out the file into JS memory.

Loading is easy to get wrong or slow, requires extra user code, and there isn't much reason to parse GLTF in a bespoke fashion (much like how we deal with other media elements). Such a native loader would solve some problems without opening the can of worms of declaring blessed rendering passes at the browser level.

cdata commented 4 years ago

Hey there, maintainer of the <model-viewer> project here. I'm very excited to see this discussion happening in this group 😄

I wanted to bring attention to a long-sought-after capability on the web in the context of this proposal: the ability to compose traditional "2D" DOM content into 3D scenes. The fundamental capability required to mix 2D and 3D content currently does not exist in the web platform, and most hacks to approximate such a capability are impractical.

In the past, there was some effort to realize this capability in the form of some kind of DOM-to-texture API. However, these proposals were rejected, in part because WebGL's capabilities would give an attacker the ability to infer secure content with timing attacks.

A high-level and/or declarative 3D API might enable composition of 2D and 3D content in ways that are both safe and coherent with the existing DOM style, layout and interaction systems.

There are a lot of great questions that would need to be answered, many of which we are exploring in the context of the <model-viewer> project. Rendering fidelity, scene graph representation, styling, interactivity... there is a lot to mull over for sure 😁

Finally, I would like to offer that 2D / 3D composition may ultimately be considered an orthogonal proposal to this one. Be that as it may, I'm sharing this use case because to me it implies that a native integration with glTF ought to manifest as something higher-level than a loader that interacts with WebGL.

sushraja-msft commented 4 years ago

Thanks @modulesio and @cdata for sharing your views 😄. "WebGL's capabilities would give an attacker the ability to infer secure content with timing attacks." - thanks for bringing up this attack vector, I am assuming this is about form fill data, visited links that would leak. Adds further weight behind a scene element approach in order to not have feature fragmentation.

blairmacintyre commented 4 years ago

One thing I want to clarify: I asked questions, I didn't say "This is a bad idea because of these things." As I said, I've had these discussions before, sometimes as the one suggesting it, sometimes as the one asking questions, and the questions I asked are pretty significant problems that need to be addressed for this to be practical.

@cdata points out that there are good reasons to have a declarative 3D API for the web. I agree. But, native glTF isn't it, in and of itself. It would be a part. But, again, there is a lot more that needs to be addressed. And it's well trodden ground. X3D, VRML, and so on. Various explorations in the OpenGeospatialData Consortium (including ARML2, which I was part of).

In the long run, I believe we will have a declarative format, or more than one format. Especially for use cases I care about (involving long-running AR apps on head-worn displays), it's pretty much essential.

cdata commented 4 years ago

@blairmacintyre thanks for the follow-up. It's all good discussion, and like I said I'm excited to see it happening.

I agree. But, native glTF isn't it, in and of itself. It would be a part.

To build on what I think you are getting at here: glTF is a great file format for delivering complex 3D assets to the web, but I agree that it is only a facet of what I was striking at in my comment.

If glTF is the serialized scene graph, the thing that is missing from the web platform is a standard model for representing deserialized scene graphs. And, as you pointed out, there is a long history of (failed) attempts to realize such a thing on the web.

Speaking from my time working with users of <model-viewer>, I can say a few things with confidence:

Non-graphics-inclined users want to deliver complex 3D scenes to the browser
They see glTF as a reasonable basis for this, but usually they want additional features on top, including:
- Interaction (e.g., "the car door opens when the user clicks it")
- Composition (e.g., "show a table model next to a user-selected chair model")
- Style (e.g., "render the same chair, but change the base color of the upholstery")
They are optimistic that even if glTF doesn't do everything they need it to today, it will be an important part of the long-arc of 3D asset delivery on the web
- This is born out by the evident momentum and participation in the Khronos 3D Formats group

The question that has been digging at me for some time now is: can we use glTF's highly declarative scene graph representation as a blueprint for a scene graph representation in the browser?

Perhaps the answer is one part loader, one part built-in rendering facility.

Or, perhaps it is best to start with a high-level <scene> element and then deconstruct it in the future as content authors ask for more direct access to the rendered scene.

Perhaps it's something else entirely.

Regardless, I'm excited to continue reading everyone's thoughts about this topic 👍

TrevorFSmith commented 4 years ago

Glad to see these ideas getting a bit of attention. Here are a few links that provide a bit of the context of previous discussions that @blairmacintyre mentioned. In each Issue or repo there is a proposal for using glTF as the basis for a scene (or scene fragment) and in most cases there is a markup component that fills in some set of features around the ones provided by glTF.

Expose multitasking support for webXR: https://github.com/immersive-web/proposals/issues/15

Extending WebExtensions for XR: https://github.com/immersive-web/proposals/issues/43

Declarative XR Language: https://github.com/immersive-web/proposals/issues/44

Spatial favicons: https://github.com/immersive-web/spatial-favicons

The existing <model-viewer> library is an existence proof that we can provide a custom element that satisfies the quick "show a model" use case with no new standards. A-frame is an existence proof that we can provide a declarative syntax for rich interactive scenes with no new standards.

Let's think beyond a single use case and really dig into where standards and browser functionality will be generally useful in ways that cannot be achieved using existing APIs. If there are unsolvable problems with existing libraries then let's dig into whether they're inherently a lack of standards / browser support or they're gaps in the libraries that we need to fill.

One example: In PotassiumES I created a tool, KSS, that uses a single CSS-ish syntax to style DOM content as well as spatial content in a Three.js scene. There's a build step that separates out KSS styles into separate files (CSS for DOM and JSON for a spatial object model or "SOM") and then code to apply the spatial styles to Three.js. So, I'm using a JS library (including a separate CSS grid implementation!) to provide for spatial content what is natively supplied for flat content by the CSS standards and browsers.

Web developers who are comfy with existing flat web tech face a huge chasm to jump into spatial content and that's the place where we can do the most good. We should pick apart the problem and see what we can do to make the web treat all three display type (flat, portal, and immersive) and the many input types (kb/m, touch, wands, hands, voice, ...) as true peers as far as web developers are concerned.

sushraja-msft commented 4 years ago

Thanks for the post @TrevorFSmith, I think I am hearing JS can handle <scene> element so lets focus on what is missing from the platform. 😊

As your post explains that "Exposing multitasking", "Extending Web Extensions" require multiple contexts to render to the same 3D scene - so far the only way I can think of this working is through the browser rendering a shared scene graph that multiple contexts contribute to, something existing APIs can't do.

The core of my proposal is to delegate rendering a scene graph to the browser - I chose GLTF as the scene graph serialization format because of the momentum I see around it, the ease of development that exporting .glb from a 3D modelling tool and loading the scene in a page with few lines of markup with f12 support is appealing and should help developers jump the chasm from 2D to 3D.

Building <scene> element is a stepping stone to enabling XR scenarios, current web properties like say redfin's 3d view or ikea's kitchen planner could migrate to a <scene> enjoying a small step up in ease of development, performance. Later on an AR device the same <scene> can offer pop out views from the web page without context switching to an immersive view, additional UX could allow for placing the scene in the real world and then the 3D content would track better because of its native rendering. For now, I am focusing on native GLTF, so that in the future we can enable those AR scenarios.

Lack of expensiveness in GLTF for adding interaction, input, styling is something I think script libraries can compensate for. Such libraries can work on supplemental markup and modify the scene graph.

Kudos on PottassiumES! I'm going to spend sometime reading through comments on proposals you linked.

cdata commented 4 years ago

I love the picture you painted in describing what KSS does @TrevorFSmith. I hope that the web will natively support a 3D / immersive content authoring flow that looks like that someday.

I strongly agree with your sentiment that we should focus on missing capabilities. That said, there is a strategic reason I can imagine for defining a <scene> element despite the existence proof of <model-viewer> and others:

A high level DOM element like <scene> that supports a format like glTF might beget the internal notion of a scene graph
Features mixing 2D and 3D content (like annotations) could be layered on to <scene> over time, laying the groundwork for future compositional capabilities
As the internal notion of the scene graph matures, it could serve as the basis for an author-facing scene graph object model

This approach is reminiscent to me of how the <video> element eventually served as a basis for specifying shadow roots and their encapsulation properties.

This is just an idea. I do agree that it is best to focus on missing capabilities, and this is just one way I could imagine them arising on the web platform.

himorin commented 1 year ago

@AdaRoseCannon @Yonet close this along with #76 ?

trusktr commented 1 year ago

It seems like a pain that DOMMatrix is left-handed Y down, while GLTF (and Three.js, etc) are right-handed Y up.

How can this be solved?

What if we introduce a new Matrix4 class, that is right-handed Y up, and can be used anywhere that DOMMatrix can be with DOM APIs, but the only difference is that behind the scenes the browser engine converts to left-handed Y down when applying the transforms to CSS-rendered content? Is this possible? If so, then it'd make things really easy for the 3D world (besides small caveats like having to write negative Y values to match DOM content instead of positive Y values, and math libs for one system not working in the other very easily).

I've been working on LUME (3D HTML), which by default uses left-handed Y down on the top level API, but under the hood it maps to a right-handed Y up implementation using Three.js. I'm imagining that the Matrix4-to-DOMMatrix equivalent in a browser would be similar, but converting in the opposite direction.

immersive-web / proposals

Native GLTF #52

Prior Art

API Shape