
Significant Performance Drop and High CPU Usage with BatchedMesh #28776

Open lanvada opened 3 months ago

lanvada commented 3 months ago

Description

Hello,

I exported a building model from Revit in glTF format and merged meshes with the same materials to manage their visibility in Three.js using the BatchedMesh class. However, I've encountered a significant performance issue when rendering these merged meshes with BatchedMesh compared to using Mesh.

Performance Comparison:

This drastic difference in performance is concerning, especially the high CPU load and low frame rate when using BatchedMesh. I've already set .perObjectFrustumCulled and .sortObjects to false on the BatchedMesh; setting them to true leads to an even more severe frame rate drop.
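
For reference, a minimal snippet of the two flags mentioned above being disabled (batchedMesh is a placeholder name for the BatchedMesh built from the merged geometry):

    // Both flags are real BatchedMesh properties; batchedMesh is illustrative.
    batchedMesh.perObjectFrustumCulled = false;
    batchedMesh.sortObjects = false;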

Additionally, I'm using three-csm and postprocessing frameworks alongside Three.js.

System Configuration:

Could someone help me understand why BatchedMesh increases the CPU overhead so significantly and suggest any possible optimizations or solutions to improve the frame rate?

Thank you!

Reproduction steps

Use BatchedMesh to render more than 10 million triangles and vertices, spread across about 100,000 different geometries.

Code

Code in the project batched-mesh-performance-test

Live example

Code in the project batched-mesh-performance-test

Screenshots

No response

Version

r165

Device

No response

Browser

No response

OS

No response

lanvada commented 3 months ago

In the past few days, I’ve tried to find a solution, but without success. I’ve uploaded the relevant code and models to GitHub. The models consist of 12 million triangles and 16 million vertices. Is such a high CPU performance cost necessary for BatchedMesh? I don’t think it should be. When rendering Batched3DModel in Cesium, I didn’t encounter such issues. I believe the mesh batching in Cesium and BatchedMesh should be quite similar, right?

Additionally, I’d like to mention that after the update to version 166, the performance consumption of BatchedMesh has worsened, and the frame rate has dropped further in the same scene.

Here is the link to the code and models: batched-mesh-performance-test

gkjohnson commented 3 months ago

For the sake of easily understanding the issue, please provide a live example that doesn't require pulling and running a separate GitHub project. You can host a demo page with GitHub Pages, for example. Recordings of the Chrome performance monitor would be helpful, as well.

lanvada commented 3 months ago

Here is the demo link: https://batched-mesh-performance-test.vercel.app

The model is compressed using Draco and is approximately 44MB in size, with a total of 7.6 million triangles and 9.6 million vertices. It takes about 10 seconds to load the model. Initially, the page does not use BatchedMesh, and the frame rate on my computer is 60 FPS. You can switch to BatchedMesh by clicking the button on the bottom left, after which the frame rate drops to about 17 FPS.

lanvada commented 3 months ago

I need to provide some additional details. When exporting the glTF model from Revit, I grouped meshes with the same materials. I added three extensions: EXT_instance_features, EXT_mesh_features, and EXT_mesh_gpu_instancing. I also assigned a _FEATURE_ID_0 attribute to each vertex to differentiate between different batches, and this attribute is parsed during loading. The related code can be found in two TypeScript files in the project I previously provided: MeshFeatures.ts and GltfToolkit.ts. If you need to load the model for debugging, you might need to use the relevant code to parse the different batches.

Since I have already used "_FEATURE_ID_0" to differentiate vertices of different batches, creating BatchedMesh could potentially be implemented by directly assigning values to internal properties (perhaps by renaming "_FEATURE_ID_0" to "_batchId"). However, I have not studied the BatchedMesh code in detail and have only used the BatchedMesh API to add geometries in a straightforward manner. This approach involves iterating over vertices and face indices and results in considerable additional memory allocation and copying overhead, making it inefficient.
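
For context, a minimal sketch (not the project's actual code) of the straightforward approach described above: one BatchedMesh per material, with every source mesh copied in through addGeometry. The meshesForMaterial and material names are placeholders.

    import * as THREE from 'three';

    function buildBatchedMesh( meshesForMaterial, material ) {

        // Size the shared buffers up front.
        let maxVertexCount = 0, maxIndexCount = 0;
        for ( const mesh of meshesForMaterial ) {
            maxVertexCount += mesh.geometry.attributes.position.count;
            maxIndexCount += mesh.geometry.index ? mesh.geometry.index.count : 0;
        }

        const batchedMesh = new THREE.BatchedMesh( meshesForMaterial.length, maxVertexCount, maxIndexCount, material );

        for ( const mesh of meshesForMaterial ) {
            // addGeometry copies the vertex and index data into the shared
            // buffers, which is where the extra allocation and copying comes from.
            const id = batchedMesh.addGeometry( mesh.geometry );
            batchedMesh.setMatrixAt( id, mesh.matrixWorld );
        }

        return batchedMesh;
    }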

gkjohnson commented 3 months ago

Thanks for producing a live link. I think this demo is too complicated to dig into, though. There are over 800 individual meshes and a mix of batched and instanced meshes, as well as a lot of custom GLTF user code, that make it difficult to understand what's going on. I think it would be best if we had an example that used a single batched mesh compared to a merged mesh to show any performance differences. Ideally without any external geometry file dependencies.

lanvada commented 3 months ago

Thanks for producing a live link. I think this demo is too complicated to dig into, though. There are over 800 individual meshes and a mix of batched and instanced meshes, as well as a lot of custom GLTF user code, that make it difficult to understand what's going on. I think it would be best if we had an example that used a single batched mesh compared to a merged mesh to show any performance differences. Ideally without any external geometry file dependencies.

Replicating this issue with a single BatchedMesh is actually quite "simple." You just need to increase the MAX_GEOMETRY_COUNT in the webgl_mesh_batch.html example to ten times its previous value. On my computer, when the geometryCount is 20,000, the CPU usage is around 20%. When the geometryCount is increased to 200,000, CPU usage rises to between 50% and 60%, yet GPU usage remains unchanged.

Turning off the sortObjects, perObjectFrustumCulled, and useCustomSort options can reduce CPU usage by about 5%.

Additionally, I've noticed that enabling only the sortObjects option decreases the frame rate from 30 to 9. Is this a normal phenomenon?

gkjohnson commented 3 months ago

Replicating this issue with a single BatchedMesh is actually quite "simple."

I understand but I'm asking for a minimal reproduction case to be provided. I think it's a more than reasonable ask for a simple demonstration case, separate from user code, to be made when reporting an issue and asking maintainers to spend time investigating. I can take a closer look once this minimal repro is available.

Additionally, I've noticed that enabling only the sortObjects option decreases the frame rate from 30 to 9. Is this a normal phenomenon?

It depends on how many objects there are and where the bottleneck is. Frustum culling and sorting share a lot of the same logic, though, so enabling one or the other has a larger apparent impact than enabling the second when one is already on. If you provide a simple reproduction case it will be easier to understand what you're describing.

lanvada commented 3 months ago

I can take a closer look once this minimal repro is available.

It depends on how many objects there are and where the bottleneck is. Frustum culling and sorting share a lot of the same logic, though, so enabling one or the other has a larger apparent impact than enabling the second when one is already on. If you provide a simple reproduction case it will be easier to understand what you're describing.

Ah, the case I mentioned above is actually based on the examples/webgl_mesh_batch.html. All I did was change the MAX_GEOMETRY_COUNT to 200,000 directly in the HTML, and then set it to this number in the browser. Give me a moment to fork this project and make the change, then I'll deploy it on Vercel. Alternatively, if it's convenient for you, you could just tweak the batch count limit in this example to replicate the issues I've mentioned.

lanvada commented 3 months ago

Sorry about this—I'm not very good at English, so I often rely on ChatGPT to help me write. If there are any impolite words or phrases, please forgive me...

gkjohnson commented 3 months ago

Ah, the case I mentioned above is actually based on the examples/webgl_mesh_batch.html. All I did was change the MAX_GEOMETRY_COUNT to 200,000 directly in the HTML, and then set it to this number in the browser. Give me a moment to fork this project and make the change, then I'll deploy it on Vercel.

If the sort behavior is separate from the original performance question then I'd prefer to focus on one thing at a time. You can ask at the forum if you'd like help understanding the performance implications of sorting objects.

Please provide a simple example in something like jsfiddle that shows the performance differences you're observing in https://github.com/mrdoob/three.js/issues/28776#issue-2383173340 without using any custom 3d model or complex feature processing logic.

lanvada commented 3 months ago

I've set up a page where you can switch between "BatchedMesh" and "MergedMesh". Here's the link: https://batched-mesh-performance-example.vercel.app/. Switching to "MergedMesh" might take about ten seconds or so.

What I've noticed is that when using "BatchedMesh", the CPU usage significantly increases—from 15% to 40% on my computer.

I did a quick debug with Spector.js and found that enabling the sortObjects option causes the texSubImage2D function to take up too much time, leading to a drop in frame rate. However, when I turn off sortObjects, only the multiDrawElementsWEBGL function remains. I'm wondering if the increase in CPU usage is a necessary cost of using multiDrawElementsWEBGL.

Another issue: when there are many materials in the scene (multiple BatchedMeshes or MergedMeshes), using the "MergedMesh" method allows the GPU to perform at its best, nearing 100% utilization. But with the "BatchedMesh" method, the GPU utilization seems to be about the same as when there's only a single material, around 30%.

I'm not sure whether the above situations can be optimized, or whether this is just the nature of the WebGL API.

Shakhriddin commented 3 months ago

Replicating this issue with a single BatchedMesh is actually quite "simple." You just need to increase the MAX_GEOMETRY_COUNT in the webgl_mesh_batch.html example to ten times its previous value. On my computer, when the geometryCount is 20,000, the CPU usage is around 20%. When the geometryCount is increased to 200,000, CPU usage rises to between 50% and 60%, yet GPU usage remains unchanged.

If you set geometryCount to 200,000, the freezes are caused by updating the _indirectTexture. Of course, this is because there are a lot of instances, and looping through them all takes a lot of time. You can see this in the screenshot:

[screenshot]

@gkjohnson, @lanvada

gkjohnson commented 2 months ago

I've made a simpler example that just uses javascript and cubes to understand things a bit better. This demo allows for changes between a merged geometry, batched mesh, and instanced mesh by changing the "MODE" flag at the top. It also removes any extra texture sampling logic used in BatchedMesh to remove that as a possible performance bottleneck:

jsfiddle link

I'm seeing that between the three options, BatchedMesh is the only one that suffers from this performance degradation. Instances and merged geometry both work fine otherwise. InstancedMesh and the merged geometry run at 120 fps while the BatchedMesh runs at ~30 fps on my 2021 M1 Pro Macbook.

In terms of why this is happening - my only guess is that it's due to the buffers of draw "starts" and draw "counts" that must be uploaded to the GPU for drawing every frame, which will amount to ~1.6 MB of data for 200,000 items. It's hard to say for sure, though, because this isn't showing up on the profiler. It's possible that this GPU data upload is happening asynchronously and not reflected in the profiler unlike some of the texture upload function calls.

In the original example all of the problematic BatchedMesh sub geometry draws seem to be unique so unfortunately without something like indirect draw support (supported in WebGPU) I think this is just pushing the limits of what we can do with BatchedMesh too far.
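
To give a rough sense of the cost being hypothesized here, a sketch of the arithmetic behind the ~1.6 MB figure and of the raw WEBGL_multi_draw call that consumes the starts/counts arrays every frame (gl, starts and counts are placeholders for the rendering context and BatchedMesh's internal Int32Arrays):

    const drawCount = 200000;
    // Two Int32Arrays (starts + counts), 4 bytes per entry, submitted every frame:
    const perFrameBytes = drawCount * Int32Array.BYTES_PER_ELEMENT * 2; // 1,600,000 bytes, ~1.6 MB

    function multiDraw( gl, starts, counts ) {
        const ext = gl.getExtension( 'WEBGL_multi_draw' );
        // counts: index counts per sub-range; starts: byte offsets into the index buffer.
        ext.multiDrawElementsWEBGL( gl.TRIANGLES, counts, 0, gl.UNSIGNED_INT, starts, 0, drawCount );
    }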

lanvada commented 2 months ago

I've made a simpler example that just uses javascript and cubes to understand things a bit better. This demo allows for changes between a merged geometry, batched mesh, and instanced mesh by changing the "MODE" flag at the top. It also removes any extra texture sampling logic used in BatchedMesh to remove that as a possible performance bottleneck:

Thank you very much for your response and for creating a new example. Does this mean that the operation causing the increase in CPU usage on my computer could be the data upload to the GPU? Another phenomenon is that on my desktop with a dedicated GPU, the GPU utilization can reach over 80% in examples not using BatchedMesh, but with BatchedMesh, it only peaks at 30%. Could this be due to the GPU waiting for data uploads?

It's frustrating that both the rising CPU usage and the GPU not running at full capacity seem to be problems inherent to WebGL itself, and apparently unsolvable. However, you mentioned indirect draw support in WebGPU. If I switch to using WebGPURenderer, would it resolve these WebGL bottlenecks? If it's theoretically feasible, I might try switching the renderer in my current project to WebGPU.

gkjohnson commented 2 months ago

increase in CPU usage on my computer could be the data upload to the GPU ... Could this be due to the GPU waiting for data uploads?

If what I've suggested is the cause - then yes it would explain the higher CPU usage and less GPU usage.

If I switch to using WebGPURenderer, would it resolve these WebGL bottlenecks?

I'm not aware of the current capabilities of three.js' WebGPURenderer, so I can't say. But I expect it to eventually be supported if it's not now.

lanvada commented 2 months ago

I'm not aware of the current capabilities of three.js' WebGPURenderer, so I can't say. But I expect it to eventually be supported if it's not now.

Thank you for your insights. I'll look into the current state of three.js' WebGPURenderer and see if it supports the features needed to overcome these limitations. If it's not currently supported, I'll keep an eye on updates. Your explanation has been very helpful in clarifying the potential causes of the performance issues I'm facing.

lanvada commented 2 months ago

If I switch to using WebGPURenderer, would it resolve these WebGL bottlenecks?

I'm not aware of the current capabilities of three.js' WebGPURenderer, so I can't say. But I expect it to eventually be supported if it's not now.

I switched to the WebGPURenderer in this example batched-mesh-performance-example, but unfortunately, I found that the frame rate with BatchedMesh is even lower now...

John-Simth commented 2 months ago

It could be something else. For me, the frame rate with BatchedMesh in WebGL is 8 FPS, but 22 FPS in WebGPU.

lanvada commented 2 months ago

It could be something else. For me, the frame rate with BatchedMesh in WebGL is 8 FPS, but 22 FPS in WebGPU.

What graphics card and operating system are you using? Also, which browser are you using? My graphics card is an RTX 2080 Super, and I'm on Windows using Chrome.

lanvada commented 2 months ago

It seems you are using batchedMesh in an incorrect way. You should create a single batchedMesh and then add meshes with the same material into it, rather than creating a separate batchedMesh for each individual mesh.

I definitely didn't make a mistake there; of course, I created only one BatchedMesh. You can also see through Spector.js that there is only one draw call. How could it be that multiple BatchedMeshes were created?

lanvada commented 2 months ago

It seems you are using batchedMesh in an incorrect way. You should create a single batchedMesh and then add meshes with the same material into it, rather than creating a separate batchedMesh for each individual mesh.

Are you referring to the "batched-mesh-performance-test" project? That example was too complex and is no longer in use. You can check this one instead: batched-mesh-performance-example. However, even in the batched-mesh-performance-test example, if you carefully read the code related to the creation of BatchedMesh, you would see that I created only one BatchedMesh for each identical material, not multiple BatchedMeshes.

John-Simth commented 2 months ago

It could be something else. For me, the frame rate with BatchedMesh in WebGL is 8 FPS, but 22 FPS in WebGPU.

What graphics card and operating system are you using? Also, which browser are you using? My graphics card is an RTX 2080 Super, and I'm on Windows using Chrome.

My graphics card is an RTX 2050 4GB. I tested batched-mesh-performance-example on Edge and Chrome, and both ran at nearly 8 FPS in WebGL and 17-22 FPS in WebGPU! I'm not sure why my results differ from yours.

nkallen commented 2 months ago

I deleted my previous post because I misunderstood something.

I believe that in the current version of BatchedMesh, multiDrawArraysInstancedWEBGL is not used. It is not used in the examples provided by @gkjohnson and @lanvada.

So what is being compared in the examples above:

  1. one call to multiDrawElementsWEBGL with very large starts/counts arrays (100k elements)
  2. one call to drawElementsInstanced with one geometry and a large number for primcount (=100k)
  3. one call to drawElements with one (giant) geometry

IIUC, the results are NOT actually surprising or that bad. multiDrawElementsWEBGL with large starts and counts arrays is an optimization over calling drawElements thousands of times. In practice, it means you can maintain 60fps with 40k virtual draw calls instead of 5k real draw calls (or VAO bindings).

The specific workflow of @lanvada, which is Revit CAD data, should probably not use multiDrawElementsWEBGL in this way. One single mesh is a great approach if it is static. But alternatively it should use multiDrawArraysInstancedWEBGL, since he has something like 800 unique geometries but many instances of each. We don't really have a benchmark of that, but according to these presentations from NVIDIA it should work well:

https://on-demand.gputechconf.com/gtc/2013/presentations/S3032-Advanced-Scenegraph-Rendering-Pipeline.pdf https://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf

gkjohnson commented 2 months ago

But alternatively it should use multiDrawArraysInstancedWEBGL, since he has something like 800 unique geometries but many instances of each.

Unless there's something odd in the way the model data is being stored this isn't the case - the original demo in the OP creates InstancedMeshes for anything with instances, and then everything in BatchedMesh is a unique geometry. That's how it appears from the current parsing logic, at least.

IIUC, the results are NOT actually surprising or that bad.

Agreed, but the surprising thing is that this upload timing doesn't seem to show up at all in the measured performance metrics. It makes it difficult to understand where exactly this is coming from. But as I've mentioned I assume it's from the starts and counts buffer uploads.

If there are practical use cases shown that multiDrawArraysInstancedWEBGL significantly improves in this respect I'm open to switching the BatchedMesh implementation. I just don't think it will address this specific case. cc @RenaudRohlinger

nkallen commented 2 months ago

I think as we get into these extreme performance cases where multiDrawArraysInstancedWEBGL might be beneficial, it's probably best for users to explicitly invoke the gl calls. It's not elegant, but it can be done using standard materials and onAfterRender to issue the draw call

RenaudRohlinger commented 2 months ago

Since support for multiDrawArraysInstancedWEBGL was introduced in the WebGLRenderer by https://github.com/mrdoob/three.js/pull/28103, it is still possible to have a custom BatchedMesh class that supports batch instanced draw calls by using the object._multiDrawInstances property (which I'm doing for a project).

Furthermore, using the same object._multiDrawInstances approach, I recently submitted two pull requests to reinstate multiDrawArraysInstancedWEBGL support in the WebGL backend and to implement compatibility for the WebGPURenderer: https://github.com/mrdoob/three.js/pull/28753 https://github.com/mrdoob/three.js/pull/28759

So it's a bit tricky but as long as we keep the _multiDrawInstances property we can still use multiDrawArraysInstancedWEBGL with both renderers without gl calls.
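
A heavily hedged sketch of what that could look like: a custom subclass that fills the private _multiDrawInstances array so the renderer takes the instanced multi-draw path. This relies on r165/r166-era renderer internals and, on its own, does not give each instance its own transform; that still requires the custom shader lookup discussed below.

    import * as THREE from 'three';

    class InstancedBatchedMesh extends THREE.BatchedMesh {

        constructor( maxGeometryCount, maxVertexCount, maxIndexCount, material ) {
            super( maxGeometryCount, maxVertexCount, maxIndexCount, material );
            // One instance count per multi-draw range, parallel to the internal
            // _multiDrawStarts / _multiDrawCounts arrays. A non-null array is what
            // makes the renderer issue the instanced multi-draw call.
            this._multiDrawInstances = new Int32Array( maxGeometryCount ).fill( 1 );
        }

        setInstanceCountAt( rangeIndex, count ) {
            // Hypothetical helper; rangeIndex must correspond to the multi-draw
            // slot, which changes if sorting or per-object culling repacks the list.
            this._multiDrawInstances[ rangeIndex ] = count;
        }
    }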

gkjohnson commented 2 months ago

is still possible to have a custom BatchedMesh class that supports batch instanced draw calls by using the object._multiDrawInstances property (which I'm doing for a project).

Of course, but the goal here is to enable this without end-users having to write custom shaders to take advantage of the functionality. It's been suggested multiple times that this would happen, but it would be nice if someone shared a public demonstration of how multiDrawArraysInstancedWEBGL and _multiDrawInstances are being used in practice so we can discuss the pros / cons and how / if it should be used in a three.js class.

The big question for me is how you calculate the item index using the gl_DrawID and gl_InstanceID to sample from a tightly packed data texture / buffer (ie for the matrix transform texture) when you're drawing sets of instances with different counts.

nkallen commented 2 months ago

The big question for me is how you calculate the item index using the gl_DrawID and gl_InstanceID to sample from a tightly packed data texture / buffer (ie for the matrix transform texture) when you're drawing sets of instances with different counts.

The way that I am thinking about doing it is looking up an offset and a count in one texture based on gl_DrawID, and then addressing a second texture with offset + count * gl_InstanceId * sizeof_transform

I'm using this (experimental) translation of OffsetAllocator based on the work of Sebastian Aaltonen, which is how I manage buffers explicitly; it has extremely high occupancy. I'm not proposing to add this to three.js, though:

https://gist.github.com/nkallen/f4ed889dc98e9a9da7283a01e3308450

nkallen commented 2 months ago

But I should also note: it is possible to put everything in the same buffer and just issue thousands of calls to drawElementsInstanced. It's extremely fast because you do not need to switch the VAO. For example, in the code below, note that I am just incrementing the offset of the vertexAttribPointer. I have benchmarked this on Apple and AMD GPUs and it can do 5k calls to drawElementsInstanced in < 1 ms:

    onAfterRender(renderer, scene, camera, geometry, material, group) {
        const { _multiDrawStarts, _multiDrawCounts, _multiDrawCount } = this;
        const gl = renderer.getContext() as WebGL2RenderingContext;

        // All instance data lives in one interleaved buffer; only the attribute
        // offsets change between draws, so the VAO is never rebound.
        gl.bindBuffer(gl.ARRAY_BUFFER, this.geometry.attributes.instanceStart.data.buffer);

        for (let i = 0; i < _multiDrawCount; i++) {
            const start = _multiDrawStarts[i];
            const primcount = _multiDrawCounts[i];

            // 24-byte stride: two interleaved vec3 float attributes at locations 3 and 4.
            const offset = start + i * 24;
            gl.vertexAttribPointer(3, 3, gl.FLOAT, false, 24, offset);
            gl.vertexAttribPointer(4, 3, gl.FLOAT, false, 24, offset + 12);

            gl.drawElementsInstanced(gl.TRIANGLES, 18, gl.UNSIGNED_SHORT, 0, primcount / 6);
        }
    }

RenaudRohlinger commented 2 months ago

I implemented this a while back, and if I recall correctly, I used an extra data texture to perform the lookup between the offset and the count, composed in the onBeforeRender hook. This approach seems similar to what nkallen described in their last two comments.

Also, in my implementation, for simplicity and to handle larger data sizes for batching matrices (such as batch instanced skinning, where the number of matrices multiplied by the number of bones is significant), I used sampler2DArray. This method does imply a limit of 2048 different geometries. But I remember struggling with the lookup and ultimately decided to use sampler2DArray to simplify the process.

gkjohnson commented 2 months ago

Thanks for the explanation! I thought there might be a method for calculating this without an extra texture sample. It would be possible to pack these offsets into the beginning of the "indirect index" texture, though. We'd have to know how many geometries will be added up front to pack it perfectly tightly, but if the capacity is reached then the texture could be expanded.

It would work like so:

    // glsl

    // Per-draw offsets are packed at the start of the indirect index texture,
    // followed by the per-instance index list.
    int size = textureSize( indirectIndexTexture, 0 ).x;

    // First fetch: where this draw's instance list begins.
    ivec2 offsetPx = ivec2( gl_DrawID % size, gl_DrawID / size );
    int offset = texelFetch( indirectIndexTexture, offsetPx, 0 ).r;

    // Second fetch: the actual item index for this instance of this draw.
    ivec2 indexPx = ivec2( ( offset + gl_InstanceID ) % size, ( offset + gl_InstanceID ) / size );
    int index = texelFetch( indirectIndexTexture, indexPx, 0 ).r;

    // use index to sample matrices, texture properties, etc.

This would have the downside of not allowing sorting for overdraw compensation or transparency between instance groups, but it could improve performance in extreme cases where a ton of instances need to be used in a BatchedMesh. Again, it's not clear that this is what's needed for OP's use case, though.

Anyway - I won't be working on this but it's something to keep in mind if this comes up or we want to make the use of instanced multi-draw more accessible. It may be possible to add something like a toggle to BatchedMesh to switch between the two modes, but I'm not sure how complicated that would be.

QuisMagni commented 1 month ago

I also found performance issues with the use of BatchedMeshes in one of my projects. In my example, only a single instance is used per mesh. When around 100 different materials are used, there are already significant performance differences between merged meshes and BatchedMesh. Draw calls are exactly the same (as expected).

Here is the example: https://codesandbox.io/p/sandbox/three-js-forked-g69j8w The scene is switching between batched mesh and merged mesh every 5 seconds.

lanvada commented 1 month ago

Here is the example: https://codesandbox.io/p/sandbox/three-js-forked-g69j8w The scene is switching between batched mesh and merged mesh every 5 seconds.

Yes, indeed, I just ran your example, and on my computer, using BatchedMesh got 47fps, MergedMesh got 60fps, and I observed a significant increase in CPU and GPU usage after switching to BatchedMesh.

QuisMagni commented 1 month ago

For dedicated desktop gpus it might be necessary to increase the geometry count (const c = 400) to get the fps drop under 60fps (or whatever your monitor likes).

lanvada commented 1 month ago

For dedicated desktop gpus it might be necessary to increase the geometry count (const c = 400) to get the fps drop under 60fps (or whatever your monitor likes).

I am currently using an RTX 2080s. I feel that in this not very complex scene, the frame rate dropping to 47 is already quite low; there is indeed a noticeable decrease in frame rate.

nkallen commented 1 month ago

If you don't have a dynamic scene you shouldn't use BatchedMesh. The purpose of BatchedMesh is to be able to show/hide individual objects, transform individual objects, etc., as well as do frustum culling and sorting objects by z in camera space. If you don't need any of that, then you are paying a significant cost for no reason. The total number of BatchedMeshes in any given scene should probably be much less than 100.

QuisMagni commented 1 month ago

If you don't have a dynamic scene you shouldn't use BatchedMesh. The purpose of BatchedMesh is to be able to show/hide individual objects, transform individual objects, etc., as well as do frustum culling and sorting objects by z in camera space. If you don't need any of that, then you are paying a significant cost for no reason. The total number of BatchedMeshes in any given scene should probably be much less than 100.

nkallen - thank you for your response! I'm aware of the advantages of BatchedMesh and this is the reason why I want to use them. The total number of batches is driven by the number of different materials I want to use. I'm not sure why the total number of different materials should be limited to less than 100. In a practical project I'm working on, we have 40 different materials for building environment objects and we want to increase that number even more.

Since there is only one draw call per batched mesh and the array for the draw ranges is small (less than 100 entries), I'm wondering why there is so much difference between the merged mesh draw call and the multi elements draw call. But maybe it is not about the draw calls; maybe it's something else under the hood of three.js preventing the BatchedMesh from performing like the merged geometry.

nkallen commented 1 month ago

Each BatchedMesh does sorting and frustum culling, which is dominating the CPU in your example (see BatchedMesh.onBeforeRender). Each BatchedMesh is also binding a few textures, which you aren't using, but it's a bit hard to know if that is what is dominating the GPU time. If you really have ~100+ materials the solution is to pack the material uniforms into a texture, as BatchedMesh currently does with colors only.
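
As a concrete reference for the "colors only" case, a sketch of folding per-item color into a single BatchedMesh with one material via setColorAt, assuming a three.js release where BatchedMesh.setColorAt is available (count, maxVertexCount, maxIndexCount, geometries and colors are placeholders). Extending the same idea to other uniforms is the custom packing work being suggested, not something built in.

    import * as THREE from 'three';

    const batchedMesh = new THREE.BatchedMesh( count, maxVertexCount, maxIndexCount,
        new THREE.MeshStandardMaterial() );

    for ( let i = 0; i < count; i ++ ) {
        const id = batchedMesh.addGeometry( geometries[ i ] );
        // Stored in the internal batching color texture instead of a separate material.
        batchedMesh.setColorAt( id, colors[ i ] );
    }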

QuisMagni commented 1 month ago

Each BatchedMesh does sorting and frustum culling, which is dominating the CPU in your example (see BatchedMesh.onBeforeRender). Each BatchedMesh is also binding a few textures, which you aren't using, but it's a bit hard to know if that is what is dominating the GPU time. If you really have ~100+ materials the solution is to pack the material uniforms into a texture, as BatchedMesh currently does with colors only.

I created another fork and updated the example. Now the batched mesh is only using a single material - so overall it is only one single draw call for the whole geometry. It is still performing worse than the merged geometry rendering, which is using many more draw calls now.

I even disabled per object frustum culling and sorting.

https://codesandbox.io/p/sandbox/three-js-forked-rfkcgt

Even with your solution of combining all materials into one single material, I would probably run into the same problems.

QuisMagni commented 1 month ago

Well, the practical application is using a tweaked MeshStandardMaterial with textures and so on. I'm still thinking about your solution. Would it be feasible to combine all 40 materials with different color, normal and roughness maps into one monster material (using a texture atlas with more than 40 entries and baked uniforms) and in this way render all meshes using a single batched mesh?

I mean, I know how to technically do it, but is it worth the effort? Is it a good/recommended idea to go this way to have a single batched mesh?

nkallen commented 1 month ago

@QuisMagni realistically that sounds like a huge pain.... I think you should be fine with ~50 draw calls. You will need to benchmark to see where the issue is and then go from there. I currently work with about 20 BatchedMeshes, with a few of them being extremely large and the rest quite small. I'm easily achieving 60fps. But I have some custom sorting/culling logic and I don't use textures for the matrix transform.

lanvada commented 1 month ago

I've made a simpler example that just uses javascript and cubes to understand things a bit better. This demo allows for changes between a merged geometry, batched mesh, and instanced mesh by changing the "MODE" flag at the top. It also removes any extra texture sampling logic used in BatchedMesh to remove that as a possible performance bottleneck:

jsfiddle link

I'm seeing that between the three options, BatchedMesh is the only one that suffers from this performance degradation. Instances and merged geometry both work fine otherwise. InstancedMesh and the merged geometry run at 120 fps while the BatchedMesh runs at ~30 fps on my 2021 M1 Pro Macbook.

In terms of why this is happening - my only guess is that it's due to the buffers of draw "starts" and draw "counts" that must be uploaded to the GPU for drawing every frame, which will amount to ~1.6 MB of data for 200,000 items. It's hard to say for sure, though, because this isn't showing up on the profiler. It's possible that this GPU data upload is happening asynchronously and not reflected in the profiler unlike some of the texture upload function calls.

In the original example all of the problematic BatchedMesh sub geometry draws seem to be unique so unfortunately without something like indirect draw support (supported in WebGPU) I think this is just pushing the limits of what we can do with BatchedMesh too far.

@QuisMagni

Please look at the comment above; currently, the performance degradation caused by using BatchedMesh when there are a large number of vertices and faces is unsolvable.

nkallen commented 1 month ago

Please look at the comment above; currently, the performance degradation caused by using BatchedMesh when there are a large number of vertices and faces is unsolvable.

Well... this is not really true. You have to understand what is going on and what is being compared. When we are working with high performance code we need to use techniques specific to the problem at hand.

The example above compares multiDrawElementsWEBGL with 100k duplicated items to drawElementsInstanced with a few unique items and 100k instances. The latter is obviously much faster, and the comparison is irrelevant because there is also an instancing variant of multidraw, namely multiDrawElementsInstancedWEBGL. The two versions of instancing will have comparable performance.

The relevant comparison to a 100k-entry multiDrawElementsWEBGL call is 100k calls to drawElements -- or, if we're comparing using vanilla threejs, 100k calls to bindVertexArray and drawElements. The idea is rendering 100k UNIQUE geometries. In the latter case (bind + draw), multidraw is several thousand times faster, and in the former case it's faster by a factor of 10 or so.

nkallen commented 1 month ago

I wanted to add this because I think it would be helpful for people. Assuming you have a dynamic scene (you need to transform, show/hide, or sort individual objects):

  1. If you have < 1,000 unique objects, just use vanilla threejs (bindVertexArray + drawElements)
  2. If you have < 1,000 unique objects but many thousands of instances, just use vanilla+instancing (drawElementsInstanced)
  3. If you have > 1,000 unique objects, use multiDrawElementsWEBGL
  4. If you have > 1,000 unique objects AND many hundreds/thousands of instances, use multiDrawElementsInstancedWEBGL

If your scene isn't dynamic and the geometry is small enough (say < 1gb), merging everything into one buffer can sometimes be best. The key thing is to understand the problem you are trying to solve, and understand what BatchedMesh is doing (#3 only!)
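
For reference, a sketch of the raw WEBGL_multi_draw calls behind options 3 and 4 in the list above (gl, starts, counts, instanceCounts and drawCount are placeholders; the three arrays are parallel Int32Arrays supplied by the caller):

    const ext = gl.getExtension( 'WEBGL_multi_draw' );

    // Option 3: many unique sub-ranges, one instance each.
    ext.multiDrawElementsWEBGL( gl.TRIANGLES, counts, 0, gl.UNSIGNED_INT, starts, 0, drawCount );

    // Option 4: the same sub-ranges, each drawn instanceCounts[ i ] times,
    // with gl_InstanceID available in the shader to address per-instance data.
    ext.multiDrawElementsInstancedWEBGL( gl.TRIANGLES, counts, 0, gl.UNSIGNED_INT, starts, 0,
        instanceCounts, 0, drawCount );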

QuisMagni commented 1 month ago

I've made a simpler example that just uses javascript and cubes to understand things a bit better. [...] In the original example all of the problematic BatchedMesh sub geometry draws seem to be unique so unfortunately without something like indirect draw support (supported in WebGPU) I think this is just pushing the limits of what we can do with BatchedMesh too far.

@QuisMagni

Please look at the comment above; currently, the performance degradation caused by using BatchedMesh when there are a large number of vertices and faces is unsolvable.

Yes, I saw the answer before. My example is quite different. In my first example there are only 60 geometries (later I increased it to 200) for every batched mesh - resulting in 6000 items for 100 batched meshes, if I am correct. So in comparison with your example it is only 3% of the item count, and there is a huge performance hit of around 30% or even more. This is the reason why I was curious whether there is something else going wrong under the hood.

QuisMagni commented 1 month ago

I wanted to add this because I think it would be helpful for people. Assuming you have a dynamic scene (you need to transform, show/hide, or sort individual objects):

  1. If you have < 1,000 unique objects, just use vanilla threejs (bindVertexArray + drawElements)
  2. If you have < 1,000 unique objects but many thousands of instances, just use vanilla+instancing (drawElementsInstanced)
  3. If you have > 1,000 unique objects, use multiDrawElementsWEBGL
  4. If you have > 1,000 unique objects AND many hundreds/thousands of instances, use multiDrawElementsInstancedWEBGL

If your scene isn't dynamic and the geometry is small enough (say < 1gb), merging everything into one buffer can sometimes be best. The key thing is to understand the problem you are trying to solve, and understand what BatchedMesh is doing (#3 only!)

In my example, 6000 geometries are rendered. The performance with BatchedMesh is 30-50% worse compared to merged geometry. This impact is so significant that it might be worthwhile to replicate the necessary dynamic operations (geometry updates, visibility updates, and culling) based on merged geometry and with the help of web workers. I was simply amazed at how poorly BatchedMesh performed in direct comparison, even in such smaller scenarios.

nkallen commented 1 month ago

In my example, 6000 geometries are rendered. The performance with BatchedMesh is 30-50% worse compared to merged geometry. This impact is so significant that it might be worthwhile to replicate the necessary dynamic operations (geometry updates, visibility updates, and culling) based on merged geometry and with the help of web workers. I was simply amazed at how poorly BatchedMesh performed in direct comparison, even in such smaller scenarios.

In your case, I would try to see where the bottleneck is. You should be able to disable frustum culling and sorting pretty easily. CPU usage should drop to basically zero. How much is the difference at that point? The remaining discrepancy should be GPU overhead and must come from somewhere, but it could be the textures, the multidraw arrays, or -- less likely -- inherent overhead in calling multiDrawElementsWEBGL.

I would then compare it to merged geometry without rebinding the VAO: override onAfterRender and explicitly call gl.drawElements in a loop. Three.js will have already bound the VAO and set the program (material/shader).

You can then decide where to go from there. I am skeptical that sorting and frustum culling in a worker will end up being the preferred approach... I can't know in advance, but it seems like the theoretical maximum WebGL performance would come from a loop like the following:

    bindVAO(); // one geometry shared by multiple objects
    for (const material of materials) { // one object3d per material
      setProgram(material);
      for (const texture of material.textures) { // onAfterRender
        bindTexture(texture);
        for (const [count, starts, counts] of texture.multidraw) {
           gl.multiDrawElements(...);
        }
      }
    }

lanvada commented 1 month ago

  4. If you have > 1,000 unique objects AND many hundreds/thousands of instances, use multiDrawElementsInstancedWEBGL

@nkallen Alright, my current use case involves rendering BIM architectural models, which contain a large number of different objects (walls, floors, pipes, etc.), as well as many objects that can be instantiated (doors, windows, etc.). I believe I fall under the fourth scenario, right? I've looked into the multiDrawElementsInstancedWEBGL code, and I found that it is only called when the _multiDrawInstances property on BatchedMesh is set. However, there's currently no public interface to modify _multiDrawInstances. Is the related functionality still incomplete?

nkallen commented 1 month ago

Alright, my current use case involves rendering BIM architectural models, which contain a large number of different objects (walls, floors, pipes, etc.), as well as many objects that can be instantiated (doors, windows, etc.). I believe I fall under the fourth scenario, right? I've looked into the multiDrawElementsInstancedWEBGL code, and I found that

Yes, #4 is my best guess for you, although if you don't have a dynamic scene and the amount of data isn't enormous, you can also materialize everything into one buffer and just render it in one drawElements call (or one drawElements call per material).

But since it all depends on how dynamic the scene is, how many triangles, how many materials, how many instances, how many unique objects, etc., we can't know what is best without benchmarking all of the options. I would test each option in onAfterRender by calling the raw gl functions directly, and disable rendering in onBeforeRender (e.g., via setDrawRange). That way three.js will just be setting the program and the VAO for you. You should be able to get a basic order of magnitude for each approach and go from there.
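
A possible harness for that kind of test, under the assumption that setting a zero draw range in onBeforeRender suppresses the default draw while three.js still binds the program and VAO (mesh and the timing are illustrative, and only call submission time is measured):

    mesh.onBeforeRender = function () {
        // Suppress three.js' own draw for this object.
        this.geometry.setDrawRange( 0, 0 );
    };

    mesh.onAfterRender = function ( renderer ) {
        const gl = renderer.getContext();
        const t0 = performance.now();
        // Issue the candidate raw calls here: drawElements in a loop,
        // multiDrawElementsWEBGL, multiDrawElementsInstancedWEBGL, ...
        const t1 = performance.now();
        console.log( 'draw submission took', t1 - t0, 'ms' );
    };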