Very slow import when scene has big meshes due to mesh LOD generation

PZerua commented 2 years ago

Godot version

4.0.alpha14

System information

Windows 11, Intel i7-10750H, Nvidia RTX 2060 Laptop (511.65), Vulkan

Issue description

Godot takes a long time to import a scene with big meshes. To better understand the problem, I ran some tests with the new "Colorful Curtains" (without Base Scene) from Intel's new Sponza scene. The scene has twelve 4K textures and several meshes that add up to a total of 1,059,862 vertices. While Blender takes only ~7 seconds, Godot takes ~1 minute and 3 seconds to import it. I spent some time investigating the causes with a profiler and found that, of that import time:

~50 seconds are spent on LOD generation
~13 seconds are spent on Texture import

So I'd say the main issue is with LOD generation. Two observations:

Using meshoptimizer, we start with a target of 12 indices (the last LOD, with 4 triangles) and multiply this number by two until we reach 75% of the total indices. This works well for small meshes, but in this case a single mesh can have up to ~100,000 vertices and ~600,000 indices, generating 12 LODs. Most meshes in this scene generate around 10 LODs each. This process takes 17 seconds of the total time.
For each LOD, Godot performs a "normal reconstruction" where between 16 and 64 rays are cast from the face of each new triangle onto the original mesh, so that the new triangles' normals preserve an approximation of the original direction. This reconstruction takes a very long time because, in my test scene, it requires casting a total of 10,739,327 individual rays using Embree. This process takes 33 seconds of the total time.
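
The doubling rule in the first observation can be sketched as follows. This is a minimal illustration, not the importer code: the function name and parameters are hypothetical, and it assumes a target is attempted only while it stays below 75% of the mesh's index count. It counts candidate targets only, so under this reading it is an upper bound; the importer may keep fewer LODs than this (the issue reports 12 for the largest mesh).

```python
def candidate_lod_targets(total_indices, start=12, step=2, cutoff=0.75):
    """Return the index targets visited by the doubling rule.

    Assumption: a target is attempted only while it is below
    cutoff * total_indices. The real importer may skip some of them.
    """
    targets = []
    target = start
    while target < cutoff * total_indices:
        targets.append(target)
        target *= step
    return targets


small = candidate_lod_targets(3_000)    # a small mesh
large = candidate_lod_targets(600_000)  # roughly the largest mesh in the issue
print(len(small), small[-1])
print(len(large), large[-1])
```

Because the target doubles each time, a mesh 200 times larger only adds about eight more candidate targets, but every one of them is simplified from the full-resolution mesh.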

Some possible changes I can think of to make it faster:

Make "Number of LODs" a mesh import setting, so we could have a default of 4 or 6 and let the user decide the total number. I think that's what Unreal 4 and Unity do.
Make the initial target of 12 indices and/or the target index multiplication step (2 right now) dependent on the total number of indices, so that both smaller and larger meshes end up with a similar number of LODs.
Right now we go from the smallest index target to the largest, and the original mesh is always used as the base when a new LOD is created. Instead, we could start with the largest index target and use the previous LOD as the base for each new one. I remember reading that this is more efficient but gives slightly worse results, though I can't recall where.
Use meshopt_simplifySloppy, which seems to be about 6.6x faster, and test whether it produces big perceptible differences and whether it's worth changing or not.
Explore other ways to do the "normal reconstruction" or reduce the number of rays per triangle.
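
As a rough sketch of the second suggestion (a step that depends on the total index count), the step can be chosen so that a fixed number of targets spans the range from 12 indices up to 75% of the mesh. The function name, the default values and the rounding to whole triangles are assumptions made for illustration, not existing Godot code.

```python
def adaptive_targets(total_indices, lod_count, start=12, cutoff=0.75):
    """Spread `lod_count` index targets geometrically between `start`
    and cutoff * total_indices (hypothetical helper)."""
    upper = cutoff * total_indices
    if lod_count < 1 or upper <= start:
        return []  # mesh too small to need any LOD
    if lod_count == 1:
        return [start]
    # Step chosen so that start * step**(lod_count - 1) is about `upper`.
    step = (upper / start) ** (1.0 / (lod_count - 1))
    # Round down to a multiple of 3 so each target is a whole number of triangles.
    targets = [int(start * step ** i) // 3 * 3 for i in range(lod_count)]
    # Very small meshes can produce repeated targets; keep each one once.
    return sorted(set(targets))


print(adaptive_targets(600_000, 6))
print(adaptive_targets(3_000, 6))
```

With this approach a small mesh and a large mesh both get at most `lod_count` targets; only the spacing between them changes.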

I explained the issue a bit over Rocket.chat a few days ago, but I'd like some discussion on this before I attempt a fix (if I'm capable).

Steps to reproduce

Download "Colorful Curtains" or any scene from the new Intel Sponza. Move the glTF scene into the project folder and observe that it takes a very long time to import.

Minimal reproduction project

No response

fire commented 2 years ago

The largest cost in your numbers is casting a total of 10,739,327 individual rays using Embree. This process takes 33 seconds of the total time. Is there a way to improve this?

The smaller issue is the number of lods and the bigger issue is the normal reconstruction.

Sloppy simplify didn't give edge lengths, so I don't think we can use it. I can check again.

fire commented 2 years ago

Did some thinking. One cheap thing we can do is start from the last lod rather than from the start. The code to do this is relatively small.

Do you want to make a pr for that?

PZerua commented 2 years ago

Hi, sorry for the delay.

> The largest cost in your numbers is casting a total of 10,739,327 individual rays using Embree. This process takes 33 seconds of the total time. Is there a way to improve this?

I agree that is the bigger problem, but I haven't researched enough to come up with a solution or an alternative approach. I did test calling rtcIntersect1M once for all the rays instead of rtcIntersect1 for each single ray, both with RTC_INTERSECT_CONTEXT_FLAG_COHERENT and RTC_INTERSECT_CONTEXT_FLAG_INCOHERENT, but I noticed no difference. I have no prior experience with Embree, so it might be worth trying again in case I did something wrong. Maybe @JFonS has some input on this and can propose some alternatives.

> The smaller issue is the number of lods and the bigger issue is the normal reconstruction.

The thing is that the total number of rays is directly related to the number of LODs (and the number of indices in each LOD), so if we agree that 10 or more LODs are too many and aim for a maximum of 6 or 8, that would help with both issues.
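
To make that relationship concrete, here is a back-of-the-envelope estimate. It only uses the figure from the issue description (between 16 and 64 rays per new triangle); the function name and the example LOD sizes are made up for illustration, and real LODs will not land exactly on their index targets.

```python
def ray_bounds(lod_index_counts, min_rays=16, max_rays=64):
    """Lower and upper bound on rays cast for one mesh, given the
    index count of each generated LOD (hypothetical helper)."""
    triangles = sum(count // 3 for count in lod_index_counts)
    return triangles * min_rays, triangles * max_rays


# Assumed LOD sizes: 16 targets doubling from 12 indices,
# versus 6 targets growing by a factor of 8.
many_lods = [12 * 2 ** i for i in range(16)]
few_lods = [12 * 8 ** i for i in range(6)]

print(ray_bounds(many_lods))
print(ray_bounds(few_lods))
```

The ray count scales with the total number of triangles across all LODs, so both the number of LODs and the size of each one matter.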

> Did some thinking. One cheap thing we can do is start from the last lod rather than from the start. The code to do this is relatively small.
>
> Do you want to make a pr for that?

Yeah, I also thought of the same thing. This will speed up LOD generation when calling meshopt_simplify (although I'm not sure by how much), but it won't help with the total ray count. I can give it a try in a few days.

fire commented 2 years ago

I can't promise anything, but if you're around I can show you where the code for starting from the last LOD is.

https://github.com/godotengine/godot/blob/master/scene/resources/importer_mesh.cpp#L453

The theory is that, instead of using merged_indices_ptr as the input every time, you use the new_indices produced by the previous iteration of the while loop.
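
A toy sketch of that idea follows. `fake_simplify` is a hypothetical stand-in that just truncates the index list, not meshopt_simplify, and "work" is measured as the number of input indices handed to the simplifier, which is only an assumed proxy for its real cost.

```python
def fake_simplify(indices, target):
    """Hypothetical stand-in for a simplifier: keeps the first `target`
    indices, rounded down to whole triangles. Not a real simplifier."""
    keep = min(len(indices), target) // 3 * 3
    return indices[:keep]


def lods_from_original(indices, targets):
    """Behaviour described in the thread: every LOD is built from the
    full-resolution mesh, smallest target first."""
    work, lods = 0, []
    for target in sorted(targets):
        work += len(indices)  # simplifier input size
        lods.append(fake_simplify(indices, target))
    return lods, work


def lods_chained(indices, targets):
    """Proposed order: largest target first, each LOD built from the
    previous LOD instead of the original mesh."""
    work, lods, base = 0, [], indices
    for target in sorted(targets, reverse=True):
        work += len(base)
        base = fake_simplify(base, target)
        lods.append(base)
    return lods, work


mesh = list(range(6000))
targets = [12, 24, 48, 96]
_, work_original = lods_from_original(mesh, targets)
_, work_chained = lods_chained(mesh, targets)
print(work_original, work_chained)
```

Only the first, largest LOD touches the full mesh in the chained version. The trade-off mentioned earlier in the thread still applies: simplifying an already simplified mesh may give slightly worse results.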

PZerua commented 2 years ago

Hi, sorry for the delay, I've been quite busy at work.

Still want to work on this and I think I have an idea of how to implement it, but not sure when I'll have time to do it.

Also, I spent some time trying to better understand the context of the "normal reconstruction", and saw the discussion you had here: https://github.com/zeux/meshoptimizer/issues/158. So my understanding is that "normal reconstruction" is currently a workaround for that issue and we should just wait for it to be fixed on meshoptimizer's side, although it may be worth checking faster approaches in the meantime.

PZerua commented 1 year ago

Looks like we might get a fix for the wrong normals after simplification: https://github.com/zeux/meshoptimizer/pull/524. Hopefully this will make it possible to remove all the calls to Embree and make import much faster.

zeux commented 1 year ago

I'll note that it's unclear if the pending work in the linked meshoptimizer PR will allow Godot to change its simplification strategy - meshoptimizer version used in Godot right now has some patches to enable attribute awareness, but they were likely insufficient to get good normal quality in certain cases which is why the reprojection code exists. My goal is to improve on the patches currently used in Godot (they have some quality bugs that are critical to resolve before I can merge anything), but I don't know if the improvement is going to be sufficient to just rely on output of meshoptimizer directly in all cases.

fire commented 11 months ago

One of the bottlenecks is tangent space normal generation which is being worked on here https://github.com/godotengine/godot/pull/83648

zeux commented 3 months ago

Should be improved by #93727 (still some work to do in the future wrt reordering LOD generation from large to small).
