godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License

Importing a large glb file (778 MB) that contains 800 models crashes the editor. #93587

Open · AllenDang opened 1 week ago

AllenDang commented 1 week ago

Tested versions

4.2 stable

System information

macOS 14.5 - forward+ - godot 4.2 stable

Issue description

Importing a large glb file (778 MB) that contains 800 models crashes the editor.

Steps to reproduce

  1. Create a new project.
  2. Drag and drop the large glb file into the editor.

Minimal reproduction project (MRP)

Here is the glb file: https://drive.google.com/file/d/1f74-29422AmZQJohng74ySdELGJptgSA/view?usp=sharing

fire commented 1 week ago

Can you check 4.3? The CoW (copy-on-write) data size limit was increased to a larger value there.

Sluggernot commented 1 week ago

Tried on the latest from GitHub (4 or 5 days ago). The editor hangs on import. Restarting the editor automatically restarts the import, which hangs again. For some reason my Attach to Process keeps getting disconnected, and reattaching it doesn't show me the Call Stack. (Mind currently blown.) Just pulled the latest and am recompiling.

lvcivs commented 1 week ago

I tried this on 4.3.beta2.official and although it was very slow, it did eventually load after about 6 minutes (during the whole time it appeared stuck at 0%).

Opening the scene took a couple more minutes. This was on Ubuntu 24.04. Edit: Godot uses about 9 GB of RAM with this scene open.

AllenDang commented 1 week ago

@lvcivs I created this file just for testing purposes; I wanted to see how Godot would handle it :P

JekSun97 commented 1 week ago

After transferring the model to Godot 4.3 beta2, it still didn't load for me; I waited 28 minutes, then closed it. I also tested this in Blender 3.6.2: after 3 minutes Blender closed itself, which didn't happen with Godot.

Godot v4.3.beta2 - Windows 10.0.19045 - Vulkan (Mobile) - dedicated Radeon RX 560 Series (Advanced Micro Devices, Inc.; 31.0.14001.45012) - Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 Threads)

fire commented 1 week ago

The next step is to get profiles of the load.

My recommendation is to use either https://github.com/mstange/samply or https://superluminal.eu/

Sluggernot commented 1 week ago

Yes, I have been able to load the file. I did some quick benchmarking with Visual Studio and have made a couple of very small efficiency improvements locally. I need to benchmark the before and after once I have some really solid changes. The main finding is that _parse_meshes is the main function involved in loading this file. My changes are to GenerateSharedVerticesIndexList, plus one small one to static SVec3 GetPosition().

fire commented 1 week ago

I will try to review any pull requests that improve load times on the ~778 MB glb with nothing broken.

Sluggernot commented 1 week ago

Oh... nothing broken? Ah, never mind then. Really though, yes, my first challenge is proving that it is faster. Thanks!

Sluggernot commented 1 week ago

OK, I didn't know GitHub would add these comments from my own fork because I referenced the issue in the description. I will avoid that in the future.

zeux commented 6 days ago

Since I ended up looking into this a little bit, I'll share my findings in hopes that it will help.

Measured by clicking "Reimport" on the scene in an otherwise empty project, --verbose says import took 276 seconds (that's a little under 5 minutes). Note that the scene has ~800 meshes that add up to ~39.3M triangles (~50k each, looks reasonably uniformly distributed). Overall I would have expected one mesh per scene here, but I'm not familiar with how Godot workflows work, and it's a good stress test regardless.

perf profile on Linux (editor build with default settings plus -fno-omit-frame-pointer) -- please note that the timings add up to 45% (perf doesn't normalize them):

[screenshot: perf profile]

Renormalizing the percentages by dividing by 0.45, and focusing on significant underlying components, we get:

In aggregate, LOD generation takes ~78% here, so definitely good to focus on that. When looking at something like a 5-minute import though, my expectations are usually that small gains are not terribly exciting, so something more significant needs to happen.

A note on the scale here: each mesh gets approximately 6 LOD levels generated. The work for meshopt_simplify scales with that; the work for normal reprojection scales with the total number of rays, which scales with the total number of triangles across all LODs times the area factor - it looks like we cast 16..64 rays per triangle, which is a lot of rays :)
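As a rough back-of-envelope (assuming ~1.5x triangle reduction per LOD step, so the LOD chain sums to roughly 2x the base triangle count; both factors are assumptions on my part, not measurements):

$$
\underbrace{39.3 \times 10^6}_{\text{base triangles}} \times \underbrace{\approx 2}_{\text{sum over LOD levels}} \times \underbrace{16\ \text{to}\ 64}_{\text{rays per triangle}} \approx 1\ \text{to}\ 5 \times 10^9\ \text{rays}
$$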

If I were tackling this problem, I would entertain the following projects:

  1. For scenes with many large meshes like this one, my first goal would be to process meshes in parallel (see the sketch after this list). I'm not familiar with the details of the ImporterMesh code, but superficially nothing should prevent generating each mesh fully in parallel. Maybe that requires refactoring some of this code to actually be thread-safe. It would also require making sure that the dependent code is thread-safe internally - meshopt definitely is, and I assume Embree is too, but some care would be required. That alone would probably get this under a minute on an 8-core system if we discount tangent space generation.

  2. I'm skeptical that tangent space generation is efficient here. For a sense of scale, meshopt_simplify does a fair bit more work per call, is called ~6 times per mesh, and still only takes twice as much time. I would assume tangent space generation has internal algorithmic inefficiencies and could be improved, but I haven't looked at that code myself.
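As a rough illustration of (1): a minimal sketch of per-mesh parallelism. ImportedMesh and generate_lods_for_mesh() are hypothetical stand-ins for the real ImporterMesh data and per-mesh work, and inside Godot this would presumably go through the engine's own worker pool rather than raw std::thread:

```cpp
// Minimal sketch: finalize each mesh (LODs etc.) independently on worker threads.
// ImportedMesh and generate_lods_for_mesh() are hypothetical placeholders.
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct ImportedMesh {
    // Placeholder for vertex/index/LOD data.
    std::vector<float> positions;
    std::vector<unsigned int> indices;
};

void generate_lods_for_mesh(ImportedMesh &mesh) {
    // Placeholder for the real per-mesh work; it must not touch shared mutable
    // state for this parallelization to be safe.
    (void)mesh;
}

void generate_all_lods(std::vector<ImportedMesh> &meshes) {
    std::atomic<std::size_t> next{0};
    unsigned worker_count = std::thread::hardware_concurrency();
    if (worker_count == 0) {
        worker_count = 1;
    }

    std::vector<std::thread> workers;
    workers.reserve(worker_count);
    for (unsigned i = 0; i < worker_count; ++i) {
        // Dynamic scheduling: mesh sizes vary, so workers pull the next index
        // instead of being handed fixed slices.
        workers.emplace_back([&]() {
            for (std::size_t m = next.fetch_add(1); m < meshes.size(); m = next.fetch_add(1)) {
                generate_lods_for_mesh(meshes[m]);
            }
        });
    }
    for (std::thread &t : workers) {
        t.join();
    }
}
```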

I would not advise trying to optimize the internals of meshopt_simplify (trust me...). Some small future performance improvements are planned in meshoptimizer, but broadly speaking, unless this runs into some edge case - which it doesn't look like it does to me - it should already be very well tuned. The same goes for Embree: I would assume it's impractical to optimize it to the degree that would be relevant here. However:

  3. I would certainly think about, at a minimum, reducing the amount of work requested from both meshopt_simplify and Embree here. Notably, meshopt_simplify is called approximately 6 times per mesh and is asked to generate larger and larger meshes. Because of this, each call does more or less the same amount of work: simplifying the mesh 2x is almost the same effort as simplifying it 10x (... well, not quite, but it gets there quickly). However, in LOD chain generation you can usually generate the LODs in the opposite direction: start by requesting a ~1.5x smaller mesh; if that target is reached, ask for a ~1.5x smaller mesh again, and so on (see the sketch after this list). I don't recall why the order here is reversed, but I would consider flipping it and simplifying from the last LOD. I don't think that will reduce the work 6x, but I would expect something like a 3-4x improvement in the cost of calling simplify.

  4. In a similar vein, casting 16-64 rays per triangle is a lot, especially for the higher levels of detail. I would probably reduce this in general, or at least scale it down as the LOD levels get closer to the original mesh: in the limit, we're casting at least 16 rays per triangle for something that has only 1.5x fewer triangles than the original mesh, which just feels wasteful. This does carry a risk of reducing the quality of the resulting normals, because there's a higher chance of missing the mesh or hitting the wrong triangle. Maybe ray casts aren't the right fit here, and averaging the normals of triangles that fall within a bounding sphere of the generated triangle would be better, but that brings me to my final point:

  5. We've already discussed this at some point in another issue, but overall I'm not 100% sure the current normal processing in the importer for LODs is generally beneficial. With the normal-aware simplifier and the recent fixes, I'd generally expect decent normals to come out of the simplifier itself. Sometimes that's not the case, but I'm not sure the ray-cast logic is perfect either, and it's a lot of complexity to always keep in mind. I do think the reindexing that happens in this code is beneficial for some faceted meshes, though. So a good use of time might be to introduce an option for normal reprojection that disables the ray-cast-based normal recreation (I'd expect that alone cuts half of the overhead of LOD generation here), test the option in a release, then maybe default it to skipping the normal recreation and see if this comes up.
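To make (3) concrete, here is a sketch of generating the chain incrementally, where each level simplifies the previous one. build_lod_chain(), the 1.5x step, and the error threshold are illustrative only, not Godot's actual import code, and the real importer would still need its attribute/normal handling on top of this:

```cpp
// Sketch: build LODs "coarse from previous" so each meshopt_simplify call only
// does an incremental amount of work, instead of re-simplifying the full mesh
// for every level. Positions are assumed to be tightly packed xyz floats.
#include <cstddef>
#include <vector>

#include "meshoptimizer.h"

std::vector<std::vector<unsigned int>> build_lod_chain(
        const std::vector<unsigned int> &base_indices,
        const std::vector<float> &positions,
        std::size_t vertex_count,
        int lod_count) {
    std::vector<std::vector<unsigned int>> lods;
    lods.reserve(lod_count);

    const unsigned int *source = base_indices.data();
    std::size_t source_count = base_indices.size();

    for (int i = 0; i < lod_count; ++i) {
        // Ask for ~1.5x fewer indices than the previous level.
        std::size_t target = static_cast<std::size_t>(double(source_count) / 1.5);
        target -= target % 3; // keep whole triangles

        std::vector<unsigned int> lod(source_count);
        float result_error = 0.0f;
        std::size_t count = meshopt_simplify(lod.data(), source, source_count,
                positions.data(), vertex_count, sizeof(float) * 3,
                target, /* target_error */ 1e-2f, /* options */ 0, &result_error);
        lod.resize(count);

        if (count >= source_count) {
            break; // no progress; further levels would be identical
        }

        lods.push_back(std::move(lod));
        // The next iteration simplifies this LOD rather than the original mesh.
        source = lods.back().data();
        source_count = count;
    }
    return lods;
}
```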

Hopefully this is helpful :) I would be happy to discuss (3)/(5) further and/or maybe contribute a patch or two, as I'm generally interested in making sure the simplification integration works well for Godot; I'll leave (1)/(2)/(4) to others if they are motivated to work on this.

zeux commented 5 days ago

On "I'm not 100% sure the current normal processing in the importer for LODs is generally beneficial", I decided to do a quick comparison on the scene from this file. It looks like it's easy to disable normal override, basically just need to disable the ray caster creation (as mentioned earlier, I believe current splitting logic to be generally beneficial for faceted meshes). I then look at a few low LODs (where the risk of picking a bad normal due to ray casts is maximized), by tuning the LOD bias to be a very small value.

On the left (yes, left, I double-checked!) is the import without the raycaster. On the right is current master (raycaster enabled). Both levels are at ~2200 triangles. I see somewhat similar issues on a few other models - this is not universal; this happened to be the first model I checked, and some models from this scene look about the same with or without the raycaster. But to me this is strong evidence that the raycaster should be optional, and probably opt-in.

[screenshots: left - raycaster disabled, right - current master (raycaster enabled)]

I've switched to a smaller version of the scene from the original post (that one has 800 meshes, but each mesh is duplicated 8 times; the deduplicated version has only 100 meshes, which is easier to work with and faster to reimport). Reimport takes 37 seconds on master and 22 seconds with the raycaster disabled.

Sluggernot commented 5 days ago

Wow, well, that is surprising. Are there any examples where the raycaster gave better visual fidelity? (I understand that's somewhat subjective, but your screenshot above feels fairly objective as to which is "better.") I've been diving further into this section of code throughout the day, trying to rally myself before attempting multithreading. I really appreciate your write-up. This is absolutely great to see!

fire commented 5 days ago

As someone who works on this area, I support changes that improve quality and performance, and can review and help test.