Performance for viewing Datsville

I noticed a number of issues described on the wiki. These issues have been a helpful reference when developing my own libraries and programs. Many of these can be addressed with new tools or updates to existing tools.

POV-Ray requires at least 16 GiB of RAM to process and render the model currently.

This is likely due to the studs. This can hopefully be improved by instancing the studs themselves in my Blender importer. Memory usage is much lower when not using logos on studs.

LDView requires several minutes to fully load Datsville into memory even on a modern computer.

LeoCAD takes a very long time as well. This should definitely be fixable at least for LeoCAD after fixing some unoptimized loading code. See https://github.com/leozide/leocad/issues/850.

In LDView I get framerates in the single digits when viewing Datsville. Sometimes less than 1 FPS.

I can easily get double-digit framerates when not fully zoomed in with ldr_wgpu, even on my M1 MacBook Air. Porting the code to C++ isn't really feasible, but it may be possible to improve the rendering performance of LeoCAD slightly.

I've tried converting Datsville to OBJ format and loading it into Blender. The program simply crashes.

It imports in seconds on my machine with the new Blender importer ldr_tools_blender. Blender tends to not have the fastest asset importers and they don't always work well with instanced geometry. Dedicated import addons can often be faster.

LDView's camera controls become less responsive when the camera distances are huge.

This could be performance related, or the camera transform updates may simply not be scaled when zoomed out. I noticed that LeoCAD zooms slowly on large scenes as well. I've implemented more intuitive zooming speed before, so I'll look into how difficult this would be to add to LeoCAD.
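
A minimal sketch of one way to scale zoom with distance, so that each scroll step changes the camera distance by a fixed percentage instead of a fixed amount. The function name and the `sensitivity` and `min_distance` parameters are hypothetical and are not taken from LDView, LeoCAD, or ldr_wgpu.

```rust
/// Returns the new camera-to-target distance after a scroll input.
/// `scroll_steps` is positive when zooming in.
/// `sensitivity` and `min_distance` are hypothetical tuning values.
pub fn zoom_distance(distance: f32, scroll_steps: f32, sensitivity: f32, min_distance: f32) -> f32 {
    // Multiplying (instead of subtracting a constant) makes every step a fixed
    // fraction of the current distance, so zooming feels the same at any scale.
    let scaled = distance * (-scroll_steps * sensitivity).exp();
    // Never let the camera reach or pass through the target.
    scaled.max(min_distance)
}
```

With a sensitivity of 0.1, one scroll step moves the camera roughly 9.5% closer whether it is 10 or 10,000 units away, which is what keeps large scenes responsive.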

LeoCAD's camera draw distance is too small for Datsville. Large parts of the model get clipped and become invisible.

The physical size of the model is problematic if the program doesn't have a sufficiently large far clip plane. I'm working on a new optimized LDraw renderer, ldr_wgpu, which uses an infinite far clip plane via a common depth buffer trick. I'll look into porting this to LeoCAD.
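
One widely used form of this trick is a reversed-Z projection with the far plane pushed to infinity. The sketch below assumes that is the variant meant here, which the thread does not confirm. It also assumes a right-handed view space, a 0 to 1 depth range, and column-major storage. The function names are illustrative only.

```rust
/// Column-major 4x4 perspective matrix with reversed depth and no far plane.
/// Assumes right-handed view space (camera looks down -Z) and depth in 0..=1.
pub fn infinite_reverse_z_projection(fov_y_radians: f32, aspect: f32, z_near: f32) -> [[f32; 4]; 4] {
    let f = 1.0 / (0.5 * fov_y_radians).tan();
    [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, 0.0, -1.0],
        [0.0, 0.0, z_near, 0.0],
    ]
}

/// Depth value produced for a point on the view axis at view-space `z`
/// (negative values are in front of the camera).
pub fn depth_at(m: &[[f32; 4]; 4], z: f32) -> f32 {
    // clip = M * (0, 0, z, 1); only the z and w rows matter for depth.
    let clip_z = m[2][2] * z + m[3][2];
    let clip_w = m[2][3] * z + m[3][3];
    clip_z / clip_w
}
```

Depth is 1.0 at the near plane and approaches 0.0 at infinity, so nothing is ever clipped by distance. A renderer using this needs the depth test set to "greater" and the depth buffer cleared to 0.0, and it works best with a floating point depth format.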

I have been working on a realtime LDraw renderer/editor in MonoGame/C# using Datsville as a performance benchmark. Datsville has helped me weed out errors in my Blender importer as well.

Although I get a decent framerate, I still hit a bottleneck because there is such a massive amount of data being processed per frame. I'm talking tens of millions of vertices and indices and around 60k objects on screen at ground level.

It also eats up about 5-6GB of RAM when everything is all loaded. This may be improved with some kind of streaming or deferred loading process, but fixing that is premature optimization at this point.

The single massive terrain mesh is over 8MB in triangles, the fence around the airport has a low framerate due to the number of elements that make up the fence, and the cornfield is something like 50k objects on screen at once. It is no surprise that when these are either not on screen or not loaded at all, framerates and import times improve.

I've experimented with a handful of optimization strategies. Surprisingly, replacing the meshes with simple colored cubes had almost no effect on framerate on my machine. Not importing studs had almost no effect on framerate either. Only when I started limiting the number of objects being drawn did things improve. This also leads me to believe LOD meshes will have minimal impact, so I'm not prioritizing figuring that out.

Lowering draw distances helps, but there are so many objects that once you start getting a decent framerate, you can't even see one end of the airport to the other.

The best outcome I've had is where I merge the parts of individual models in a post-processing step so that there aren't so many draw calls per frame, meaning there are far fewer entities to loop through each frame. I import every element as color code 16 and apply the correct color in a shader, which cuts down on import time and RAM usage. It also improves framerates because I can draw more objects before having to switch index and vertex buffers.
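
A rough sketch of what merging parts into one mesh involves. The `Mesh` type and `merge_parts` function are hypothetical, and for brevity each part is placed with a translation only. A real LDraw placement uses a full transform matrix.

```rust
/// A triangle mesh with positions and triangle indices (hypothetical type).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Mesh {
    pub positions: Vec<[f32; 3]>,
    pub indices: Vec<u32>,
}

/// Merges parts into a single mesh, baking each part's translation into its vertices.
pub fn merge_parts(parts: &[(Mesh, [f32; 3])]) -> Mesh {
    let mut merged = Mesh::default();
    for (mesh, offset) in parts {
        // Indices of this part must point past the vertices already merged.
        let base = merged.positions.len() as u32;
        merged.positions.extend(
            mesh.positions
                .iter()
                .map(|p| [p[0] + offset[0], p[1] + offset[1], p[2] + offset[2]]),
        );
        merged.indices.extend(mesh.indices.iter().map(|i| i + base));
    }
    merged
}
```

The merged mesh can be drawn with a single call. The trade-off is that merging duplicates vertex data for repeated parts, so it costs memory where instancing would save it.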

I ran into problems with backface culling where models are mirrored, the Town Hall specifically, but it resulted in a huge performance boost. I will look into correcting these models, even if by hand, to see if it is really worth pursuing. Even if it doesn't help with this task, mirroring models is not recommended in the LDraw docs, so it's a worthwhile fix to make.
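
Mirroring can also be handled in code instead of by editing the models. A transform with a negative determinant flips the winding order of every triangle it is applied to, which is why backface culling removes the wrong faces. The sketch below assumes transforms are baked into merged geometry and the 3x3 rotation/scale part is available. The function names are illustrative.

```rust
/// Determinant of the 3x3 rotation/scale part of a placement matrix.
pub fn determinant3(m: &[[f32; 3]; 3]) -> f32 {
    m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
}

/// Reverses triangle winding in place when the transform mirrors the geometry.
pub fn fix_winding(indices: &mut [u32], transform: &[[f32; 3]; 3]) {
    if determinant3(transform) < 0.0 {
        for tri in indices.chunks_exact_mut(3) {
            // Swapping two corners turns a clockwise triangle into a
            // counter-clockwise one and vice versa.
            tri.swap(1, 2);
        }
    }
}
```

When drawing with instancing instead of merged meshes, the same determinant check could be used to put mirrored instances in a separate batch drawn with the opposite front-face setting.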

I'm also going to rework the terrain to be less dense and to be chunked out so parts that are not on screen can be ignored.
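
As a sketch of the chunking idea, terrain triangles can be bucketed into square cells on the ground plane so that whole cells can be skipped when off screen. This assumes the ground plane is XZ (LDraw uses Y as the vertical axis). The function name and `chunk_size` parameter are hypothetical.

```rust
use std::collections::HashMap;

/// Groups triangles into square chunks on the XZ plane by triangle centroid.
/// Returns a map from chunk coordinate to the indices of the triangles in it.
pub fn chunk_triangles(
    positions: &[[f32; 3]],
    indices: &[u32],
    chunk_size: f32,
) -> HashMap<(i32, i32), Vec<usize>> {
    let mut chunks: HashMap<(i32, i32), Vec<usize>> = HashMap::new();
    for (tri_index, tri) in indices.chunks_exact(3).enumerate() {
        // Average the three corners to get the centroid on the ground plane.
        let (mut cx, mut cz) = (0.0f32, 0.0f32);
        for &i in tri {
            cx += positions[i as usize][0] / 3.0;
            cz += positions[i as usize][2] / 3.0;
        }
        // floor() keeps negative coordinates in the correct cell.
        let key = ((cx / chunk_size).floor() as i32, (cz / chunk_size).floor() as i32);
        chunks.entry(key).or_default().push(tri_index);
    }
    chunks
}
```

Each chunk can then get its own index range and bounding box, and only chunks whose box intersects the view frustum need to be drawn.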

I know next to nothing about HLSL shaders, so everything is a basic color, so there are no fancy effects slowing anything down.

Instancing, interestingly enough, didn't seem to improve framerates much at all. Primarily because the instances need to be organized every frame, which is an expensive operation. I've tried a heap and a modified octree, and both perform pretty similarly. There may be a better way to do it. I will do further research on this.

I really think with a little bit of work, we can make this work.

I already have what I would consider acceptable framerates and loading times using ldr_tools and ldr_wgpu. Even my MacBook Air gets decent performance with an infinite draw distance. I plan on making the renderer into a dedicated library for people to use at some point. The optimizations are all documented on the ldr_wgpu repository. I would strongly recommend just using the Rust code, since it's based on WebGPU and will work on most modern GPUs and the web eventually. It's also going to be hard to match the performance even in a modern game engine.

If you really want to use MonoGame, I'll try and summarize some tips to improve your performance. You can also compile and run ldr_wgpu to use for a performance comparison while making optimizations.

It also eats up about 5-6GB of RAM when everything is all loaded.

Use instancing to reduce memory usage. Any handmade LDraw scene should fit easily in GPU or system memory. There's not enough unique data to need streaming. ldr_wgpu uses less than 1 GB when loading Datsville. You can also try reducing the precision for colors and normals to save some space.
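
To illustrate the precision suggestion, here is a minimal sketch that packs a color and a normal into one byte per component. The function names are made up for this example, and it is not taken from ldr_wgpu or MonoGame.

```rust
/// Packs an RGBA color with components in 0.0..=1.0 into four bytes (8-bit unorm).
pub fn pack_color(rgba: [f32; 4]) -> [u8; 4] {
    rgba.map(|c| (c.clamp(0.0, 1.0) * 255.0).round() as u8)
}

/// Packs a unit normal into four signed bytes (8-bit snorm).
/// The fourth byte is padding so the attribute stays 4-byte aligned.
pub fn pack_normal(n: [f32; 3]) -> [i8; 4] {
    let quantize = |c: f32| (c.clamp(-1.0, 1.0) * 127.0).round() as i8;
    [quantize(n[0]), quantize(n[1]), quantize(n[2]), 0]
}
```

Stored as 32-bit floats, an RGBA color plus a normal takes 28 bytes per vertex. Packed this way they take 8. The GPU can expand them back to floats automatically if the vertex layout declares normalized 8-bit formats, such as `Unorm8x4` and `Snorm8x4` in wgpu.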

Surprisingly, replacing the meshes with simple colored cubes had almost no effect on framerate on my machine.

It sounds like you have too many draw calls. You can try instanced rendering to reduce the amount of overhead spent sending the drawing commands to the GPU. Whether it will actually make the GPU render faster or not depends on the scene.

I know next to nothing about HLSL shaders, so everything is a basic color, so there are no fancy effects slowing anything down.

You're probably bottlenecked by vertex processing. You can either reduce the vertex count or use vertex indices to reduce the number of points the GPU needs to calculate.
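
A minimal sketch of building an index buffer from a plain triangle list, assuming vertices are position-only and that exact bit-for-bit equality is good enough to detect duplicates. The function name is illustrative.

```rust
use std::collections::HashMap;

/// Converts a non-indexed triangle list into unique vertices plus an index buffer.
pub fn index_vertices(triangle_list: &[[f32; 3]]) -> (Vec<[f32; 3]>, Vec<u32>) {
    let mut lookup: HashMap<[u32; 3], u32> = HashMap::new();
    let mut vertices: Vec<[f32; 3]> = Vec::new();
    let mut indices: Vec<u32> = Vec::with_capacity(triangle_list.len());
    for v in triangle_list {
        // f32 is not hashable, so compare the raw bit patterns instead.
        let key = [v[0].to_bits(), v[1].to_bits(), v[2].to_bits()];
        let index = *lookup.entry(key).or_insert_with(|| {
            vertices.push(*v);
            (vertices.len() - 1) as u32
        });
        indices.push(index);
    }
    (vertices, indices)
}
```

Shared corners are then stored and transformed once instead of once per triangle. Vertices that differ slightly because of rounding will not be merged by this approach.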

Primarily because the instances need to be organized every frame, which is an expensive operation.

Is there a reason you need to do this every frame?

If you want to improve GPU performance, you'll need to use some sort of profiling tool to see what actually takes up most of the time. GPU performance is complex, and it's difficult to accurately measure from your own CPU code. Hardware manufacturers provide their own tools like Nsight Graphics (Nvidia), Radeon GPU Profiler (AMD), or the frame profiler built into Xcode for macOS.

If you really want to use MonoGame

I do. Nothing against any other project. This is how I relax. And it's nice to have a real problem to solve since my paying work has become fairly routine.

I'm not sure where the 5GB of RAM usage comes from. It is possible I'm not clearing some temporary lists, now that I'm thinking about it. The import process only creates one mesh object that is transformed by the unique objects that use the mesh, so it's not duplication of mesh data. It very well could be duplicate objects that only share a transformation difference. I will look into that. I might be able to adjust the precision of the normals and colors, but those calculations are done by MonoGame classes and they work, so I hesitate to venture in that direction.

Is there a reason you need to do this every frame?

I have an octree that is populated every frame and groups entities by part name. I pull the cached instance data from each entity and send that to the GPU. There are fewer draw calls, but more loops. Looking through it just now, I realized I can combine loops, which may improve performance.

I will definitely look into the debugging tools, so thank you for that.

I redid my instancing approach, and the difference is massive.

The world is scaled to 0.02 of the actual size.

At position -50, -150, -500 facing east looking at the Dennett Ave sign using an octree, I get 30FPS. With my new instancing approach, I get 60FPS.

My initial approach was to build the instance collection using only the visible entities. The rationale was that I only wanted to send instances that were on screen. This proved prohibitively expensive.

My new approach is to build the instance collection at import and draw every item in that collection regardless of whether it is on screen. This causes the GPU to draw every item in the collection, but it cuts the draw calls from 35k to about 500, which improves framerates considerably. I am going to explore pruning items that are not visible to improve things further.
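
The import-time grouping described above might look roughly like the following. This is a sketch, not the actual MonoGame code. `PlacedPart` and its fields are hypothetical, and transforms are plain 4x4 arrays.

```rust
use std::collections::HashMap;

/// One placed part in the scene (hypothetical type and field names).
pub struct PlacedPart {
    pub part_name: String,
    pub transform: [[f32; 4]; 4],
}

/// Groups transforms by part name once at import time.
/// Each map entry can then be drawn with a single instanced draw call.
pub fn build_instance_batches(parts: &[PlacedPart]) -> HashMap<&str, Vec<[[f32; 4]; 4]>> {
    let mut batches: HashMap<&str, Vec<[[f32; 4]; 4]>> = HashMap::new();
    for part in parts {
        batches
            .entry(part.part_name.as_str())
            .or_default()
            .push(part.transform);
    }
    batches
}
```

The number of draw calls becomes the number of unique parts instead of the number of placed parts, and since the batches are built once, nothing has to be reorganized per frame.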

My next experiment is to see what effect LOD meshes might have.

Not loading studs raises the framerates to 90FPS. I will definitely look into stud instancing.

Rendering only colored boxes instead of the actual meshes has the same effect on framerates as disabling studs.

I'm running into issues with backface culling when models are mirrored. I haven't started on solving that yet.

mjhorvath / Datsville
