tefusion / godot-subdiv

Fast Subdivision in Godot with opensubdiv
https://godotengine.org/asset-library/asset/1488
MIT License

Runtime subdivision performance is slow #20

Open: fire opened this issue 1 year ago

fire commented 1 year ago

Port the Metal shaders from OpenSubdiv? Unknown amount of work.

tefusion commented 1 year ago

Even if it's not that much work, I don't think it's worth it. OpenSubdiv does a lot of other handling that makes it possible to enable adaptive subdivision (not implemented yet; I'll probably add it as an editor setting) and that also made it easy to interpolate skinning and UVs. Porting all of that seems like too much work for the benefit, imho.

What I briefly looked at instead was BFR, which was newly released with OpenSubdiv 3.5 a month ago. I don't fully understand its downside ("repeated evaluation of a fixed set of points" is still better done with Far, which corresponds to the current baking implementation). The upsides I saw: you can set it to triangulate quads at the end, and it's apparently faster than the Far table-based solution, which we don't even use currently, so runtime subdivision should see a significant performance increase (SubdivMeshInstance3D with skinning is very laggy right now at high levels).

Judging from the tutorials I tried, it also didn't look too complicated to implement, so it might be worth trying instead. All of this is future work, though; I don't think I'll work much on this project myself in the near future beyond bug fixes and stabilization. I just needed reasonably quick subdivision for a project I'll continue working on now.

fire commented 1 year ago

I'm perfectly happy with what godot_opensubdiv has, too. We're using float=64, so I have to debug that crash. The MSVC compile-time bug isn't critical.

tefusion commented 1 year ago

Hi again! I'm currently using the topology data without subdivision for a character with lots of blend shapes, and I think I've narrowed down the two major performance problems.

1. Triangulation Code

There are a lot of problems with this one. First of all, it uses SurfaceTool internally and generates tangents every single time if the mesh has normals. Combined with the rest of the triangulation code, that takes a total of 76 ms for a mesh of around 10,000 vertices. Removing the tangent generation already brings it down to 14 ms. Removing SurfaceTool entirely and resizing the lists up front, so nothing has to be appended, halves that again. Caching the index array and a few other things could probably halve it once more. I won't implement this right away, but it's something I'll definitely do when I have the time.
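A minimal sketch of the "resize first, then fill" idea for the triangulation step. It assumes polygons are stored as flat arrays of per-face corner counts and corner indices; that layout and the name `fan_triangulate` are hypothetical, not godot-subdiv's actual data structures. It fan-triangulates each face, which is only correct for convex faces such as typical quads, and counts the output indices in a first pass so the buffer is allocated once instead of grown by appending.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat polygon layout: face_vertex_counts[f] is the number of
// corners of face f, and the corners of all faces are stored back to back in
// face_vertex_indices. Faces are assumed convex (fan triangulation).
std::vector<int32_t> fan_triangulate(const std::vector<int32_t> &face_vertex_counts,
                                     const std::vector<int32_t> &face_vertex_indices) {
    // First pass: count output indices so the buffer is sized exactly once.
    std::size_t index_count = 0;
    for (int32_t n : face_vertex_counts) {
        if (n >= 3) {
            index_count += static_cast<std::size_t>(n - 2) * 3;
        }
    }
    std::vector<int32_t> triangles(index_count);

    // Second pass: write indices by position instead of appending.
    std::size_t src = 0; // start of the current face in face_vertex_indices
    std::size_t dst = 0; // next write position in triangles
    for (int32_t n : face_vertex_counts) {
        for (int32_t k = 1; k + 1 < n; ++k) {
            // Fan around the face's first corner: (v0, vk, vk+1).
            // This keeps the polygon's original winding order.
            triangles[dst++] = face_vertex_indices[src];
            triangles[dst++] = face_vertex_indices[src + k];
            triangles[dst++] = face_vertex_indices[src + k + 1];
        }
        src += static_cast<std::size_t>(n);
    }
    assert(dst == triangles.size()); // every pre-allocated slot was filled
    return triangles;
}
```

For one quad `(0, 1, 2, 3)` followed by one triangle `(3, 2, 4)`, this yields the index buffer `0, 1, 2, 0, 2, 3, 3, 2, 4`. Since the index buffer only depends on topology, it is also a natural candidate for the caching mentioned above.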

2. Subdivision itself

Forget my earlier post: the new BFR is only really suitable as a replacement for adaptive subdivision, which imo looks ugly on most things and is only really suitable for faraway objects (I only looked at the demos, but e.g. a cube has lots of free spots).

Instead, we should use StencilTables, which would let us actually use all the different fast subdivision options OpenSubdiv has. There's a tutorial for StencilTables, so that shouldn't be too hard. Actually using them with the different GPU computation libraries might take longer, but all the official examples use them, so at least there's a lot of reference material.

fire commented 1 year ago

I think we can do an optimized implementation of what we use SurfaceTool for.

What are stencil tables?

tefusion commented 1 year ago

Stencils are used to factorize the interpolation calculations that subdivision schema apply to vertices of smooth surfaces. If the topology being subdivided remains constant, factorizing the subdivision weights into stencils during a pre-compute pass yields substantial amortizations at run-time when re-posing the control cage.

Factorizing the subdivision weights also allows to express each subdivided vertex as a weighted sum of vertices from the control cage. This step effectively removes any data inter-dependency between subdivided vertices : the computations of subdivision interpolation can be applied to each vertex in parallel without any barriers or constraint. The Osd classes leverage these properties by exploiting CPU and GPU parallelism.

from https://graphics.pixar.com/opensubdiv/docs/far_overview.html
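To make the quoted idea concrete, here is a minimal sketch of the pre-compute/apply split. It is a 1D analog, not OpenSubdiv's API: it uses uniform cubic B-spline subdivision of a closed polyline (the curve counterpart of Catmull-Clark) and scalar control values for brevity. `Stencil`, `StencilTable`, `build_stencils` and `apply_stencils` are hypothetical names. The pre-compute pass pushes weights rather than positions through every refinement level, so re-posing the cage afterwards only costs one weighted sum per refined point.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// A stencil expresses one refined point as a weighted sum of control points.
using Stencil = std::map<int, double>; // control index -> weight

// Hypothetical flat layout: stencil s uses the entries of indices/weights
// in the range [offsets[s], offsets[s] + sizes[s]).
struct StencilTable {
    std::vector<int> sizes, offsets, indices;
    std::vector<double> weights;
};

// Adds weight * src into dst, merging duplicate control indices.
static void accumulate(Stencil &dst, const Stencil &src, double weight) {
    for (const auto &entry : src) {
        dst[entry.first] += weight * entry.second;
    }
}

// Pre-compute pass: runs `levels` steps of uniform cubic B-spline subdivision
// on a closed polyline, but on stencils instead of positions.
StencilTable build_stencils(int control_count, int levels) {
    assert(control_count > 0 && levels >= 0);
    std::vector<Stencil> current(control_count);
    for (int i = 0; i < control_count; ++i) {
        current[i][i] = 1.0; // level 0: each point is just itself
    }
    for (int level = 0; level < levels; ++level) {
        const int n = static_cast<int>(current.size());
        std::vector<Stencil> next(2 * n);
        for (int i = 0; i < n; ++i) {
            const Stencil &prev = current[(i + n - 1) % n];
            const Stencil &self = current[i];
            const Stencil &succ = current[(i + 1) % n];
            // Vertex point: (P[i-1] + 6 P[i] + P[i+1]) / 8.
            accumulate(next[2 * i], prev, 1.0 / 8.0);
            accumulate(next[2 * i], self, 6.0 / 8.0);
            accumulate(next[2 * i], succ, 1.0 / 8.0);
            // Edge point: (P[i] + P[i+1]) / 2.
            accumulate(next[2 * i + 1], self, 0.5);
            accumulate(next[2 * i + 1], succ, 0.5);
        }
        current = std::move(next);
    }
    // Flatten into contiguous arrays, which are easy to hand to a CPU or GPU kernel.
    StencilTable table;
    for (const Stencil &s : current) {
        table.offsets.push_back(static_cast<int>(table.indices.size()));
        table.sizes.push_back(static_cast<int>(s.size()));
        for (const auto &entry : s) {
            table.indices.push_back(entry.first);
            table.weights.push_back(entry.second);
        }
    }
    return table;
}

// Runtime pass: each refined value depends only on control values, so the
// iterations of the outer loop are independent and could run in parallel.
std::vector<double> apply_stencils(const StencilTable &table,
                                   const std::vector<double> &control) {
    std::vector<double> refined(table.sizes.size(), 0.0);
    for (std::size_t s = 0; s < table.sizes.size(); ++s) {
        for (int k = 0; k < table.sizes[s]; ++k) {
            const int j = table.offsets[s] + k;
            refined[s] += table.weights[j] * control[table.indices[j]];
        }
    }
    return refined;
}
```

For a posed mesh, the table would be built once per topology and `apply_stencils` rerun every frame (per coordinate, or with a vector type instead of `double`). The independent outer loop is exactly what a thread pool or a GPU compute kernel would split up, which is the property the quoted docs highlight.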

The second part is the important one. Currently only simple subdivision algorithms are used here, with no real parallelism, and stencils would make it possible to use OpenCL/GLSL/... or CPU parallelism. I still don't fully know how to implement this well so that someone else could come along and use the library they like. My current bet is to do something like this