jannessm / quadric-mesh-simplification

Fast python implementation of the quadric mesh simplification algorithm from http://mgarland.org/files/papers/quadrics.pdf
MIT License

Performance | Mesh > 100K slow to process & memory hungry #7

Open FreakTheMighty opened 4 years ago

FreakTheMighty commented 4 years ago

@jannessm I know you're quite busy, but I just wanted to capture my findings when processing larger meshes. I've found that decimation grinds to a halt on meshes larger than roughly 100K triangles.

For example:

    import numpy as np
    import trimesh
    from quad_mesh_simplify import simplify_mesh  # package import as shown in the README

    sphere = trimesh.creation.icosphere(subdivisions=7, radius=1.0, color=None)
    simplified = simplify_mesh(np.array(sphere.vertices),
                               np.array(sphere.faces, dtype=np.uint32),
                               len(sphere.vertices) // 2)

Some informal measurements:

The closest comparison to this library that I have is a decimation implementation in OpenMesh. It doesn't have all the same features; for example, it doesn't seem to handle edge preservation as well, so it may not be a fair performance comparison. Nonetheless, it decimates this sphere in less than 1 second and, I believe, uses less than 1GB of memory.

FreakTheMighty commented 4 years ago

Here's a slightly more formal performance test:

import timeit

import numpy as np
import trimesh
from quad_mesh_simplify import simplify_mesh

def decimate_sphere(sphere):
    simplified = simplify_mesh(np.array(sphere.vertices),
                               np.array(sphere.faces, dtype=np.uint32),
                               len(sphere.vertices) // 2)

def test_large_object_decimation():
    for i in range(1, 7):
        sphere = trimesh.creation.icosphere(subdivisions=i, radius=1.0, color=None)
        t = timeit.timeit(lambda: decimate_sphere(sphere), number=1)
        print(f'Decimated {len(sphere.vertices)} vertices in {t} seconds')

Decimated 42 vertices in 0.0004005999999208143 seconds
Decimated 162 vertices in 0.0032245000002149027 seconds
Decimated 642 vertices in 0.023680500000409666 seconds
Decimated 2562 vertices in 0.33720790000006673 seconds
Decimated 10242 vertices in 6.09463759800019 seconds
Decimated 40962 vertices in 229.23157340999933 seconds

jannessm commented 4 years ago

Yes, I know the performance is quite bad. The problem is that I needed the ability to reduce the mesh to an exact number of vertices, not just to collapse every vertex whose error is under a certain threshold. So I keep the errors sorted and, on each iteration, remove the pair with the smallest error next. Organising the data efficiently for this needs a lot of memory. For example, I keep the errors in a heap, because I only ever need the smallest one; for the rest, I don't care how they are ordered. The original paper also reports better reduction times, but I couldn't reach them.
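As a rough illustration of that selection loop (a toy Python sketch of the pattern, not the actual Cython internals of this library):

    import heapq

    # Toy sketch: every candidate pair carries a precomputed error, and the
    # pair with the smallest error is always collapsed next.
    def greedy_select(pair_errors, num_vertices, target_vertices):
        heap = list(pair_errors)      # O(n) extra memory for the heap
        heapq.heapify(heap)           # build the heap once
        collapsed = []
        while num_vertices > target_vertices and heap:
            error, pair = heapq.heappop(heap)   # smallest error, O(log n) per pop
            collapsed.append(pair)              # "collapse" this pair
            num_vertices -= 1                   # one vertex disappears per collapse
            # in the real algorithm, the errors of neighbouring pairs would be
            # recomputed here and pushed back onto the heap
        return collapsed

    print(greedy_select([(0.5, (0, 1)), (0.1, (1, 2)), (0.3, (2, 3))],
                        num_vertices=4, target_vertices=2))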

Also, I needed the exact interpolation between the feature vectors of the two collapsed vertices. Because of that, I couldn't use the closed-form solution for the new optimal vertex position. Instead, I have to try 10 candidate positions per edge and pick the best one as the new position.

There are implementations that neither precalculate the errors nor sort them; they just merge every vertex pair whose error is below a given threshold. This is faster and does not need as much memory, but it is not as accurate (see here).

But since that would be a completely different implementation, I just implemented everything according to my needs.

FreakTheMighty commented 4 years ago

I think I follow the first point. Your use case requires hitting a specific vertex count with high accuracy, so you need to re-sort after every collapse to ensure you truly are removing the highest-error nodes until you hit your target.

For your second point, it sounds like the calculated output features are interpolated from the values of the collapsed source vertices and the new vertex position. I'm not clear on why you need to "try 10 times each edge." Does this impact the performance when there are no features provided?

One thing I was curious about was the impact of threshold, if any, on performance. I thought increasing the threshold might reduce the number of vertices that are compared within a given mesh, but I don't think I'm seeing any performance difference when playing with that parameter.
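For reference, here is roughly how I've been passing it (using the threshold keyword argument as shown in the README; the value 0.5 is arbitrary):

    import numpy as np
    import trimesh
    from quad_mesh_simplify import simplify_mesh

    sphere = trimesh.creation.icosphere(subdivisions=5, radius=1.0)

    # same call as before, but with a non-zero threshold
    simplified = simplify_mesh(np.array(sphere.vertices),
                               np.array(sphere.faces, dtype=np.uint32),
                               len(sphere.vertices) // 2,
                               threshold=0.5)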

BTW, please feel free to close this if you feel this issue doesn't align with the goals of your library.

jannessm commented 4 years ago

Actually, I am removing the lowest-error pairs, because the error measures how much the shape is altered by the collapse.

10 is just an arbitrary number I chose. As described in the paper, the optimal new position is the solution of a linear problem; to do the maths, a matrix has to be invertible, which is not always the case. And because I would also need to find out how the two original vertices are "mixed" into the new one (to interpolate the features), I skip this step, which would be roughly O(1) when the matrix is invertible, and instead always evaluate 10 linear combinations of the original vertices, i.e. O(10) per pair. This is one downside of my implementation. There would be the opportunity to use the exact solution in the case where no features are provided, which might speed up the reduction a bit.
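To make the difference concrete, here is a rough numpy sketch of the two strategies, assuming Q is the summed 4x4 quadric of a pair and v1, v2 are its endpoints in homogeneous coordinates (this illustrates the idea from the paper, not my actual code):

    import numpy as np

    def quadric_error(Q, v):
        # error of placing the merged vertex at homogeneous position v: v^T Q v
        return float(v @ Q @ v)

    def optimal_position_exact(Q):
        # closed form from the paper: replace the last row of Q with [0, 0, 0, 1]
        # and solve the linear system; only possible if that matrix is invertible
        A = Q.copy()
        A[3] = [0.0, 0.0, 0.0, 1.0]
        return np.linalg.solve(A, np.array([0.0, 0.0, 0.0, 1.0]))

    def optimal_position_sampled(Q, v1, v2, samples=10):
        # fallback that also tells you how v1 and v2 are mixed (needed for
        # feature interpolation): evaluate `samples` linear combinations of
        # v1 and v2 and keep the one with the lowest error
        candidates = [(1.0 - t) * v1 + t * v2 for t in np.linspace(0.0, 1.0, samples)]
        return min(candidates, key=lambda v: quadric_error(Q, v))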

Currently, the threshold only controls whether vertices that are not linked by an edge are allowed to collapse (see the paper). Increasing it therefore increases the number of "valid pairs" that may be collapsed (so, if anything, a larger threshold means more candidate pairs to evaluate, not fewer).

Lastly, it is a current limitation that very large meshes cannot be reduced quickly, so I keep this issue open. ;)

Thank you for your huge interest and support!

FreakTheMighty commented 4 years ago

I was thinking about the accuracy vs. performance trade-off you mentioned. What if there were some sort of batch parameter that defaulted to 1? This parameter would indicate how many vertices are collapsed in each iteration before re-evaluating the heap.

By way of example, with a value of 1, the heap would be re-evaluated after every collapse, but with a value of 100, a batch of 100 vertices could be removed before re-evaluating.
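Roughly what I have in mind, as a toy sketch over a plain heapq heap of (error, pair) tuples (the re-evaluation step is only indicated by a comment):

    import heapq

    def collapse_in_batches(heap, num_to_remove, batch_size=1):
        # batch_size=1 reproduces the current behaviour; larger values trade
        # accuracy for fewer re-evaluations of the heap
        removed = []
        while len(removed) < num_to_remove and heap:
            take = min(batch_size, len(heap), num_to_remove - len(removed))
            batch = [heapq.heappop(heap) for _ in range(take)]
            removed.extend(pair for _, pair in batch)
            # errors of the pairs affected by this batch would be recomputed
            # and pushed back onto the heap here, once per batch
        return removed

    heap = [(0.5, 'a'), (0.1, 'b'), (0.3, 'c'), (0.2, 'd')]
    heapq.heapify(heap)
    print(collapse_in_batches(heap, num_to_remove=3, batch_size=2))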

I'm not sure if this sort of slider would be useful to you, but I think, without realizing it, this trade-off has become very useful to me. The high-quality results for smaller objects truly outperform the other libraries I've tried. On the other hand, as objects get larger, things become unusable and I need to switch to a different library.

jannessm commented 3 years ago

Sorry for the late response. The suggestion is worth thinking about, but I see one large problem: I am currently using a min-heap, so I need to "re-sort" it each time, meaning that I need to find the next minimum error after removing the current root (the minimum element). One could replace it with a fully sorted list, but that would make building the data structure during initialisation more expensive. Maybe one could implement batches > 1 by looking at the children (v1, v2) of the root, selecting the smaller one and contracting it (say v1), then comparing v2 to the children of v1 and contracting the smallest, and so on. But one might have to rebuild the tree structure after each batch.

What do you think?

jannessm commented 3 years ago

Currently, after each iteration O(log n) is needed to restore the heap. Good sorting algorithms need O(n log n) to sort an array, but they also need O(n) memory. The init process of the heap takes O((n/2) log n). A quick search on heap and tree data structures brought me to Wikipedia: https://en.wikipedia.org/wiki/Pairing_heap#Summary_of_running_times

So replacing the current heap with a pairing heap could be beneficial and simplify the batch implementation. If you want to, please give it a try ;).
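For anyone who wants to experiment, here is a minimal pairing-heap sketch in plain Python (an illustration of the data structure only; a real replacement would have to live in the Cython code):

    class PairingHeapNode:
        def __init__(self, key, value):
            self.key = key            # e.g. the quadric error
            self.value = value        # e.g. the vertex pair
            self.children = []

    def meld(a, b):
        # merge two pairing heaps in O(1): the root with the larger key
        # becomes a child of the other root
        if a is None:
            return b
        if b is None:
            return a
        if a.key <= b.key:
            a.children.append(b)
            return a
        b.children.append(a)
        return b

    def insert(root, key, value):
        # O(1) insert: meld a single-node heap into the root
        return meld(root, PairingHeapNode(key, value))

    def delete_min(root):
        # remove the minimum and merge its children with the standard
        # two-pass pairing (amortised O(log n))
        children = root.children
        paired = [meld(children[i], children[i + 1] if i + 1 < len(children) else None)
                  for i in range(0, len(children), 2)]
        merged = None
        for heap in reversed(paired):
            merged = meld(merged, heap)
        return (root.key, root.value), merged

    root = None
    for key, value in [(0.5, 'a'), (0.1, 'b'), (0.3, 'c')]:
        root = insert(root, key, value)
    smallest, root = delete_min(root)
    print(smallest)   # (0.1, 'b')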

jannessm commented 3 years ago

But as for the required memory, I don't see any opportunity to improve it much further.