javagl / Obj

A simple Wavefront OBJ file loader

ObjData buffer generation performance optimizations (about 2x as faster) #30

Open PhpXp opened 1 year ago

PhpXp commented 1 year ago

It's faster to create an array with the data and then bulk put() it into the buffer. The for loops were also optimized. The code was tested on an Android phone, and after the optimizations, buffers are created about 2x faster.
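Roughly, the pattern change looks like this (an illustrative sketch, not the exact diff from the PR; 'data' and 'buffer' are placeholder names):

// Before: one put() call per element
for (int i = 0; i < data.length; i++) {
    buffer.put(data[i]);
}

// After: compute the values into a float[] first, then transfer them
// with a single bulk put()
buffer.put(data);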

Public function signatures were changed; hopefully this isn't a big problem. For example, in getFaceVertexIndices(), you shouldn't be adding one value at a time to the buffer anyway - that's slow. Note that the code isn't final; the Javadoc still has to be updated.

javagl commented 1 year ago

I may have to take a closer look at the exact changes. From quickly skimming over them, they seem to be related to two things that one could call "high-level implementation decisions":


1:

When dealing with such "bulk" data, it can often be useful to have two flavors of functions: one that allocates a new buffer, and one that writes the result into a given buffer. For example, FloatBuffer vertices = something.getVertices(); as opposed to

FloatBuffer vertices = allocate();
something.getVertices(vertices);

An aside: It often makes sense to combine this into a "hybrid" function, like

// Fills the target and returns it
// When the target is null, then a new buffer is created, filled, and returned
FloatBuffer getVertices(FloatBuffer target) {
    FloatBuffer result = target;
    if (result == null) result = allocate();
    ...
    return result;
}

but this isn't the pattern that is used here.

You changed the function

public static void getVertices(ReadableObj obj, FloatBuffer target) { ... }
// to
public static void getVertices(ReadableObj obj, float[]     target) { ... }

First of all, that's a breaking API change and I won't accept this in this exact form. But of course it would be easy to work around that: There could be the original getVertices function, and - following the naming pattern that is already used in general - there could be another one like getVerticesArray(... float[] target).

The change is not only breaking, but also reduces the versatility. With the FloatBuffer one, you could have

FloatBuffer verticesTwice = allocate(12 * 2);
ObjData.getVertices(obj, verticesTwice);
ObjData.getVertices(obj, verticesTwice);

e.g. for filling the buffer with vertices from multiple OBJs. In order to "emulate" this possibility, you'd have to change the method to

public static void getVerticesArray(
    ReadableObj obj, 
    float[] target,
    int offset) // Add an 'offset' parameter, saying where to start writing into the array

So the FloatBuffer one is more flexible in this regard. And this flexibility leads to the other design option:


2:

When dealing with "bulk data of primitive values", then there usually are the options of using a primitive array like float[], or a buffer like FloatBuffer. And the aforementioned flexibility is related to the question (that I always try to think about, not only in this particular case):

When there are two design options A and B, which one can emulate the other?

And as a rule of thumb: when option A can easily emulate option B, but not the other way around, then A is usually the preferable design.

The point is (and you may have noticed this in the methods of the ObjData class that you modified):

A: When you have a method that fills a FloatBuffer target, then it's trivial to fill a float[] array with that:

float[] array = new float[size];
fill(FloatBuffer.wrap(array)); // wrap() creates a heap buffer view of the array

B: When you have a method that fills a float[] target array, then ... it can be a little bit quirky and inconvenient when you want to fill a FloatBuffer...

FloatBuffer buffer = alloc();
float[] temp = new float[size];
fill(temp, 0);
buffer.put(temp);

(allocating the data twice, just to shove it from the array into the buffer)

An aside: For buffers, there's the additional design dimension of whether you want heap-allocated buffers or 'direct' buffers. For anything that is related to OpenGL/native libraries, you usually want 'direct' buffers...
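For completeness, a typical direct-buffer allocation looks like this (standard java.nio: ByteBuffer, ByteOrder, FloatBuffer; 'numFloats' is just the required element count):

// A direct buffer in native byte order, as usually needed for OpenGL:
FloatBuffer direct = ByteBuffer
    .allocateDirect(numFloats * Float.BYTES)
    .order(ByteOrder.nativeOrder())
    .asFloatBuffer();

Note that FloatBuffer.wrap(array) always yields a heap buffer, so an array-based API can never hand out a direct buffer without an additional copy.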


A detail:

The float[] getValues(); method in the FloatTuple looks innocent. After all, "it's just a getter"....

But ... it makes the state mutable: a caller could do tuple.getValues()[1] = Float.NaN;. Until now, the FloatTuple was immutable.
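If such a getter is kept, the usual way to preserve immutability is a defensive copy (a sketch, assuming the internal array is called 'values'; it has the same per-call allocation caveat as the example below):

@Override
public float[] getValues() {
    // Return a copy, so that callers cannot mutate the internal state
    return values.clone();
}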

Beyond that: I know that it's very unlikely that there will ever be a different implementation of the FloatTuple interface than the DefaultFloatTuple. But imagine there was an implementation that did not store the values in a float[] array. If there was an implementation that was based on a FloatBuffer (or if I wanted to change the DefaultFloatTuple to store the values in a FloatBuffer for some reason), then this method would bite back, painfully:

    @Override
    public float[] getValues() {
        float[] result = new float[3]; // Potentially called millions of times...
        result[0] = data.get(0);
        result[1] = data.get(1);
        result[2] = data.get(2);
        return result;
    }

This would probably eat up many performance gains that one might achieve elsewhere...

This leads to the main point:


The main claim of this PR is to be a "performance optimization" and make things "2x as faster" (sic).

I doubt that :)

Performance tests are difficult. Always. And they are particularly difficult on a Virtual Machine. And they are even more difficult on a Virtual Machine that has a Just-In-Time-Compiler.

But as a very quick test (to be taken with a grain of salt for the above reasons), I created this:

import java.io.FileInputStream;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

import de.javagl.obj.Obj;
import de.javagl.obj.ObjData;
import de.javagl.obj.ObjReader;

public class ObjPerformanceTestPR30
{
    public static void main(String[] args) throws Exception
    {
        String s = "example.obj";
        Obj obj = ObjReader.read(new FileInputStream(s));

        double sum = 0;
        long before = System.nanoTime();
        for (int i=0; i<1000; i++)
        {
            IntBuffer indices = ObjData.getFaceVertexIndices(obj);
            FloatBuffer vertices = ObjData.getVertices(obj);
            sum += indices.get(0);
            sum += indices.get(1);
            sum += vertices.get(0);
            sum += vertices.get(1);
        }
        long after = System.nanoTime();
        System.out.println("Took "+(after-before)/1e6+" ms with sum "+sum);
    }

}

I ran this with some arbitrary ~10MB OBJ file that I had lying around here.

With master, the output for me was Took 2271.6398 ms with sum -7199.090003967285

With this branch, the output for me was Took 2494.5546 ms with sum -7199.090003967285

Of course, the results on another VM (e.g. on Android) may be completely different. But still, I'd like to see the claim that the new approach is "noticeably faster" (on most or some VMs) backed by examples...

PhpXp commented 1 year ago

Thanks for the long write-up. I expected a response like this; completely understandable. I'm surprised by your benchmark results. I'll do my own tests on Android again, hopefully later this week.

javagl commented 1 year ago

I'm curious how large the difference will be on a different VM. And I know that the "performance test" that I created there could be questioned in many ways. It follows the simplest form of the usual pattern for such basic benchmarks, namely: repeat the operation several (1000) times, to give the JIT a chance to kick in. But one could argue here: That may not be how these functions are usually used. When someone is extracting the data from an OBJ, then this may often be something that is done once - after the OBJ is loaded, and before the data is sent to the renderer.

So if there was one implementation with a large one-time cost (say, 500ms before the first call, e.g. for warm-up) but very fast subsequent calls, and another that was consistently moderate, then the first one would appear to be faster in such a benchmark, but ... that initial delay of 500ms may be undesirable.

Sooo.... all this is to be taken with a grain of salt. It may make sense to switch the FloatBuffer<->float[] uses at some places, but for now, the evidence that this is worth it is just not strong enough...

PhpXp commented 1 year ago

It seems like the performance depends on the platform (Android/Windows) and the size of the model. I ran benchmarks on multiple phones with a small file (about 500KB) and a big one (about 80MB).

The X axis shows individual consecutive runs, the Y axis is time in nanoseconds.

Results from Google Pixel 3 XL: [two benchmark charts]

Results from Asus Zephyrus G14 (AMD Ryzen 9 4900HS): [two benchmark charts]

The results from the Pixel 3 XL match the other phones I tried (Samsung Galaxy S22 (Exynos variant), Sony Xperia X Compact, Samsung Galaxy XCover 5). It seems like my optimizations are better only in specific cases, but they do make a big difference there.

javagl commented 1 year ago

Impressive. I didn't expect these details.

(And while I do have some experience with much of that (Java, performance testing, the workings of the JIT, a bit of JMH, VisualVM, JFR...), I have to admit that I do not have any knowledge about the specifics of Android. Only knowing that ~"there's that thing (called 'Dalvik' IIRC) that serves as the VM, and is not the OpenJDK/Oracle HotSpot VM" caused me to assume that the results may vary vastly...)

I assume that the charts that do not say "big model" are done with the small model.

Sooo... in the end, the trade-off here, summarized very roughly: it could be ~60% faster for small models on Android, but ~25% slower for large models on Desktop.

One important question: Are these runs with the test that I posted above, or did you create your own test/benchmark class? As I said: That test was only quickly ~"written down", to have a ballpark estimate. I wonder if one should try to create other test cases, or try out other patterns (like the get(targetBuffer) versions of the methods instead of the buffer = get() versions or so...)

I appreciate the effort that you put into all of this, but I currently don't exactly know what's the best way to proceed here. (Iff the slowdown for large models could be reduced (and API compatibility could be ensured), that would be a big plus, but there still are many unknowns for me...)

PhpXp commented 1 year ago

there's that thing (called 'Dalvik' IIRC) that serves as the VM

Dalvik is the old runtime. ART (Android Runtime) is the new one. Briefly: it combines an interpreter, a JIT, and AOT (ahead-of-time) compilation. There are also baseline profiles, which are lists of methods that are called frequently enough to be additionally optimized. If the device has Google Play Services installed, those profiles are shared between all devices globally. Pretty advanced stuff.

I assume that the charts that do not say "big model" are done with the small model.

That is correct.

A wild guess is that this could be caused by an increased GC load

I can confirm that this is the case. Garbage collections are reported in Logcat, and on a low-end device (XCover 5) the blocking GC can pause execution for upwards of 300ms. Clearly not optimal. I wonder if there is an easy solution to this: to avoid allocating a new array every time, but still get the benefits of the bulk put. Maybe a smaller array (like 128 bytes) put into the IntBuffer multiple times instead of the full-size array?
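A sketch of what that chunking could look like (the method name and chunk size here are made up, but the ReadableObj/ObjFace accessors are the library's real ones):

// Reuse a small scratch array and bulk-put it in slices, instead of
// allocating one full-size int[] for the whole model:
private static final int CHUNK = 1024; // illustrative size

static void putFaceVertexIndices(ReadableObj obj, IntBuffer target)
{
    int[] scratch = new int[CHUNK];
    int n = 0;
    for (int f = 0; f < obj.getNumFaces(); f++)
    {
        ObjFace face = obj.getFace(f);
        for (int v = 0; v < face.getNumVertices(); v++)
        {
            scratch[n++] = face.getVertexIndex(v);
            if (n == CHUNK)
            {
                target.put(scratch, 0, n); // bulk put of one full chunk
                n = 0;
            }
        }
    }
    if (n > 0)
    {
        target.put(scratch, 0, n); // flush the remainder
    }
}

This caps the extra allocation at a constant CHUNK ints per call, at the cost of several bulk puts instead of one.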

The 26% slower for the big model on Desktop does not look so good...

It could be that IntBuffer is more efficient on desktop VMs, or it could be the GC. I'll work on this tomorrow.

Are these runs with the test that I posted above, or did you create an own test/benchmark class?

I wrote the test in Kotlin, but in principle it does the same thing:

val filenames = listOf(
    "obj_files/4506B/4506B.obj",
    "obj_files/4507B-B001-B002/4507B.obj",
    "obj_files/4507B003/4507B003.obj",
    "obj_files/4507B004-B005-B006/4507B004.obj",
    "obj_files/4508B/4508B.obj",
    "obj_files/4524B/4524B.obj",
    "obj_files/4528B/4528B.obj",
    "obj_files/4529B/4529B.obj",
    "obj_files/4535B/4535B.obj",
)

val files = filenames.map {
    val inputStream = javaClass.classLoader!!.getResourceAsStream(it)
    ObjReader.read(inputStream)
}

//val files = listOf("bugatti/bugatti.obj").map {
//    val inputStream = javaClass.classLoader!!.getResourceAsStream(it)
//    ObjReader.read(inputStream)
//}

repeat(500) {
    var totalTime = 0L
    var sum = 0f
    var time = System.nanoTime()
    for(obj in files) {
        val indices = ObjData.getFaceVertexIndices(obj)
        val vertices = ObjData.getVertices(obj)

        sum += indices.get(0)
        sum += indices.get(1)
        sum += vertices.get(0)
        sum += vertices.get(1)

        val now = System.nanoTime()
        totalTime += (now - time)
        time = now
    }
    println("TOTAL TIME $totalTime")
}

javagl commented 1 year ago

Thanks for these updates.

A few high-level points about how strongly I'd consider merging this PR:

The latter may affect the performance, but I don't have a clue how much. I'll probably try to allocate some time (maybe during the weekend) to play with all this in more detail (although I'd only run tests on Desktop - so my focus would roughly be to keep the functionality that makes it ~60% faster for the small/Android case, and try to avoid the 25% slowdown for the large/Desktop case...)

I wonder if there is an easy solution to this, to avoid allocating a new array every time, but still get the benefits of the bulk put. Maybe a smaller array (like 128 bytes) inserted into the IntBuffer multiple times instead of the full-size array?

The GC is a tricky beast. And these GC pauses may very well be an artifact of the artificial test: Of course there's a lot of garbage generated when val vertices = ObjData.getVertices(obj) is called 500 times and the result is thrown away immediately. Maybe I'll also try to come up with a "more realistic" benchmark, but ... that's tricky (I have no idea how exactly people are using this library...).
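If a more controlled measurement is wanted, JMH's single-shot mode would match the "extract once after loading" usage pattern. A sketch (the class name and the example.obj file are placeholders):

import java.io.FileInputStream;
import java.nio.FloatBuffer;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import de.javagl.obj.Obj;
import de.javagl.obj.ObjData;
import de.javagl.obj.ObjReader;

@State(Scope.Thread)
@BenchmarkMode(Mode.SingleShotTime)
public class ObjDataSingleShotBenchmark
{
    private Obj obj;

    @Setup(Level.Iteration)
    public void setup() throws Exception
    {
        obj = ObjReader.read(new FileInputStream("example.obj"));
    }

    @Benchmark
    public FloatBuffer getVertices()
    {
        return ObjData.getVertices(obj);
    }
}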

I already tried to avoid generating garbage inside the Obj library. Iff GC pauses are caused by garbage that is generated in the library, one could try to avoid that, with some ThreadLocal<float[]> scratchBuffer or whatnot, but I don't have a place in mind where this actually happens...
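For illustration, such a scratch buffer could look like this (the names here are made up, not library API; the usual caveats apply - the array may be longer than requested, and the reference must not escape the calling thread):

// A per-thread scratch array that grows as needed, so that repeated
// calls do not produce fresh garbage on every invocation:
private static final ThreadLocal<float[]> SCRATCH =
    ThreadLocal.withInitial(() -> new float[0]);

private static float[] acquireScratch(int minSize)
{
    float[] array = SCRATCH.get();
    if (array.length < minSize)
    {
        array = new float[minSize];
        SCRATCH.set(array);
    }
    return array;
}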