spnda opened 2 years ago
- It is unsafe to assume that the builtin implementation is the fallback. On my laptop, it is arm64. It should not be necessary.
- The count_elements() calls that the code keeps computing are probably harming performance.
- Try compiling simdjson with SIMDJSON_VERBOSE_LOGGING and inspect the resulting log to see whether it looks sensible.
- Doing some profiling to see whether you have specific bottlenecks could be great (e.g., perf record/perf report).
simdjson::ondemand is an alias for the builtin implementation, so I can't just set it to anything, as you've said. Is there any recommended way of forward declaring that stuff?
Also, sorry for the late response; I had a lot of stuff to do for school. I really appreciate your help.
Thanks.
The functions that have to do with simdjson don’t have to be declared in your header file, do they?
Or, at least, they don’t have to be in a public header file?
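One common way to do that (a minimal sketch with hypothetical names, not fastgltf's actual API) is to hide the simdjson state behind an opaque pointer, so that simdjson.h is only included in the implementation file:

// parser.hpp: public header, no simdjson include needed
#include <memory>
#include <string_view>

class JsonParser {
public:
    JsonParser();
    ~JsonParser();  // must be defined where Impl is complete
    bool parse(std::string_view json);

private:
    struct Impl;                 // holds the simdjson parser, defined in the .cpp
    std::unique_ptr<Impl> impl;
};

// parser.cpp: implementation file
#include "simdjson.h"

struct JsonParser::Impl {
    simdjson::ondemand::parser parser;
};

JsonParser::JsonParser() : impl(std::make_unique<Impl>()) {}
JsonParser::~JsonParser() = default;

bool JsonParser::parse(std::string_view json) {
    simdjson::padded_string padded(json);  // simdjson requires padded input
    auto doc = impl->parser.iterate(padded);
    return doc.error() == simdjson::SUCCESS;
}

This avoids forward declaring simdjson types at all; only the translation unit that does the parsing sees them.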
I'm on my Windows machine right now, so I can't give much info.
If you are compiling under Windows, I recommend using Clang if possible. See
Thanks for the read. I've gone ahead and run my benchmarks using Clang-CL 16 (the clang-cl from the LLVM downloads). It has actually increased the runtime speed of the fastgltf cases, often fourfold, with both master and this branch. Though this PR's branch is still consistently lagging behind.
It has actually increased the runtime speed of the fastgltf cases, often fourfold, with both master and this branch.
I see something like a factor of two... but yeah, Visual Studio has disappointing performance in some cases I care about.
@lemire I've revisited this PR because I want to add support for #41, which I think I want to implement using raw JSON tokens, as provided by the on-demand API.
Using verbose logging I already managed to restructure my code so that there are no skips at all during the entire parsing process. I've looked through some things using Xcode Instruments and have found some issues. I fixed some of the issues with allocations and vector resizing where it wasn't needed, but this one stumps me (there's more than just the preview shown; also, ignore the return at the top of the function, which I accidentally committed): https://github.com/spnda/fastgltf/blob/fc5b3c2e9a105bed89332f6187b3d3d41a4a69eb/src/fastgltf.cpp#L1330-L1400
This lambda function takes up 36% of the entire average runtime of the parseAccessors function. It essentially just reads a number from an array, checks whether it's a double, float, or integer, and then converts it appropriately. Is there some better way to do this with simdjson, so that its runtime is not so excessive? Do you have any suggestions for improving that function?
@spnda Having a look.
The whole function does a lot of complicated work.
If you turn it into something like ...
auto& v = std::get<double_vec>(variant);
for (auto element : array) {
    // error handling as needed
    v.push_back(element.get_double());
}
You should be pretty close to being limited by the number-parsing speed. If not, then you have other overhead.
My recommendation is to try to break down your code into simpler functions. There is a lot going on.
It may help also to include a standard benchmark as part of the project. If I could run a benchmark easily, I might be able to give you more insights. Just looking at high level code buys only so much. I need to know what the code is doing in practice.
This part is important: I recommend including a standard benchmark as part of your project. If I could just do...
cmake --build build
./build/benchmark
That would be tremendously useful. Not just to me but to anyone who cares about performance.
Feel free to steal some of my benchmarking code:
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28/benchmarks
My recommendation is to rely more on experiments and repeated high-quality benchmarks than on profiling. Profilers mislead us all the time. They may point at a potential direction, but that's all. There are exceptions: if you do pure numerical analysis, you may find 3 lines of code where your code is spending 99% of its time, and then you rewrite them in assembly and you are done... but for a problem such as this one, experiments are very likely to be the way forward. One needs to gain insights into what is costing what.
For example, how much does the get_number() function cost as opposed to get_double()? It takes just one experiment to find out.
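As a sketch of such an experiment (the file name numbers.json and the time_seconds helper are hypothetical; assumes a JSON array of numbers and simdjson built with exceptions enabled), one could time both accessors over the same input:

#include <chrono>
#include <cstdio>
#include "simdjson.h"

using namespace simdjson;

template <typename F>
double time_seconds(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    padded_string json = padded_string::load("numbers.json");  // hypothetical input
    ondemand::parser parser;
    double sum = 0;  // checksum so the compiler cannot elide the parsing

    double t_double = time_seconds([&] {
        ondemand::document doc = parser.iterate(json);
        for (auto element : doc.get_array()) {
            sum += double(element.get_double());
        }
    });

    double t_number = time_seconds([&] {
        ondemand::document doc = parser.iterate(json);
        for (auto element : doc.get_array()) {
            // get_number() classifies the value, so we branch on its type.
            ondemand::number n = element.get_number();
            switch (n.get_number_type()) {
                case ondemand::number_type::floating_point_number: sum += n.get_double(); break;
                case ondemand::number_type::signed_integer: sum += double(n.get_int64()); break;
                case ondemand::number_type::unsigned_integer: sum += double(n.get_uint64()); break;
                default: break;
            }
        }
    });

    printf("get_double: %f s, get_number: %f s (checksum %f)\n", t_double, t_number, sum);
}

Run each variant several times and keep the best time; a single run is noisy.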
This part is important: I recommend including a standard benchmark as part of your project. If I could just do...
cmake --build build
./build/benchmark
That would be tremendously useful. Not just to me but to anyone who cares about performance.
I already include benchmarks (which I'm also using locally right now) that use the Catch2 testing framework. I should probably add something to the README about how to build and run the tests...
This is how you'd build and run the benchmarks. Note that they require downloading sample assets, because I currently just benchmark an entire run over a JSON asset. I mostly wrote them to work well for me locally and for the CI.
python fetch_test_deps.py
cmake -B build -DFASTGLTF_ENABLE_TESTS=ON
cmake --build build --target fastgltf_tests
build/tests/fastgltf_tests "[gltf-benchmark]"
(note that the quotation marks in the last line might just be a requirement of zsh, which I'm using locally, so you might not need them).
I already include benchmarks
The parseAccessors function is a tiny component of this benchmark.
@spnda Here is my recommendation.
Take your current code, do all the parsing and error handling, but do not touch your data structures. Do not call reserve, emplace_back, or push_back, at least as far as the numbers are concerned.
Next, do the reverse. Do not call get_number... Iterate through the array, but do not parse the values. Just add random numbers to your data structures (say, random numbers between 0 and 1).
And then compare with the current code.
Lay out the three results. This should tell you a lot about where the time goes.
My general point is that you need to run experiments and collect data.
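For concreteness, a rough sketch of the two variants (hypothetical function names; assumes an ondemand array of numbers and simdjson exceptions enabled):

#include <random>
#include <vector>
#include "simdjson.h"

using namespace simdjson;

// Variant A: parse every number, but never touch the output container.
// The checksum keeps the compiler from optimizing the parsing away.
double parse_only(ondemand::array array) {
    double checksum = 0;
    for (auto element : array) {
        checksum += double(element.get_double());
    }
    return checksum;
}

// Variant B: iterate without parsing; fill the container with random numbers
// between 0 and 1 instead. ondemand skips values that are never consumed.
void store_only(ondemand::array array, std::vector<double>& out) {
    std::mt19937_64 rng{42};
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for ([[maybe_unused]] auto element : array) {
        out.push_back(dist(rng));
    }
}

Comparing the timings of Variant A, Variant B, and the unmodified code should show how much of the time goes to number parsing versus container construction.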
Well, big thanks for taking the time to look over my project. I'll do some testing along the lines of what you suggested. I'm not knowledgeable about these low-level optimizations, so much of this is still new to me. Using profiling, I managed to find a few issues in my code apart from the parseMinMax lambda, which has already had a big impact on the overall speed. The original question of why DOM was faster is mostly solved now, at least on my arm64 M3. I still expect the ondemand code to be quicker than it currently is, but any improvement is welcome right now. On my 5800X, however, I still have a measurable performance regression in all cases. I guess I'll just have to investigate on both chips at the same time to see what the differences are. Another user also suggested on Discord that the slowdown might be cache/memory related, which I have yet to test and confirm.
@spnda The reason I raise these questions is that software performance is complex and requires hard numbers and experimentation.
For example, just by tracing the code without too much attention, I noticed that you are often storing a single integer in a std::vector which is itself inside a std::variant. This is fine if that's what you need, but constructing these data structures is expensive. And the parsing routines themselves may not carry much weight.
But it is impossible to tell without experimentation, and profiling won't tell you.
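For illustration (hypothetical types, not fastgltf's actual declarations), the pattern in question looks roughly like this; even when only a single integer is stored, a std::vector is constructed and heap-allocates:

#include <cstdint>
#include <variant>
#include <vector>

// A variant of vectors, similar in spirit to the pattern described above.
using number_list = std::variant<std::vector<int64_t>, std::vector<double>>;

number_list make_single(int64_t value) {
    // Constructing the vector heap-allocates even for one element, and the
    // variant adds a discriminant; for millions of small lists this adds up.
    return std::vector<int64_t>{value};
}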
In 0fe58d757627d0c3cac9ad36a5e60b29281483d9 we switched from the on-demand API to the DOM API, as it was slightly faster. This PR switches back to the on-demand API while using a different approach to parsing, hopefully increasing speed over the current DOM-based code.
Design
The first iteration of the new parsing interface also helped with the DOM API, which is why that part was already merged separately. However, this patch also tries to port the idea of iterating over all fields linearly to every function. Currently, we only iterate over the fields of the root of the JSON document, using hashed strings to speed up the switch. In every other function, the order of the fields may still be the exact opposite (worst case) of the order in which we read them, which is why I want to use this idea everywhere in fastgltf.
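As a rough sketch of that idea (assumes simdjson exceptions enabled; "name" and "mesh" are standard glTF node fields, but the structure is illustrative rather than fastgltf's actual code), visiting fields in document order means the parser never has to seek backwards:

#include <cstdint>
#include <string_view>
#include "simdjson.h"

using namespace simdjson;

void parseNode(ondemand::object object) {
    // Visit the fields in whatever order they appear in the JSON, so the
    // parser only ever moves forward.
    for (auto field : object) {
        std::string_view key = field.unescaped_key();
        if (key == "name") {
            std::string_view name = field.value().get_string();
            // ... store the name ...
            (void)name;
        } else if (key == "mesh") {
            uint64_t mesh = field.value().get_uint64();
            // ... store the mesh index ...
            (void)mesh;
        }
        // Unknown fields are skipped automatically when iteration advances.
    }
}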
Performance
Performance has gotten worse, which is why this is a draft PR; I will look into this drawback later on. From a quick profiling run, it seems the ondemand::parser::iterate function takes a considerable amount of time, while the rest executes within at most 500 microseconds.
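One thing that may be worth checking here (an assumption, not a confirmed diagnosis): ondemand::parser::iterate allocates internal buffers when the parser first sees a large enough document, so constructing a fresh parser per document pays that cost every time. simdjson recommends reusing a single parser instance, roughly like this:

#include <string_view>
#include "simdjson.h"

using namespace simdjson;

// Reuse one parser across documents: its internal buffers are allocated once
// and recycled on subsequent iterate() calls, instead of being reallocated
// for every asset.
ondemand::parser parser;

std::string_view parseVersion(const padded_string& json) {
    ondemand::document doc = parser.iterate(json);
    // "asset" and "version" are standard glTF fields; error handling elided.
    // Note: the returned string_view points into the parser's buffer and is
    // only valid until the next iterate() call.
    return doc["asset"]["version"].get_string();
}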