spnda opened 2 years ago
- It is unsafe to assume that the builtin implementation is the fallback. On my laptop, it is arm64. It should not be necessary.
- The count_elements() calls that the code keeps computing are probably harming performance.
- Try compiling simdjson with SIMDJSON_VERBOSE_LOGGING and inspect the resulting log to see whether it looks sensible.
- Doing some profiling to see whether you have specific bottlenecks could be great (e.g., perf record/perf report).
simdjson::ondemand is an alias for the builtin implementation, so I can't just set it to anything, as you've said. Is there any recommended way of forward declaring that stuff?
Also, sorry for the late response; I had a lot of stuff to do for school. I really appreciate your help.
Thanks.
The functions that have to do with simdjson don’t have to be declared in your header file, do they?
Or, at least, they don’t have to be in a public header file?
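One common way to do that (a minimal sketch with hypothetical names, not fastgltf's actual API) is to hide the simdjson state behind an opaque pointer, so that simdjson.h is only included in the implementation file:

// parser.hpp: public header, no simdjson include needed
#include <memory>
#include <string_view>

class JsonParser {
public:
    JsonParser();
    ~JsonParser();  // must be defined where Impl is complete
    bool parse(std::string_view json);

private:
    struct Impl;                 // holds the simdjson parser, defined in the .cpp
    std::unique_ptr<Impl> impl;
};

// parser.cpp: implementation file
#include "simdjson.h"

struct JsonParser::Impl {
    simdjson::ondemand::parser parser;
};

JsonParser::JsonParser() : impl(std::make_unique<Impl>()) {}
JsonParser::~JsonParser() = default;

bool JsonParser::parse(std::string_view json) {
    simdjson::padded_string padded(json);  // simdjson requires padded input
    auto doc = impl->parser.iterate(padded);
    return doc.error() == simdjson::SUCCESS;
}

This avoids forward declaring simdjson types at all; only the translation unit that does the parsing sees them.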
I'm on my Windows machine right now, so I can't give much info.
If you are compiling under Windows, I recommend using Clang if possible. See
Thanks for the read. I've gone ahead and run my benchmarks using Clang-CL 16 (the clang-cl from the LLVM downloads). It has actually increased the runtime speed of the fastgltf cases, often fourfold, with both master and this branch. Though this PR's branch is still consistently lagging behind.
It has actually increased the runtime speed of the fastgltf cases, often fourfold, with both master and this branch.
I see something like a factor of two... but yeah, Visual Studio has disappointing performance in some cases I care about.
@lemire I've revisited this PR because I want to add support for #41, which I think I want to implement using raw JSON tokens, as provided by the on-demand API.
Using verbose logging I already managed to restructure my code so that there are no skips at all during the entire parsing process. I've looked through some things using Xcode Instruments and have found some issues. I fixed some of the issues with allocations and vector resizing where it wasn't needed, but this one stumps me (there's more than just the preview shown; also, ignore the return at the top of the function, which I accidentally committed): https://github.com/spnda/fastgltf/blob/fc5b3c2e9a105bed89332f6187b3d3d41a4a69eb/src/fastgltf.cpp#L1330-L1400
This lambda function takes up 36% of the entire average runtime of the parseAccessors function. It essentially just reads a number from an array, checks whether it's a double, float, or integer, and then converts it appropriately. Is there some better way to do this with simdjson, so that its runtime is not so excessive? Do you have any suggestions for improving that function?
@spnda Having a look.
The whole function does a lot of complicated work.
If you turn it into something like ...
auto& v = std::get<double_vec>(variant);
for (auto element : array) {
    // error handling as needed
    v.push_back(element.get_double());
}
You should be pretty close to being limited by the number-parsing speed. If not, then you have other overhead.
My recommendation is to try to break down your code into simpler functions. There is a lot going on.
It may help also to include a standard benchmark as part of the project. If I could run a benchmark easily, I might be able to give you more insights. Just looking at high level code buys only so much. I need to know what the code is doing in practice.
This part is important: I recommend including a standard benchmark as part of your project. If I could just do...
cmake --build build
./build/benchmark
That would be tremendously useful. Not just to me but to anyone who cares about performance.
Feel free to steal some of my benchmarking code:
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28/benchmarks
My recommendation is to rely more on experiments and repeated high-quality benchmarks than on profiling. Profilers mislead us all the time. They may point at a potential direction, but that's all. There are exceptions: if you do pure numerical analysis, you may find 3 lines of code where your code is spending 99% of its time, and then you rewrite them in assembly and you are done... but for a problem such as this one, experiments are very likely to be the way forward. One needs to gain insights into what is costing what.
For example, how much does the get_number() function cost as opposed to get_double()? It takes just one experiment to find out.
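As a sketch of such an experiment (the file name numbers.json and the time_seconds helper are hypothetical; assumes a JSON array of numbers and simdjson built with exceptions enabled), one could time both accessors over the same input:

#include <chrono>
#include <cstdio>
#include "simdjson.h"

using namespace simdjson;

template <typename F>
double time_seconds(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    padded_string json = padded_string::load("numbers.json");  // hypothetical input
    ondemand::parser parser;
    double sum = 0;  // checksum so the compiler cannot elide the parsing

    double t_double = time_seconds([&] {
        ondemand::document doc = parser.iterate(json);
        for (auto element : doc.get_array()) {
            sum += double(element.get_double());
        }
    });

    double t_number = time_seconds([&] {
        ondemand::document doc = parser.iterate(json);
        for (auto element : doc.get_array()) {
            // get_number() classifies the value, so we branch on its type.
            ondemand::number n = element.get_number();
            switch (n.get_number_type()) {
                case ondemand::number_type::floating_point_number: sum += n.get_double(); break;
                case ondemand::number_type::signed_integer: sum += double(n.get_int64()); break;
                case ondemand::number_type::unsigned_integer: sum += double(n.get_uint64()); break;
                default: break;
            }
        }
    });

    printf("get_double: %f s, get_number: %f s (checksum %f)\n", t_double, t_number, sum);
}

Run each variant several times and keep the best time; a single run is noisy.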
This part is important: I recommend including a standard benchmark as part of your project. If I could just do...
cmake --build build
./build/benchmark
That would be tremendously useful. Not just to me but to anyone who cares about performance.
I already include benchmarks (which I'm also using locally right now) that use the Catch2 testing framework. I should probably add something to the README about how to build and run the tests...
This is how you'd build and run the benchmarks. Note that they require downloading sample assets, because I currently just benchmark an entire run over a JSON asset. I mostly wrote them to work well for me locally and for the CI.
python fetch_test_deps.py
cmake -B build -DFASTGLTF_ENABLE_TESTS=ON
cmake --build build --target fastgltf_tests
build/tests/fastgltf_tests "[gltf-benchmark]"
(note that the quotation marks in the last line might just be a requirement of zsh, which I'm using locally, so you might not need them).
I already include benchmarks
The parseAccessors function is a tiny component of this benchmark.
@spnda Here is my recommendation.
Take your current code, do all the parsing and error handling, but do not touch your data structures. Do not call reserve, emplace_back, or push_back, at least as far as the numbers are concerned.
Next, do the reverse. Do not call get_number... Iterate through the array, but do not parse the values. Just add random numbers to your data structures (say, random numbers between 0 and 1).
And then compare with the current code.
Lay out the three results. This should tell you a lot about where the time goes.
My general point is that you need to run experiments and collect data.
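For concreteness, a rough sketch of the two variants (hypothetical function names; assumes an ondemand array of numbers and simdjson exceptions enabled):

#include <random>
#include <vector>
#include "simdjson.h"

using namespace simdjson;

// Variant A: parse every number, but never touch the output container.
// The checksum keeps the compiler from optimizing the parsing away.
double parse_only(ondemand::array array) {
    double checksum = 0;
    for (auto element : array) {
        checksum += double(element.get_double());
    }
    return checksum;
}

// Variant B: iterate without parsing; fill the container with random numbers
// between 0 and 1 instead. ondemand skips values that are never consumed.
void store_only(ondemand::array array, std::vector<double>& out) {
    std::mt19937_64 rng{42};
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for ([[maybe_unused]] auto element : array) {
        out.push_back(dist(rng));
    }
}

Comparing the timings of Variant A, Variant B, and the unmodified code should show how much of the time goes to number parsing versus container construction.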
Well, big thanks for taking the time to look over my project. I'll do some testing along the lines of what you suggested. I'm not knowledgeable about these low-level optimizations, so much of this is still new to me. Using profiling, I managed to find a few issues in my code apart from the parseMinMax lambda, which has already had a big impact on the overall speed. The original question of why DOM was faster is mostly solved now, at least on my arm64 M3. I still expect the ondemand code to be quicker than it currently is, but any improvement is welcome right now. On my 5800X, however, I still have a measurable performance regression in all cases. I guess I'll just have to investigate on both chips at the same time to see what the differences are. Another user also suggested on Discord that the slowdown might be cache/memory related, which I have yet to test and confirm.
@spnda The reason I raise these questions is that software performance is complex and requires hard numbers and experimentation.
For example, just by tracing the code without too much attention, I noticed that you are often storing a single integer in a std::vector which is itself inside a std::variant. This is fine if that's what you need, but constructing these data structures is expensive. And the parsing routines themselves may not carry much weight.
But it is impossible to tell without experimentation, and profiling won't tell you.
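For illustration (hypothetical types, not fastgltf's actual declarations), the pattern in question looks roughly like this; even when only a single integer is stored, a std::vector is constructed and heap-allocates:

#include <cstdint>
#include <variant>
#include <vector>

// A variant of vectors, similar in spirit to the pattern described above.
using number_list = std::variant<std::vector<int64_t>, std::vector<double>>;

number_list make_single(int64_t value) {
    // Constructing the vector heap-allocates even for one element, and the
    // variant adds a discriminant; for millions of small lists this adds up.
    return std::vector<int64_t>{value};
}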
In 0fe58d757627d0c3cac9ad36a5e60b29281483d9 we switched from the on-demand API to the DOM API, as it was slightly faster. This PR switches back to the on-demand API while using a different approach to parsing, hopefully increasing speed over the current DOM-based code.
Design
The first iteration of the new parsing interface also helped with the DOM API, which is why that part was already merged separately. However, this patch also tries to port the idea of iterating over all fields linearly to every function. Currently, we only iterate over the fields of the root of the JSON document, using hashed strings to speed up the switch. In every other function, the order of the fields may still be the exact opposite (worst case) of the order in which we read them, which is why I want to use this idea everywhere in fastgltf.
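As a rough sketch of that idea (assumes simdjson exceptions enabled; "name" and "mesh" are standard glTF node fields, but the structure is illustrative rather than fastgltf's actual code), visiting fields in document order means the parser never has to seek backwards:

#include <cstdint>
#include <string_view>
#include "simdjson.h"

using namespace simdjson;

void parseNode(ondemand::object object) {
    // Visit the fields in whatever order they appear in the JSON, so the
    // parser only ever moves forward.
    for (auto field : object) {
        std::string_view key = field.unescaped_key();
        if (key == "name") {
            std::string_view name = field.value().get_string();
            // ... store the name ...
            (void)name;
        } else if (key == "mesh") {
            uint64_t mesh = field.value().get_uint64();
            // ... store the mesh index ...
            (void)mesh;
        }
        // Unknown fields are skipped automatically when iteration advances.
    }
}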
Performance
Performance has gotten worse, which is why this is a draft PR; I will look into this drawback later on. From a quick profiling run, it seems the ondemand::parser::iterate function takes a considerable amount of time, while the rest executes within at most 500 microseconds.
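One thing that may be worth checking here (an assumption, not a confirmed diagnosis): ondemand::parser::iterate allocates internal buffers when the parser first sees a large enough document, so constructing a fresh parser per document pays that cost every time. simdjson recommends reusing a single parser instance, roughly like this:

#include <string_view>
#include "simdjson.h"

using namespace simdjson;

// Reuse one parser across documents: its internal buffers are allocated once
// and recycled on subsequent iterate() calls, instead of being reallocated
// for every asset.
ondemand::parser parser;

std::string_view parseVersion(const padded_string& json) {
    ondemand::document doc = parser.iterate(json);
    // "asset" and "version" are standard glTF fields; error handling elided.
    // Note: the returned string_view points into the parser's buffer and is
    // only valid until the next iterate() call.
    return doc["asset"]["version"].get_string();
}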