Open MTCoster opened 11 months ago
I tried this change on a related parser combinator library where the bench is parsing a very large JSON file:
- Iterator: 9.5 ms
- Equality: 11.6 ms
Would you mind sharing the project and JSON file? I'd love to poke around at the binaries to see what's going on
I ran `cargo bench --bench json -- basic/canada`
Something I suspect: what your code gets inlined into has as much effect on performance as your actual code.
A possible alternative experiment is to not inline the compare call. Those inline annotations are a mix of ones inherited from nom and ones added later. In a lot of cases they dramatically helped performance, but I've also found cases where they hurt.
This is interesting, and it's always a good idea to revisit old optimisations to see if they still hold up. I'll look into it a bit. I suspect that if the simpler version is slower here, it is due to the overhead of calling into `bcmp`: `Compare` in nom is mainly used for very short strings, so for those a small loop might be faster.
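To make the short-string intuition concrete, here is a hedged sketch of the two shapes being contrasted (the function names are made up for this illustration; neither is a copy of nom's code). For a tag of two or three bytes, the loop can compile down to a couple of inline compares, whereas the slice-equality form is typically lowered to a `bcmp`/`memcmp` libcall:

```rust
/// Byte-by-byte loop: for a tiny `tag`, this can become a few inline compares.
fn starts_with_loop(input: &[u8], tag: &[u8]) -> bool {
    if input.len() < tag.len() {
        return false;
    }
    for i in 0..tag.len() {
        if input[i] != tag[i] {
            return false;
        }
    }
    true
}

/// Slice equality: typically lowered to an out-of-line `bcmp`/`memcmp` call.
fn starts_with_slice(input: &[u8], tag: &[u8]) -> bool {
    input.len() >= tag.len() && &input[..tag.len()] == tag
}
```

Whether the call overhead outweighs the libcall's wide-compare tricks is exactly the question the benchmark above is probing.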
Back in 2018, a commit of "various optimisations" (8a977eea083487cea36c495192a1239682926ace) was added. One of the changes was to switch `impl Compare<&[u8]> for &[u8]` (lifetimes omitted) from a straight byte-slice comparison to an iterator-based approach. However, I'm not convinced that this was actually an optimisation (or at least, if it once was, I don't think it is any more).
## An analysis
Godbolt link for reference.
Let's look at two functions: one a straight copy of the existing code (`compare_current`), and one an almost-exact copy (now using `std::cmp::min()`) of the original ("unoptimised") code (`compare_simpler`). I've replaced `&self` with another parameter to make these standalone functions.

On x86_64, these are compiled (at `-C opt-level=3`) to:

The gist here is that the current implementation compiles to a tight fully-rolled loop, while the simpler implementation uses a single `bcmp` (an LLVM builtin, lowered from `memcmp` since LLVM 9).

I have not done my own micro-benchmarking of these two functions, but it does seem unlikely that a tight loop can perform better than the compiler's solution.
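For readers who don't follow the Godbolt link, here is a hedged reconstruction of what the two standalone functions might look like. The `CompareResult` enum mirrors nom's, but the exact bodies are assumptions based on the description above, not a copy of either version of the real code:

```rust
// Assumed stand-in for nom's CompareResult.
#[derive(Debug, PartialEq)]
enum CompareResult {
    Ok,
    Incomplete,
    Error,
}

/// Iterator-based shape (mirrors the current, post-2018 implementation).
fn compare_current(a: &[u8], b: &[u8]) -> CompareResult {
    match a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        Some(_) => CompareResult::Error,
        None if a.len() >= b.len() => CompareResult::Ok,
        None => CompareResult::Incomplete,
    }
}

/// Simpler shape: straight slice comparison over the common prefix,
/// using std::cmp::min() for the prefix length.
fn compare_simpler(a: &[u8], b: &[u8]) -> CompareResult {
    let len = std::cmp::min(a.len(), b.len());
    if a[..len] != b[..len] {
        CompareResult::Error
    } else if a.len() >= b.len() {
        CompareResult::Ok
    } else {
        CompareResult::Incomplete
    }
}
```

The two are observably equivalent; only the codegen differs, which is the whole point of the comparison.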
Another (more interesting?) comparison can be made when `b` has a fixed size. I also had to make `a` a fixed size to get the compiler to change anything, though, so this may be irrelevant for the use case of `Compare`. I suspect this is because any attempt at optimising the comparison for a variable-sized input would duplicate a not-insignificant part of `bcmp`, and is thus pointless, since such an implementation could be inlined by LTO.

Given these two wrapper functions, Rust always inlines the body:
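The wrapper shape in question might look something like this (the sizes and names are hypothetical; the point is only that when both lengths are known at compile time, the body is inlined and LLVM can specialise the comparison for those sizes instead of emitting an out-of-line `bcmp` call):

```rust
/// Variable-length prefix equality (the "simpler" shape discussed above).
fn prefix_eq(a: &[u8], b: &[u8]) -> bool {
    let len = std::cmp::min(a.len(), b.len());
    a[..len] == b[..len]
}

/// Hypothetical wrapper with both lengths fixed at compile time (sizes 8 and
/// 4 are arbitrary). Rust inlines the body, so LLVM sees constant lengths.
fn prefix_eq_fixed(a: &[u8; 8], b: &[u8; 4]) -> bool {
    prefix_eq(a, b)
}
```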
Uhh...
It seems as though the loop made LLVM unable to reason about the intent, and the best it could do was unroll it rather than fully optimise it the way it could the simpler implementation.
For the curious, the story is basically the same on `aarch64`, except the `memcmp` isn't lowered to the (potentially faster) `bcmp` for some reason.