I've been using Intel's VTune to look at the code and found some really
dumb performance improvements that I should have spotted long ago:
The compiler didn't know that the json_parse_state_s struct, which is
passed by pointer to all methods, will not be modified outwith the
current call tree, and so it was constantly reloading the state->offset
and state->size members. In the functions where I loop through
state->src using the offset and size, I now cache the offset and size in
local variables, which means they are kept in registers for the entire
run of those functions.
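
A minimal sketch of that caching pattern, not the actual json.h code: the
struct below is a hypothetical, cut-down stand-in for json_parse_state_s,
and skip_whitespace is an illustrative helper name. In a loop this small
the compiler may already avoid the reloads; the benefit shows up when the
loop body also stores through pointers or calls other functions.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for json_parse_state_s. */
struct parse_state {
  const char *src;
  size_t size;
  size_t offset;
};

/* Skips whitespace; returns 1 if any input remains afterwards, else 0. */
static int skip_whitespace(struct parse_state *state) {
  /* Copy the hot members into locals so they can stay in registers. */
  const char *const src = state->src;
  const size_t size = state->size;
  size_t offset = state->offset;

  while (offset < size && (src[offset] == ' ' || src[offset] == '\t' ||
                           src[offset] == '\r' || src[offset] == '\n')) {
    offset++;
  }

  /* Write the updated offset back exactly once, on the way out. */
  state->offset = offset;
  return offset < size;
}
```

The only store to the struct is the final write-back, so nothing inside
the loop can force the compiler to assume the members have changed.
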
Optimized some of the hot branches so that their misprediction rate
dropped significantly.
In the string parsing functions I support both ' and " as string quotes,
but I was re-checking which quote was in use multiple times, which
caused branch mispredicts. Now I store the opening quote character once
and just compare against that.
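
Roughly what that looks like, as a sketch under assumptions rather than
the real string parser: string_length is a hypothetical helper that
measures the body of a quoted string, and the escape handling is
simplified to skipping one character after a backslash.

```c
#include <stddef.h>

/* Measures the body of a quoted string that starts at src[0].
   Returns 1 and stores the body length on success, 0 if src does not
   start with a quote or the string is unterminated. */
static int string_length(const char *src, size_t size, size_t *out_length) {
  char quote;
  size_t offset;

  if (size == 0 || (src[0] != '"' && src[0] != '\'')) {
    return 0;
  }

  /* Decide which quote character is in use exactly once. */
  quote = src[0];
  offset = 1;

  /* The hot loop compares against the stored quote only. */
  while (offset < size && src[offset] != quote) {
    if (src[offset] == '\\' && offset + 1 < size) {
      offset++; /* step over the escaped character */
    }
    offset++;
  }

  if (offset >= size) {
    return 0; /* ran out of input before the closing quote */
  }

  *out_length = offset - 1;
  return 1;
}
```

The loop condition has a single comparison against quote, instead of
re-deciding between the two quote characters on every character.
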
In json_skip_all_skippables, instead of checking the flags_bitset value
for whether C style comments were supported on every iteration of the
loop (the compiler didn't realise flags_bitset wouldn't change!), I now
check it once and branch into two separate loops (one that does C style
comment handling, one without).
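
A sketch of hoisting the flag check out of the loop, assuming a
hypothetical flag name (ALLOW_C_STYLE_COMMENTS) and hypothetical helpers;
only /* ... */ comments are handled here, and the functions take plain
arguments rather than the real state struct.

```c
#include <stddef.h>

enum { ALLOW_C_STYLE_COMMENTS = 0x1 }; /* hypothetical flag value */

/* Returns the offset of the first non-whitespace character. */
static size_t skip_ws(const char *src, size_t size, size_t offset) {
  while (offset < size && (src[offset] == ' ' || src[offset] == '\t' ||
                           src[offset] == '\r' || src[offset] == '\n')) {
    offset++;
  }
  return offset;
}

/* If a block comment starts at offset, returns the offset just past it
   (or size if it is unterminated); otherwise returns offset unchanged. */
static size_t skip_block_comment(const char *src, size_t size, size_t offset) {
  if (offset + 1 < size && src[offset] == '/' && src[offset + 1] == '*') {
    offset += 2;
    while (offset + 1 < size &&
           !(src[offset] == '*' && src[offset + 1] == '/')) {
      offset++;
    }
    offset = (offset + 1 < size) ? offset + 2 : size;
  }
  return offset;
}

static size_t skip_all_skippables(const char *src, size_t size, size_t offset,
                                  unsigned flags) {
  /* Test the flag once, outside any loop. */
  if (flags & ALLOW_C_STYLE_COMMENTS) {
    /* Loop with comment handling: repeat until nothing more is skipped. */
    for (;;) {
      const size_t start = offset;
      offset = skip_ws(src, size, offset);
      offset = skip_block_comment(src, size, offset);
      if (offset == start) {
        return offset;
      }
    }
  }

  /* Loop without comment handling: whitespace only. */
  return skip_ws(src, size, offset);
}
```

Neither loop body looks at the flags any more, so the per-iteration
branch on the flag is gone.
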
Reordered the members of json_parse_state_s so that variables that are
used together are grouped together and land in the same cacheline.
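
To illustrate the idea only: the member list and ordering below are
hypothetical and are not the real json_parse_state_s layout, and the
64-byte line size is an assumption that holds on typical x86-64 parts.

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumption: typical x86-64 line size */

/* Hypothetical layout: hot members first, cold members last. */
struct parse_state_sketch {
  /* Hot: touched for almost every input character. */
  const char *src;
  size_t size;
  size_t offset;
  size_t flags_bitset;

  /* Cold: only needed for bookkeeping and error reporting. */
  size_t line_no;
  size_t line_offset;
  size_t error;
};

/* Returns 1 if the hot members span no more than one cache line. */
static int hot_members_fit_in_one_line(void) {
  const size_t first = offsetof(struct parse_state_sketch, src);
  const size_t last_end =
      offsetof(struct parse_state_sketch, flags_bitset) + sizeof(size_t);
  return (last_end - first) <= CACHE_LINE_SIZE;
}
```

The check only shows that the hot members are small enough to fit in one
line; whether they actually share a line at run time also depends on how
the struct itself is aligned.
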
Changed some loops containing switch statements in which the default
case was meant to break out of the switch AND the loop. They now set a
loop-local variable in the default case, then check it and break out of
the loop after the switch statement. This reduced branch mispredicts and
also made the layout of branches more sane.
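
The reworked loop shape, sketched with a hypothetical digit-counting
helper (count_leading_digits is illustrative and not a json.h function):

```c
#include <stddef.h>

/* Counts the run of decimal digits starting at offset. */
static size_t count_leading_digits(const char *src, size_t size,
                                   size_t offset) {
  size_t count = 0;

  while (offset < size) {
    int end = 0; /* loop-local flag set by the default case */

    switch (src[offset]) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      count++;
      offset++;
      break;
    default:
      end = 1; /* this break only leaves the switch */
      break;
    }

    /* A single, explicit loop exit after the switch. */
    if (end) {
      break;
    }
  }

  return count;
}
```

Every case leaves the switch the same way, and the loop has one exit
point after it, which avoids a goto or a return buried in the default
case.
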