Open Walnut356 opened 1 year ago
Updated for a more direct `get` method, i.e. just dereference the pointer directly since we already have access to it. This eliminates the dead code as described above, and also has slightly better codegen with fewer redundant checks. From what I can tell, the performance characteristics are just about identical: the CPI is slightly worse, but there are fewer instructions and less jumping, so it seems to even out.
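For illustration only, here is a minimal sketch of the idea written against the public `Bytes`/`Buf` API rather than as the actual diff; the free-function name, comments, and panic message are invented:

```rust
use bytes::{Buf, Bytes};

/// Sketch: read a big-endian u16 straight from the contiguous slice backing
/// `Bytes`, with a single panic path instead of the `copy_to_slice` fallback
/// that exists for potentially non-contiguous `Buf` implementations.
fn get_u16_direct(bytes: &mut Bytes) -> u16 {
    const SIZE: usize = core::mem::size_of::<u16>();
    // `Bytes` is always one contiguous slice, so a plain slice access with a
    // single bounds check is enough; `None` here can only mean "too short".
    let src = bytes
        .chunk()
        .get(..SIZE)
        .expect("not enough bytes remaining"); // the lone panic path
    let val = u16::from_be_bytes([src[0], src[1]]);
    bytes.advance(SIZE); // consume the bytes we just read
    val
}
```

The point is that for `Bytes` the only way the read can fail is "too few bytes remain", so one check plus a direct panic can stand in for the check-plus-copy-fallback pair that the generic path emits.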
For the hot functions in one of my libraries, this change reduces code size by quite a decent amount.
(All examples compiled via 1.73.0-x86_64-pc-windows-msvc in release mode)
Is this still applicable? I'm trying to reproduce this with `cargo-show-asm` and the outputs are very similar.
I guess the newer LLVM version is already smart enough to optimize it?
I think so. It mostly has to do with how things get inlined when `get_x` is called lots of times in succession. I did a quick check and the codegen before/after rebasing was ~identical. I ran it through the same function I did last time (which, admittedly, is a pretty extreme example, since it's a few dozen `.get_X` calls in a row in a tight loop); the source for that is here. All compilations were done on stable.

The major points of interest are the branch codegen still needing 33% fewer instructions than current master, and the section between `.LBB471_22` and `.LBB471_44`, which is ~400 straight instructions without any branching on the fast path, whereas current master still compiles to `Jcc` -> `JMP` combos, necessitating at least a few small jumps.
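For a sense of scale, a function with that shape might look like the following sketch (the 16-byte record layout and the function name are invented for illustration; this is not the linked source):

```rust
use bytes::{Buf, Bytes};

/// Sketch of the kind of stress test described above: a tight loop that
/// performs many sequential `get_X` calls on a single `Bytes` buffer.
fn sum_records(mut bytes: Bytes) -> u64 {
    let mut acc = 0u64;
    // Each iteration consumes 16 bytes: 2 + 2 + 4 + 8.
    while bytes.remaining() >= 16 {
        acc = acc
            .wrapping_add(bytes.get_u16() as u64)
            .wrapping_add(bytes.get_u16_le() as u64)
            .wrapping_add(bytes.get_u32() as u64)
            .wrapping_add(bytes.get_u64_le());
    }
    acc
}
```

With every `get_X` call inlined, each one contributes its own bounds check, which is why the presence or absence of the copy fallback shows up so clearly in the emitted branches.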
Without this change, there's quite a bit of dead code generated when using these functions with a `Bytes` object. This is because:

1. These functions are nearly always inlined in their entirety (including the call to `self.copy_to_slice`).
2. The entire conditional (see the sketch at the end of this description) seems to be meant to handle non-contiguous memory. `Bytes` is guaranteed to be contiguous, so whenever the `else` block is triggered it immediately panics on `self.copy_to_slice`'s first assert statement: `assert!(self.remaining() >= dst.len());`
3. Due to the above, the generated assembly for multiple sequential `bytes.get_x` calls is pretty messy, and it doesn't look like the compiler is having a fun time with it. This simple change removes the conditional and just panics directly when `slice.get` returns `None`, cleaning up the generated code a lot and allowing for a lot more optimizations (from what I observed).

I also copied the `Buf` `get_u16` tests over to the bytes tests to confirm the behavior still works as expected.
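For reference, the conditional in question has roughly the following shape (a hand-expanded paraphrase for `u16`, not a verbatim copy of the macro in the crate's source):

```rust
use bytes::Buf;

// Rough paraphrase of the generic `get_u16` fast path plus fallback.
fn get_u16_generic<B: Buf>(this: &mut B) -> u16 {
    const SIZE: usize = core::mem::size_of::<u16>();

    // Fast path: try to read directly out of the current contiguous chunk.
    // The value is computed into an `Option` first so that no borrow of
    // `this` is still live when `advance`/`copy_to_slice` need `&mut self`.
    let ret = this
        .chunk()
        .get(..SIZE)
        .map(|src| u16::from_be_bytes([src[0], src[1]]));

    if let Some(ret) = ret {
        this.advance(SIZE);
        ret
    } else {
        // Fallback meant for non-contiguous buffers: copy into a temporary.
        // For `Bytes`, reaching this branch can only mean "not enough bytes
        // remain", so it immediately trips `copy_to_slice`'s
        // `assert!(self.remaining() >= dst.len())` and panics.
        let mut buf = [0u8; SIZE];
        this.copy_to_slice(&mut buf);
        u16::from_be_bytes(buf)
    }
}
```

When this whole thing is inlined for a `Bytes` receiver, the `else` arm is the dead code described above: it can never produce a value, only panic.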