Open PureWhiteWu opened 10 months ago
Without looking at it in much detail, I suspect this is caused by niche optimizations triggering, introducing more complicated branches. `#[repr(C)]` may help as well.
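To illustrate what "niche optimization" means here, a small stand-alone sketch (not from the issue): `Option<NonZeroU8>` can hide its discriminant in the payload's invalid bit pattern, while `Option<u8>` needs an extra tag byte, and `#[repr(C)]` opts an enum out of such layout tricks entirely (the enum below is hypothetical):

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

// Hypothetical repr(C) enum: the tag becomes a separate C-style field, so
// niche/layout optimizations are disabled and the layout is predictable.
#[repr(C)]
#[allow(dead_code)]
enum TaggedC {
    A(u64),
    B(u8),
}

fn main() {
    // Niche optimization: None is encoded as the invalid value 0 of NonZeroU8.
    assert_eq!(size_of::<Option<NonZeroU8>>(), 1);
    // u8 has no invalid bit patterns, so a separate discriminant byte is needed.
    assert_eq!(size_of::<Option<u8>>(), 2);
    // repr(C): C-int tag + padding + 8-byte payload on typical 64-bit targets.
    assert_eq!(size_of::<TaggedC>(), 16);
}
```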
Hi, thanks very much for your reply! Do you mean that the compiler's alignment choices may cause some optimizations to be disabled?
I tried to minimize it into a godbolt: https://godbolt.org/z/zaeGe9Evh
Cleaning up the assembly by hand, and looking at the `-Zprint-type-sizes` output:
```
print-type-size type: `before::Repr`: 40 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `Inline`: 39 bytes
print-type-size         field `.len`: 1 bytes
print-type-size         field `.buf`: 38 bytes
print-type-size     variant `ArcStr`: 23 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 16 bytes, alignment: 8 bytes
print-type-size     variant `StaticStr`: 23 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 16 bytes, alignment: 8 bytes
print-type-size     variant `Bytes`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `ArcString`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `Empty`: 0 bytes
print-type-size type: `after::Repr`: 40 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Inline`: 32 bytes
print-type-size         field `.len`: 8 bytes
print-type-size         field `.buf`: 24 bytes
print-type-size     variant `ArcStr`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `StaticStr`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `Bytes`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `ArcString`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Empty`: 0 bytes
```
`before::Repr` gets a lot more padding, which probably correlates with the more verbose assembly.
Looking at the assembly more, I think the issue here comes from the alignment allowing the discriminant to be bigger. The bigger discriminant causes less code to be emitted, because the compiler doesn't have to carefully set just one byte to zero; it can simply write back the discriminant that it read. Even though it would be allowed to write a full 8-byte discriminant for every variant except `Inline`, it never does.
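A simplified, stand-alone sketch of the two layouts being compared (the `Bytes` variant is omitted and the field types are guessed from the `-Zprint-type-sizes` output above, so treat the exact definitions as assumptions):

```rust
use std::mem::{align_of, size_of};
use std::sync::Arc;

// "before": len is u8, the Inline payload has no niche and 1-byte alignment,
// so rustc uses a 1-byte tag and the pointer-carrying variants get 7 bytes
// of padding before their payloads.
#[allow(dead_code)]
enum ReprBefore {
    Empty,
    Inline { len: u8, buf: [u8; 38] },
    StaticStr(&'static str),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
}

// "after": len is usize, which pushes every payload to an 8-byte boundary
// and lets the compiler widen the discriminant to fill the padding.
#[allow(dead_code)]
enum ReprAfter {
    Empty,
    Inline { len: usize, buf: [u8; 24] },
    StaticStr(&'static str),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
}

fn main() {
    // Both versions end up the same size and alignment; only the internal
    // discriminant/padding split differs.
    assert_eq!(size_of::<ReprBefore>(), 40);
    assert_eq!(size_of::<ReprAfter>(), 40);
    assert_eq!(align_of::<ReprBefore>(), 8);
    assert_eq!(align_of::<ReprAfter>(), 8);
}
```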
@Nilstrieb Hi, thanks very much for your investigation and explanation!
I wonder whether the alignment issue alone can cause the clone cost of the `Empty` variant to jump from 4ns to 40ns?
You posted this in at least 3 different places. It would be good to link to the others to avoid duplicated effort. users.rust-lang.org, reddit
Thanks very much! I have added these links to the description!
Here is the output with `read_volatile`: https://godbolt.org/z/rznoTfevb. Maybe it's due to the difference in the generated instructions.
Hm... these are the LLVM types of the two versions:

```
%"after::Repr" = type { i64, [4 x i64] }
%"before::Repr" = type { i8, [39 x i8] }
```

Maybe the layout of `before::Repr` is just too bad?
Given that you only have about six variants in the enum, and you need the discriminant for the enum anyway, why not just make ~24 additional versions of `Inline`, one for each possible length? You'd save a bunch of size this way as well, and I'm quite certain the result will be easier to optimize when the inline length is known at compile time.
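A rough sketch of that suggestion (names and variant set are hypothetical, and only a few lengths are shown): with one variant per inline length, the length is part of the discriminant, so it is a compile-time constant inside each match arm.

```rust
// Hypothetical sketch: encode the inline length in the discriminant itself,
// one variant per possible length (only a few shown for brevity).
#[allow(dead_code)]
enum Repr {
    Empty,
    Inline1([u8; 1]),
    Inline2([u8; 2]),
    Inline3([u8; 3]),
    // ... continuing up to the maximum inline capacity
    StaticStr(&'static str),
}

impl Repr {
    fn len(&self) -> usize {
        match self {
            Repr::Empty => 0,
            // In each arm the length is a constant known at compile time.
            Repr::Inline1(_) => 1,
            Repr::Inline2(_) => 2,
            Repr::Inline3(_) => 3,
            Repr::StaticStr(s) => s.len(),
        }
    }
}

fn main() {
    assert_eq!(Repr::Inline2(*b"hi").len(), 2);
    assert_eq!(Repr::Empty.len(), 0);
}
```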
Hi, I'm the author of the `FastStr` crate, and recently I found a weird problem: the clone cost of `FastStr` is really high. For example, cloning an empty `FastStr` costs about 40ns on amd64, compared to about 4ns for a normal `String`. `FastStr` itself is a newtype over the inner `Repr`, which previously had the following layout:

Playground link for old version
After some investigation, I found that this is because the `Repr::Inline` variant has a really big effect on the performance. After I added padding to the `Repr::Inline` variant (changing the type of `len` from `u8` to `usize`), the performance of cloning a `Repr::Empty` (and all the other variants) improved about 9x, from 40ns to 4ns. But the root cause is still not clear:

Playground link for new version
A simple criterion benchmark code for the old version:
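That snippet isn't reproduced here; as a rough stand-in, here is a std-only timing loop in the same spirit (the enum is a hypothetical stand-in, not the real `Repr`; the actual benchmark uses criterion and is linked below):

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical stand-in for FastStr's Repr, reduced to two variants.
#[derive(Clone)]
#[allow(dead_code)]
enum MiniRepr {
    Empty,
    ArcStr(std::sync::Arc<str>),
}

fn main() {
    let empty = MiniRepr::Empty;
    let n: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..n {
        // black_box keeps the clone from being optimized away entirely.
        black_box(empty.clone());
    }
    let per_clone_ns = start.elapsed().as_nanos() / n as u128;
    println!("~{per_clone_ns} ns per clone");
}
```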
For a full benchmark, you may refer to: https://github.com/volo-rs/faststr/blob/main/benches/faststr.rs
Related PR: https://github.com/volo-rs/faststr/pull/6 And commit: https://github.com/volo-rs/faststr/commit/342bdc95e6d4f599911ce9b5bc566d77b1ca75a7
Furthermore, I've tried the following methods, but none of them helps:

- Changed `INLINE_CAP` to 24
- Changed `INLINE_CAP` to 22 and added a padding field to the `Inline` variant: `Inline { _pad: u64, len: u8, buf: [u8; INLINE_CAP] }`
- Changed `INLINE_CAP` to 22 and added a new struct `Inline` without the `_pad` field

Changing `INLINE_CAP` to 22 is only to avoid increasing the size of `FastStr` itself when adding the extra padding, so it has nothing to do with the performance.

Edit: related discussions: users.rust-lang.org, reddit