rust-lang / rust

Empowering everyone to build reliable and efficient software.

Enum field alignment causes about 10x performance degradation #119247

Open PureWhiteWu opened 10 months ago

PureWhiteWu commented 10 months ago

Hi, I'm the author of the FastStr crate, and recently I found a weird problem: the clone cost of FastStr is really high. For example, cloning an empty FastStr costs about 40ns on amd64, compared to about 4ns for a normal String.

FastStr itself is a newtype over an inner Repr, which previously had the following layout:

use bytes::Bytes;
use std::sync::Arc;

#[derive(Clone)]
pub struct FastStr(Repr);

#[cfg(all(test, target_pointer_width = "64"))]
mod size_asserts {
    static_assertions::assert_eq_size!(super::FastStr, [u8; 40]); // 40 bytes
}

const INLINE_CAP: usize = 38;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

Playground link for old version

After some investigation, I found that the Repr::Inline variant has a really big effect on performance. After I added padding to the Repr::Inline variant (by changing the type of len from u8 to usize), the performance of cloning a Repr::Empty (and all other variants) improved about 9x, from 40ns to 4ns. But the root cause is still not clear:

const INLINE_CAP: usize = 24; // This is because I don't want to enlarge the size of FastStr

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: usize, buf: [u8; INLINE_CAP] },
}

Playground link for new version

A simple criterion benchmark for the old version:

use bytes::Bytes;
use std::sync::Arc;
use criterion::{black_box, criterion_group, criterion_main, Criterion};

const INLINE_CAP: usize = 38;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

fn criterion_benchmark(c: &mut Criterion) {
    let s = Repr::Empty;
    c.bench_function("empty repr", |b| b.iter(|| black_box(s.clone())));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

For a full benchmark, you may refer to: https://github.com/volo-rs/faststr/blob/main/benches/faststr.rs

Related PR: https://github.com/volo-rs/faststr/pull/6 And commit: https://github.com/volo-rs/faststr/commit/342bdc95e6d4f599911ce9b5bc566d77b1ca75a7

Furthermore, I've tried the following methods, but none of them helps:

  1. only changing INLINE_CAP to 24
  2. changing INLINE_CAP to 22 and adding padding to the Inline variant: Inline { _pad: u64, len: u8, buf: [u8; INLINE_CAP] } (see the sketch below)
  3. changing INLINE_CAP to 22 and adding a new Inline struct without the _pad field

Changing INLINE_CAP to 22 is only to avoid increasing the size of FastStr itself when the extra padding is added, so the performance difference has nothing to do with it.
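
To make attempt 2 concrete, here is a minimal sketch of the layout I tried (same variants as above, with the explicit padding field):

use bytes::Bytes;
use std::sync::Arc;

const INLINE_CAP: usize = 22;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    // Attempt 2: an explicit u64 padding field in front of the u8 length.
    Inline { _pad: u64, len: u8, buf: [u8; INLINE_CAP] },
}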

Edit: related discussions users.rust-lang.org, reddit

Noratrieb commented 10 months ago

Without looking at it in much detail, I suspect this is caused by niche optimizations being triggered, introducing more complicated branches. #[repr(C)] may help as well.
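
For reference, applying it would look something like this (just a sketch of the suggestion, not something benchmarked in this issue):

use bytes::Bytes;
use std::sync::Arc;

const INLINE_CAP: usize = 38;

// `#[repr(C)]` opts out of Rust's layout/niche optimizations and lays the enum
// out as a tag followed by a union of the variants.
#[repr(C)]
#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}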

PureWhiteWu commented 10 months ago

Without looking at it in much detail, I suspect this is caused by niche optimizations being triggered, introducing more complicated branches. #[repr(C)] may help as well.

Hi, thanks very much for your reply! Do you mean that the alignment chosen by the compiler may cause some optimizations to be disabled?

Noratrieb commented 10 months ago

I tried to minimize it into a godbolt: https://godbolt.org/z/zaeGe9Evh
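
Roughly, the minimization is just a standalone clone function for each layout, so the generated code can be compared (see the link for the exact source):

// One such function per layout (`before::Repr` and `after::Repr`); this
// sketch is just to show the shape of the comparison.
pub fn do_clone(r: &Repr) -> Repr {
    r.clone()
}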

Noratrieb commented 10 months ago

Cleaning up the assembly by hand:

before:

```asm
do_clone:
        movzx eax, byte ptr [rsi]
        lea rcx, [rip + .LJTI0_0]
        movsxd rax, dword ptr [rcx + 4*rax]
        add rax, rcx
        jmp rax
.Empty:
        xor eax, eax
        mov byte ptr [rdi], al
        mov rax, rdi
        ret
.Bytes:
        mov rax, qword ptr [rsi + 8]
        mov qword ptr [rdi + 8], rax
        mov al, 1
        mov byte ptr [rdi], al
        mov rax, rdi
        ret
.ArcStr:
        mov rax, qword ptr [rsi + 8]
        mov rcx, qword ptr [rsi + 16]
        lock inc qword ptr [rax]
        jle .LBB0_10
        mov qword ptr [rdi + 8], rax
        mov qword ptr [rdi + 16], rcx
        mov al, 2
        mov byte ptr [rdi], al
        mov rax, rdi
        ret
.ArcString:
        mov rax, qword ptr [rsi + 8]
        lock inc qword ptr [rax]
        jle .LBB0_10
        mov qword ptr [rdi + 8], rax
        mov al, 3
        mov byte ptr [rdi], al
        mov rax, rdi
        ret
.StaticStr:
        movups xmm0, xmmword ptr [rsi + 8]
        movups xmmword ptr [rdi + 8], xmm0
        mov al, 4
        mov byte ptr [rdi], al
        mov rax, rdi
        ret
.Inline:
        movzx eax, byte ptr [rsi + 1]
        mov rcx, qword ptr [rsi + 32]
        mov qword ptr [rdi + 32], rcx
        movups xmm0, xmmword ptr [rsi + 18]
        movups xmmword ptr [rdi + 18], xmm0
        movups xmm0, xmmword ptr [rsi + 2]
        movups xmmword ptr [rdi + 2], xmm0
        mov byte ptr [rdi + 1], al
        mov al, 5
        mov byte ptr [rdi], al
        mov rax, rdi
        ret
.unreachable:
        ud2
        ud2
.LJTI0_0:
        .long .LBB0_1-.LJTI0_0
        .long .LBB0_2-.LJTI0_0
        .long .LBB0_3-.LJTI0_0
        .long .LBB0_5-.LJTI0_0
        .long .LBB0_7-.LJTI0_0
        .long .LBB0_8-.LJTI0_0
```
after:

```asm
do_clone_after:
        mov rax, qword ptr [rsi]
        lea rcx, [rip + .LJTI1_0]
        movsxd rdx, dword ptr [rcx + 4*rax]
        add rdx, rcx
        jmp rdx
.Empty:
        mov qword ptr [rdi], rax
        mov rax, rdi
        ret
.Bytes:
        mov rcx, qword ptr [rsi + 8]
        mov qword ptr [rdi + 8], rcx
        mov qword ptr [rdi], rax
        mov rax, rdi
        ret
.ArcStr:
        mov rcx, qword ptr [rsi + 8]
        mov rdx, qword ptr [rsi + 16]
        lock inc qword ptr [rcx]
        jle .LBB1_5
        mov qword ptr [rdi + 8], rcx
        mov qword ptr [rdi + 16], rdx
.ArcString:
        mov rcx, qword ptr [rsi + 8]
        lock inc qword ptr [rcx]
        jle .LBB1_5
        mov qword ptr [rdi + 8], rcx
        mov qword ptr [rdi], rax
        mov rax, rdi
        ret
.StaticStr:
        movups xmm0, xmmword ptr [rsi + 8]
        movups xmmword ptr [rdi + 8], xmm0
        mov qword ptr [rdi], rax
        mov rax, rdi
        ret
.Inline:
        mov rcx, qword ptr [rsi + 8]
        mov rdx, qword ptr [rsi + 32]
        mov qword ptr [rdi + 32], rdx
        movups xmm0, xmmword ptr [rsi + 16]
        movups xmmword ptr [rdi + 16], xmm0
        mov qword ptr [rdi + 8], rcx
        mov qword ptr [rdi], rax
        mov rax, rdi
        ret
.unreachable:
        ud2
        ud2
.LJTI1_0:
        .long .LBB1_9-.LJTI1_0
        .long .LBB1_1-.LJTI1_0
        .long .LBB1_2-.LJTI1_0
        .long .LBB1_4-.LJTI1_0
        .long .LBB1_6-.LJTI1_0
        .long .LBB1_7-.LJTI1_0
```
Noratrieb commented 10 months ago

Output of -Zprint-type-sizes:

print-type-size type: `before::Repr`: 40 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `Inline`: 39 bytes
print-type-size         field `.len`: 1 bytes
print-type-size         field `.buf`: 38 bytes
print-type-size     variant `ArcStr`: 23 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 16 bytes, alignment: 8 bytes
print-type-size     variant `StaticStr`: 23 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 16 bytes, alignment: 8 bytes
print-type-size     variant `Bytes`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `ArcString`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `Empty`: 0 bytes
print-type-size type: `after::Repr`: 40 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Inline`: 32 bytes
print-type-size         field `.len`: 8 bytes
print-type-size         field `.buf`: 24 bytes
print-type-size     variant `ArcStr`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `StaticStr`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `Bytes`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `ArcString`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Empty`: 0 bytes

before gets a lot more padding, which is probably correlated with the more verbose assembly.

Noratrieb commented 10 months ago

Looking at the assembly more, I think the issue here comes from the alignment allowing the discriminant to be bigger. The bigger discriminant causes less code to be emitted because it doesn't have to bother carefully setting just one byte to zero; it can just write back the discriminant that it read. Even though it would be allowed to write a full 8-byte discriminant for every variant except Inline, it never does.
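
If that's the case, another thing that might be worth trying (untested here) is forcing an 8-byte discriminant directly with a primitive representation, instead of widening len:

use bytes::Bytes;
use std::sync::Arc;

const INLINE_CAP: usize = 38;

// `#[repr(u64)]` forces the discriminant to be a u64 (8 bytes). Note that with
// INLINE_CAP = 38 this grows the type past 40 bytes, so the inline capacity
// would have to shrink to keep the overall size unchanged.
#[repr(u64)]
#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}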

PureWhiteWu commented 10 months ago

@Nilstrieb Hi, thanks very much for your investigation and explanation! I wonder, can the alignment issue alone really cause the Empty variant's clone cost to jump from 4ns to 40ns?

the8472 commented 10 months ago

You posted this in at least 3 different places. It would be good to link to the others to avoid duplicated effort. users.rust-lang.org, reddit

PureWhiteWu commented 10 months ago

You posted this in at least 3 different places. It would be good to link to the others to avoid duplicated effort. users.rust-lang.org, reddit

Thanks very much! I have added these links to the description!

LYF1999 commented 10 months ago

Here is the output when using read_volatile: https://godbolt.org/z/rznoTfevb. Maybe it's because of the difference in instructions.

LYF1999 commented 10 months ago

Hm... these are the LLVM types of the two versions:

%"after::Repr" = type { i64, [4 x i64] }
%"before::Repr" = type { i8, [39 x i8] }

Maybe the layout of before::Repr is just bad?

alexpyattaev commented 8 months ago

Given that you only have about 6 variants in the enum, and you need the discriminant anyway, why not just make ~24 additional versions of Inline, one for each possible length? You'd save a bunch of size this way as well, and I'm quite certain the result will be easier to optimize when the inline length is known at compile time.
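
To make the idea concrete, a rough sketch (variant names are just illustrative):

use std::sync::Arc;

// Encode the inline length in the discriminant itself instead of in a
// separate `len` field. Only a few lengths are written out here; the real
// thing would have one variant per supported length, up to the inline cap.
#[derive(Clone)]
enum Repr {
    Empty,
    ArcStr(Arc<str>),
    StaticStr(&'static str),
    Inline1([u8; 1]),
    Inline2([u8; 2]),
    Inline3([u8; 3]),
    // ... Inline24([u8; 24])
}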