rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.34k stars 12.72k forks source link

`String::push` is slow #116235

Open lincot opened 1 year ago

lincot commented 1 year ago

An alternative implementation of String::push, which reserves space if needed and then performs push_unchecked, gives a significant performance improvement.

current push:

test bench_push_1_byte  ... bench:      10,927 ns/iter (+/- 147) = 915 MB/s
test bench_push_2_bytes ... bench:      49,662 ns/iter (+/- 373) = 402 MB/s
test bench_push_3_bytes ... bench:      49,780 ns/iter (+/- 102) = 602 MB/s
test bench_push_4_bytes ... bench:      40,556 ns/iter (+/- 349) = 986 MB/s

new push:

test bench_push_1_byte  ... bench:      10,366 ns/iter (+/- 155) = 964 MB/s
test bench_push_2_bytes ... bench:      14,446 ns/iter (+/- 137) = 1384 MB/s
test bench_push_3_bytes ... bench:      19,238 ns/iter (+/- 265) = 1559 MB/s
test bench_push_4_bytes ... bench:      23,922 ns/iter (+/- 213) = 1672 MB/s
bench.rs ```rust #![no_std] #![feature(test)] extern crate alloc; extern crate test; use alloc::string::String; use test::{black_box, Bencher}; // #[inline] // fn push(s: &mut String, ch: char) { // s.push(ch); // } #[inline] fn push(s: &mut String, ch: char) { if s.capacity() - s.len() < ch.len_utf8() { s.reserve(ch.len_utf8()); } unsafe { s.push_unchecked(ch) }; } trait PushUnchecked { unsafe fn push_unchecked(&mut self, value: T); } impl PushUnchecked for String { #[inline] unsafe fn push_unchecked(&mut self, ch: char) { let len = self.len(); let ptr = self.as_mut_vec().as_mut_ptr().add(len); let count = ch.len_utf8(); debug_assert!(len + count <= self.capacity()); match count { 1 => { core::ptr::write(ptr, ch as u8); } 2 => { core::ptr::write(ptr, (ch as u32 >> 6 & 0x1F) as u8 | 0b1100_0000); core::ptr::write(ptr.add(1), (ch as u32 & 0x3F) as u8 | 0b1000_0000); } 3 => { core::ptr::write(ptr, (ch as u32 >> 12 & 0x0F) as u8 | 0b1110_0000); core::ptr::write(ptr.add(1), (ch as u32 >> 6 & 0x3F) as u8 | 0b1000_0000); core::ptr::write(ptr.add(2), (ch as u32 & 0x3F) as u8 | 0b1000_0000); } 4 => { core::ptr::write(ptr, (ch as u32 >> 18 & 0x07) as u8 | 0b1111_0000); core::ptr::write(ptr.add(1), (ch as u32 >> 12 & 0x3F) as u8 | 0b1000_0000); core::ptr::write(ptr.add(2), (ch as u32 >> 6 & 0x3F) as u8 | 0b1000_0000); core::ptr::write(ptr.add(3), (ch as u32 & 0x3F) as u8 | 0b1000_0000); } _ => core::hint::unreachable_unchecked(), } self.as_mut_vec().set_len(len + count); } } const ITERATIONS: u64 = 10_000; #[bench] fn bench_push_1_byte(bencher: &mut Bencher) { const CHAR: char = '0'; assert_eq!(CHAR.len_utf8(), 1); bencher.bytes = ITERATIONS; bencher.iter(|| { let mut s = String::new(); for _ in 0..black_box(ITERATIONS) { push(&mut s, black_box(CHAR)); } s }); } #[bench] fn bench_push_2_bytes(bencher: &mut Bencher) { const CHAR: char = 'д'; assert_eq!(CHAR.len_utf8(), 2); bencher.bytes = 2 * ITERATIONS; bencher.iter(|| { let mut s = String::new(); for _ in 0..black_box(ITERATIONS) { push(&mut s, black_box(CHAR)); } s }); } #[bench] fn bench_push_3_bytes(bencher: &mut Bencher) { const CHAR: char = '❗'; assert_eq!(CHAR.len_utf8(), 3); bencher.bytes = 3 * ITERATIONS; bencher.iter(|| { let mut s = String::new(); for _ in 0..black_box(ITERATIONS) { push(&mut s, black_box(CHAR)); } s }); } #[bench] fn bench_push_4_bytes(bencher: &mut Bencher) { const CHAR: char = '🤨'; assert_eq!(CHAR.len_utf8(), 4); bencher.bytes = 4 * ITERATIONS; bencher.iter(|| { let mut s = String::new(); for _ in 0..black_box(ITERATIONS) { push(&mut s, black_box(CHAR)); } s }); } ```

It's worth noting that the current String::push implementation encodes characters into a temporary zero-initialized buffer. The zeroing alone is an unnecessary instruction, but this method is used in several places, notably String::insert and the default Write::write_char.

the8472 commented 1 year ago

Since you already wrote the code, make a PR? Though you might want to look into using vec.spare_capacity_mut() + slice::get_unchecked, which will add some debug asserts compared to the using direct pointer writes.

lincot commented 1 year ago

@the8472 this way? AFAIK get_unchecked_mut doesn't perform debug assertions. We already have a debug_assert! for capacity, though. As to the PR, do we want the new push_unchecked function duplicating char::encode_utf8 code?

the8472 commented 1 year ago

the debug asserts

https://github.com/rust-lang/rust/blob/42faef503f3e765120ca0ef06991337668eafc32/library/core/src/slice/index.rs#L237-L240

the8472 commented 1 year ago

I have no opinion on push_unchecked, you can just make it a private function if you don't have a reason to expose it.

lincot commented 1 year ago

@the8472 it seems that the debug assert only applies if you compile core from source.

As for push_unchecked, I would like to see it in the library as I commonly use it after initializing strings with capacity.

I managed to make it identical to the get_unchecked_mut approach by reusing char::encode_utf8 with a hint.

the8472 commented 1 year ago

If you want to add API you may want to open an ACP first. The internal optimization only needs a PR.

lincot commented 1 year ago

Although the get_unchecked_mut and char::encode_utf8 push_unchecked implementations give identical results when not inlined and with String::push, I have found that they behave very differently in other applications, the latter being slower. Also, the hint is too hacky.

Another solution is to add a new function encode_utf8_raw_unchecked to be called by push_unchecked and encode_utf8_raw. It is identical to the first approach in both applications and is almost the same for char::encode_utf8.

lincot commented 1 year ago

Another thing to note is that this version, unlike the current one, is unable to elide the capacity check in a simple ASCII push loop with a pre-allocated string. Both versions fail on non-ASCII and on long strings, though.

It seems that the current version is able to elide the check because it uses RawVec::reserve_for_push (which has #[inline(never)]) instead of the usual RawVec::reserve in the case of ASCII input. Interestingly, if we replace the reserve call with a panic, it gets elided even in the case of long non-ASCII strings.

If we put the reserve call behind #[inline(never)], it works for a short non-ASCII string, but not for a long one. RawVec::reserve itself performs a check and calls a cold reallocation function when needed. Locally I observed that if #[inline(never)] is applied instead of #[cold] or together with it, it only has a partial effect similar to putting String::reserve behind #[inline(never)].

Despite the issue, my version still outperforms the original in a benchmark with pre-allocated strings.

current push:

test bench_push_1_byte_prealloc           ... bench:      10,170 ns/iter (+/- 85) = 983 MB/s
test bench_push_1_byte_prealloc_blackbox  ... bench:      11,129 ns/iter (+/- 46) = 898 MB/s
test bench_push_2_bytes_prealloc          ... bench:       4,742 ns/iter (+/- 5) = 4217 MB/s
test bench_push_2_bytes_prealloc_blackbox ... bench:      49,587 ns/iter (+/- 41) = 403 MB/s
test bench_push_3_bytes_prealloc          ... bench:       4,744 ns/iter (+/- 5) = 6323 MB/s
test bench_push_3_bytes_prealloc_blackbox ... bench:      49,675 ns/iter (+/- 41) = 603 MB/s
test bench_push_4_bytes_prealloc          ... bench:       4,745 ns/iter (+/- 5) = 8429 MB/s
test bench_push_4_bytes_prealloc_blackbox ... bench:      42,509 ns/iter (+/- 29) = 940 MB/s

new push:

test bench_push_1_byte_prealloc           ... bench:       4,753 ns/iter (+/- 51) = 2103 MB/s
test bench_push_1_byte_prealloc_blackbox  ... bench:      12,419 ns/iter (+/- 83) = 805 MB/s
test bench_push_2_bytes_prealloc          ... bench:       4,750 ns/iter (+/- 27) = 4210 MB/s
test bench_push_2_bytes_prealloc_blackbox ... bench:      11,863 ns/iter (+/- 86) = 1685 MB/s
test bench_push_3_bytes_prealloc          ... bench:       4,748 ns/iter (+/- 26) = 6318 MB/s
test bench_push_3_bytes_prealloc_blackbox ... bench:      16,597 ns/iter (+/- 86) = 1807 MB/s
test bench_push_4_bytes_prealloc          ... bench:       4,751 ns/iter (+/- 27) = 8419 MB/s
test bench_push_4_bytes_prealloc_blackbox ... bench:      21,330 ns/iter (+/- 79) = 1875 MB/s

(you can see a small regression in bench_push_1_byte_prealloc_blackbox, it doesn't happen if the code isn't compiled for the native CPU or if std is compiled from source)

however, both versions work much worse than push_unchecked:

test bench_push_1_byte_prealloc           ... bench:         118 ns/iter (+/- 1) = 84745 MB/s
test bench_push_1_byte_prealloc_blackbox  ... bench:      10,041 ns/iter (+/- 18) = 995 MB/s
test bench_push_2_bytes_prealloc          ... bench:         328 ns/iter (+/- 0) = 60975 MB/s
test bench_push_2_bytes_prealloc_blackbox ... bench:       8,292 ns/iter (+/- 26) = 2411 MB/s
test bench_push_3_bytes_prealloc          ... bench:         491 ns/iter (+/- 2) = 61099 MB/s
test bench_push_3_bytes_prealloc_blackbox ... bench:      11,849 ns/iter (+/- 35) = 2531 MB/s
test bench_push_4_bytes_prealloc          ... bench:         689 ns/iter (+/- 4) = 58055 MB/s
test bench_push_4_bytes_prealloc_blackbox ... bench:      16,566 ns/iter (+/- 33) = 2414 MB/s
bench.rs ```rust #![no_std] #![feature(test)] extern crate alloc; extern crate test; use alloc::string::String; use test::{black_box, Bencher}; // #[inline] // fn push(s: &mut String, ch: char) { // s.push(ch); // } #[inline] fn push(s: &mut String, ch: char) { s.reserve(ch.len_utf8()); unsafe { s.push_unchecked(ch) }; } const TAG_CONT: u8 = 0b1000_0000; const TAG_TWO_B: u8 = 0b1100_0000; const TAG_THREE_B: u8 = 0b1110_0000; const TAG_FOUR_B: u8 = 0b1111_0000; const MAX_ONE_B: u32 = 0x80; const MAX_TWO_B: u32 = 0x800; const MAX_THREE_B: u32 = 0x10000; #[inline] const fn len_utf8(code: u32) -> usize { if code < MAX_ONE_B { 1 } else if code < MAX_TWO_B { 2 } else if code < MAX_THREE_B { 3 } else { 4 } } #[inline] unsafe fn encode_utf8_raw_unchecked(code: u32, dst: &mut [u8]) -> &mut [u8] { let len = len_utf8(code); match len { 1 => { *dst.get_unchecked_mut(0) = code as u8; } 2 => { *dst.get_unchecked_mut(0) = (code >> 6 & 0x1F) as u8 | TAG_TWO_B; *dst.get_unchecked_mut(1) = (code & 0x3F) as u8 | TAG_CONT; } 3 => { *dst.get_unchecked_mut(0) = (code >> 12 & 0x0F) as u8 | TAG_THREE_B; *dst.get_unchecked_mut(1) = (code >> 6 & 0x3F) as u8 | TAG_CONT; *dst.get_unchecked_mut(2) = (code & 0x3F) as u8 | TAG_CONT; } 4 => { *dst.get_unchecked_mut(0) = (code >> 18 & 0x07) as u8 | TAG_FOUR_B; *dst.get_unchecked_mut(1) = (code >> 12 & 0x3F) as u8 | TAG_CONT; *dst.get_unchecked_mut(2) = (code >> 6 & 0x3F) as u8 | TAG_CONT; *dst.get_unchecked_mut(3) = (code & 0x3F) as u8 | TAG_CONT; } _ => core::hint::unreachable_unchecked(), }; dst.get_unchecked_mut(..len) } trait PushUnchecked { unsafe fn push_unchecked(&mut self, value: T); } impl PushUnchecked for String { #[inline] unsafe fn push_unchecked(&mut self, ch: char) { let len = self.len(); let count = ch.len_utf8(); let spare = core::slice::from_raw_parts_mut( self.as_mut_vec().as_mut_ptr().add(len), self.capacity() - len, ); encode_utf8_raw_unchecked(ch as u32, spare); self.as_mut_vec().set_len(len + count); } } const ITERATIONS: u64 = 10_000; #[bench] fn bench_push_1_byte_prealloc(bencher: &mut Bencher) { const CHAR: char = '0'; assert_eq!(CHAR.len_utf8(), 1); bencher.bytes = ITERATIONS; bencher.iter(|| { let mut s = String::with_capacity(ITERATIONS as _); for _ in 0..ITERATIONS { push(&mut s, CHAR); } s // vec![CHAR; ITERATIONS as _] }); } #[bench] fn bench_push_2_bytes_prealloc(bencher: &mut Bencher) { const CHAR: char = 'д'; assert_eq!(CHAR.len_utf8(), 2); bencher.bytes = 2 * ITERATIONS; bencher.iter(|| { let mut s = String::with_capacity((2 * ITERATIONS) as _); for _ in 0..ITERATIONS { push(&mut s, CHAR); } s // vec![CHAR; ITERATIONS as _] }); } #[bench] fn bench_push_3_bytes_prealloc(bencher: &mut Bencher) { const CHAR: char = '❗'; assert_eq!(CHAR.len_utf8(), 3); bencher.bytes = 3 * ITERATIONS; bencher.iter(|| { let mut s = String::with_capacity((3 * ITERATIONS) as _); for _ in 0..ITERATIONS { push(&mut s, CHAR); } s // vec![CHAR; ITERATIONS as _] }); } #[bench] fn bench_push_4_bytes_prealloc(bencher: &mut Bencher) { const CHAR: char = '🤨'; assert_eq!(CHAR.len_utf8(), 4); bencher.bytes = 4 * ITERATIONS; bencher.iter(|| { let mut s = String::with_capacity((4 * ITERATIONS) as _); for _ in 0..ITERATIONS { push(&mut s, CHAR); } s // vec![CHAR; ITERATIONS as _] }); } #[bench] fn bench_push_1_byte_prealloc_blackbox(bencher: &mut Bencher) { const CHAR: char = '0'; assert_eq!(CHAR.len_utf8(), 1); bencher.bytes = ITERATIONS; bencher.iter(|| { let mut s = String::with_capacity(ITERATIONS as _); for _ in 0..black_box(ITERATIONS) { push(&mut s, black_box(CHAR)); } s }); } #[bench] fn bench_push_2_bytes_prealloc_blackbox(bencher: &mut Bencher) { const CHAR: char = 'д'; assert_eq!(CHAR.len_utf8(), 2); bencher.bytes = 2 * ITERATIONS; bencher.iter(|| { let mut s = String::with_capacity((2 * ITERATIONS) as _); for _ in 0..black_box(ITERATIONS) { push(&mut s, black_box(CHAR)); } s }); } #[bench] fn bench_push_3_bytes_prealloc_blackbox(bencher: &mut Bencher) { const CHAR: char = '❗'; assert_eq!(CHAR.len_utf8(), 3); bencher.bytes = 3 * ITERATIONS; bencher.iter(|| { let mut s = String::with_capacity((3 * ITERATIONS) as _); for _ in 0..black_box(ITERATIONS) { push(&mut s, black_box(CHAR)); } s }); } #[bench] fn bench_push_4_bytes_prealloc_blackbox(bencher: &mut Bencher) { const CHAR: char = '🤨'; assert_eq!(CHAR.len_utf8(), 4); bencher.bytes = 4 * ITERATIONS; bencher.iter(|| { let mut s = String::with_capacity((4 * ITERATIONS) as _); for _ in 0..black_box(ITERATIONS) { push(&mut s, black_box(CHAR)); } s }); } ```