`collect()` will call `Vec::from_iter`, which in turn will call the private function `SpecExtend::from_iter`. The trait `SpecExtend` is specialized for several situations:

1. the generic fallback
2. `TrustedLen` (← applies here)
3. `vec::IntoIter`
4. `&T` / `slice::Iter`

The interesting thing is that once the specialization for `TrustedLen` is removed (falling back to case 1), the speed of `u16` improves from ~110µs/iter to ~70µs/iter (still much slower than the other two implementations). The `TrustedLen` specialization is still needed to keep `u32` and `u64` fast.
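As a side note, here is a minimal sketch of how this kind of specialization dispatch works, with a stand-in `KnownLen` marker in place of `TrustedLen` so it is self-contained. This is not the real libcore code, and it needs a nightly compiler:

```rust
#![feature(specialization)] // nightly only

// Minimal sketch of SpecExtend-style dispatch; not the real libcore code.
// `KnownLen` is a stand-in for `TrustedLen`.
trait KnownLen {}
impl KnownLen for std::ops::Range<u16> {}

trait SpecCollect<T>: Iterator<Item = T> + Sized {
    fn spec_collect(self) -> Vec<T>;
}

// Case 1: the generic fallback, which grows the Vec as elements arrive.
impl<T, I: Iterator<Item = T>> SpecCollect<T> for I {
    default fn spec_collect(self) -> Vec<T> {
        let mut v = Vec::new();
        v.extend(self);
        v
    }
}

// Case 2: the length is trusted, so allocate exactly once up front.
impl<T, I: Iterator<Item = T> + KnownLen> SpecCollect<T> for I {
    fn spec_collect(self) -> Vec<T> {
        let (_, upper) = self.size_hint();
        let mut v = Vec::with_capacity(upper.unwrap_or(0));
        v.extend(self);
        v
    }
}

fn main() {
    let v = (0u16..10).spec_collect(); // resolves to case 2
    assert_eq!(v.len(), 10);
}
```

The full benchmark numbers: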
| | u16 | u32 | u64 |
| --- | --- | --- | --- |
| collect + TrustedLen | ~110µs/iter | ~9µs/iter | ~24µs/iter |
| collect + Generic | ~70µs/iter | ~70µs/iter | ~44µs/iter |
| manual | ~9µs/iter | ~18µs/iter | ~42µs/iter |
| unsafe | ~4µs/iter | ~9µs/iter | ~24µs/iter |
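For context, the compared variants are plausibly of this shape. This is a sketch, not the exact benchmark code from the is.gd link quoted later in the thread; the function names match the benchmark output below, and `SIZE = 32767` is inferred from the loop bound visible in the LLVM IR:

```rust
const SIZE: u16 = 32767; // assumed, matching the bound seen in the IR below

// "collect": goes through Vec::from_iter / SpecExtend.
pub fn using_collect() -> Vec<u16> {
    (0..SIZE).collect()
}

// "manual": zero-fill first, then overwrite with bounds-checked indexing.
pub fn using_manual() -> Vec<u16> {
    let mut v = vec![0u16; SIZE as usize];
    for n in 0..SIZE {
        v[n as usize] = n;
    }
    v
}

// "unsafe": reserve capacity, write through a raw pointer, set the length.
pub fn using_unsafe() -> Vec<u16> {
    let mut v = Vec::with_capacity(SIZE as usize);
    unsafe {
        let ptr = v.as_mut_ptr();
        for n in 0..SIZE {
            *ptr.add(n as usize) = n;
        }
        v.set_len(SIZE as usize);
    }
    v
}
```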
Update: It seems the slowness comes from the loop itself:

```rust
for element in iterator {
    ...
}
```
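That loop sits inside the `TrustedLen` path of `SpecExtend::from_iter`. A self-contained analog of that path might look like this (a paraphrase, not the real libcore code, which additionally guards the length update against panics with a `SetLenOnDrop` helper):

```rust
// Reserve once up front, then write each element through a raw pointer.
fn extend_trusted<T, I: Iterator<Item = T>>(v: &mut Vec<T>, iterator: I) {
    let (_, upper) = iterator.size_hint();
    let additional = upper.expect("a TrustedLen iterator reports an exact upper bound");
    v.reserve(additional);
    unsafe {
        let mut ptr = v.as_mut_ptr().add(v.len());
        let mut len = v.len();
        // This is the loop whose exit condition LLVM fails to simplify for u16.
        for element in iterator {
            ptr.write(element);
            ptr = ptr.add(1);
            len += 1;
        }
        v.set_len(len);
    }
}
```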
Looking at the assembly output for this on stable (1.18), the difference seems to be mostly that the `collect` case is not vectorized.

On nightly, things are slightly messier. The function is vectorized, but there is a lot of other redundant code that would ideally be optimised out. There is a call to `alloc::allocator::Layout::repeat` to calculate the size of the needed allocation, which shouldn't be needed for primitive types, plus checks that the call returned something valid. There is also some exception landing pad stuff (`gcc_except_table`). That probably doesn't help speed-wise.
The manual version's assembly is much cleaner.
EDIT: As mentioned by kennytm in #43127, it looks like an inlining issue.
Did some more digging into this. `alloc::allocator::Layout::repeat` seems to be inlined properly on the latest nightly (presumably due to #43513), although it doesn't seem to be the main issue here.

It seems LLVM is unable to vectorize the loop for the `u16` case, but it can for the `u32` case, which probably explains the speed difference. I also found that there is a similar slowdown of the `collect` variant compared to doing it manually for `u64` when compiling for i686, but not for x86-64.
This code also gives the same complaint about the loop exit count as using `collect` (interestingly, it is still 4-5 times faster than `collect` in the benchmark):
```rust
// SIZE and T are assumptions added so the snippet stands alone;
// the original presumably defined them elsewhere.
const SIZE: u16 = 32767;
type T = u16;

pub fn create_with_range() -> Vec<T> {
    let mut arr = vec![0; SIZE as usize];
    for n in 0..SIZE {
        unsafe {
            *arr.get_unchecked_mut(n as usize) = n;
        }
    }
    arr
}
```
Another observation: according to this comment, that PR would likely have solved the issue, as `TryFrom` should now inline properly as a result of #43248. EDIT: Probably due to the added overflow check.
I managed to track down the main issue. Comparing the LLVM IR with optimisations on but vectorisation disabled, for `u32`, LLVM is somehow able to deduce that when iterating through the range, `range.start < range.end` can be simplified to `range.start != range.end`:
```llvm
...
%17 = getelementptr inbounds i32, i32* %ptr.0148.i.i.i.i, i64 6
%iter.sroa.0.1.i.i.i.i.6 = add nsw i32 %iter.sroa.0.0147.i.i.i.i, 7
store i32 %iter.sroa.0.1.i.i.i.i.5, i32* %17, align 4, !noalias !4
%18 = getelementptr inbounds i32, i32* %ptr.0148.i.i.i.i, i64 7
; Comparison using equality instruction:
%exitcond.i.i.i.6 = icmp eq i32 %iter.sroa.0.1.i.i.i.i.6, 32767
br i1 %exitcond.i.i.i.6, label %_ZN4core4iter8iterator8Iterator7collect17hc35826f1180bb746E.exit, label %bb35.i.i.i.i

_ZN4core4iter8iterator8Iterator7collect17hc35826f1180bb746E.exit: ; preds = %bb35.i.i.i.i
...
```
This doesn't seem to happen with `u16`:
```llvm
...
store i16 %iter.sroa.0.0.152.i.i.i.i, i16* %ptr.0150.i.i.i.i, align 2, !noalias !4
%12 = getelementptr inbounds i16, i16* %ptr.0150.i.i.i.i, i64 1
%13 = add i64 %local_len.sroa.5.0149.i.i.i.i, 1
; Comparison using unsigned less-than instruction:
%14 = icmp ult i16 %.iter.sroa.0.0151.i.i.i.i, 32767
%15 = zext i1 %14 to i16
%.iter.sroa.0.0.i.i.i.i = add i16 %15, %.iter.sroa.0.0151.i.i.i.i
br i1 %14, label %bb35.i.i.i.i, label %_ZN4core4iter8iterator8Iterator7collect17h730733b2264e9b53E.exit
...
```
I first tried to fix this by changing the implementation of `Iter::next` to simply use `!=` instead of `<`, though this made the compiler segfault when trying to compile the stage1 libcore. I don't know if this is a bug somewhere, or if there is something that relies on `next` using `<`.
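For reference, the range iterator of that era was shaped roughly like this (paraphrased into a self-contained form; not the verbatim libcore source):

```rust
use std::mem;

struct SimpleRange {
    start: u16,
    end: u16,
}

impl Iterator for SimpleRange {
    type Item = u16;

    #[inline]
    fn next(&mut self) -> Option<u16> {
        // The `<` here is the comparison LLVM must turn into `!=`
        // before it can vectorize the u16 loop.
        if self.start < self.end {
            let mut n = self.start + 1; // add_one() in libcore
            mem::swap(&mut n, &mut self.start);
            Some(n)
        } else {
            None
        }
    }
}
```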
What did help was to change the impl of `add_one()` to the following (presumably the explicit overflow check lets LLVM prove the counter never wraps, so the `<` exit test can be rewritten as `!=`):
```rust
#[inline]
fn add_one(&self) -> Self {
    self.checked_add(1).expect("Overflow in step!")
}
```
This gives me benchmark results of:
```
running 3 tests
test using_collect ... bench: 4,219 ns/iter (+/- 245)
test using_manual ... bench: 7,045 ns/iter (+/- 122)
test using_unsafe ... bench: 3,550 ns/iter (+/- 52)
```
`collect` is now only slightly slower than the manual implementation using unsafe. (`u32` gives the same results as before, i.e. the same as the manual implementation using unsafe.)

There are still some differences between the code generated using unsafe and using `collect` for `u16`; for instance, `collect` adds some unwinding code, and the generated SIMD is a bit different, but it's still a huge improvement.
I think this is solved now with #47944, due to `impl<I: TrustedLen> TrustedLen for Take<I>` and `TrustedLen` being implemented for `Range`. At first sight, the assembly of all three functions looks essentially identical, and all are vectorized very well: https://godbolt.org/g/yvTBoE
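For illustration, given those impls, both of the following should now hit the `TrustedLen` fast path (the sizes are arbitrary):

```rust
fn main() {
    // TrustedLen for Range<u16>:
    let a: Vec<u16> = (0u16..32767).collect();
    // TrustedLen for Take<Range<u16>>, via the impl quoted above:
    let b: Vec<u16> = (0u16..32767).take(1000).collect();
    assert_eq!((a.len(), b.len()), (32767, 1000));
}
```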
I think it might be worth performing those benches again on the latest nightly.
I re-ran the benchmarks and the results seem identical between using `collect` and the unsafe manual version, so it seems this is fixed now (at least on x86-64 nightly).
Using code like this: https://is.gd/nkoecB

The version using `collect` is significantly slower than creating a vec of 0-values and setting the values manually. On the other hand, using `u32` instead with the same code, `collect` is much better. Same with `u64`.

I suspect this may be SIMD-related. Will see if there are similar results on stable.