chore: speed up `read_byte` and `read_extern_fn_prototype`

yjhmelody commented 9 months ago

Accessing an array with a range index is more expensive than accessing it with a position index.

koute commented 9 months ago

Thanks for the PR!

So a few points:

The speed of the code in `elf.rs` doesn't matter, as it's not used by the VM itself, so there's little point in optimizing it. That should be reverted. (:
The `unsafe` you introduced should be completely unnecessary, as the compiler should be able to figure that out by itself, so that should be reverted too.
The only part which might make sense is the read_byte, if it speeds things up. (: So have you actually measured a difference?

yjhmelody commented 9 months ago

> The only part which might make sense is the read_byte, if it speeds things up. (: So have you actually measured a difference?

I did not measure it. I'm just going by the std source:

unsafe impl<T> SliceIndex<[T]> for ops::Range<usize> {
    type Output = [T];

    #[inline]
    fn get(self, slice: &[T]) -> Option<&[T]> {
        if self.start > self.end || self.end > slice.len() {
            None
        } else {
            // SAFETY: `self` is checked to be valid and in bounds above.
            unsafe { Some(&*self.get_unchecked(slice)) }
        }
    }
//...
}

unsafe impl<T> SliceIndex<[T]> for usize {
    type Output = T;

    #[inline]
    fn get(self, slice: &[T]) -> Option<&T> {
        // SAFETY: `self` is checked to be in bounds.
        if self < slice.len() { unsafe { Some(&*self.get_unchecked(slice)) } } else { None }
    }
//...
}

In most cases, indexing with a Range needs to check two bounds, while indexing with a usize checks only one. Maybe the compiler can optimize away the start check, but I generally don't assume that.
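To make the comparison concrete, here is a minimal sketch of the two styles. The function names and the `&mut &[u8]` cursor are hypothetical and are not taken from the PR; the sketch only shows that both forms behave identically, so any difference would be in the generated code and would have to be measured.

```rust
/// Hypothetical reader: fetches the first byte through a range index.
/// `get(0..1)` goes through the `Range<usize>` impl quoted above, which
/// checks both `start > end` and `end > len`.
fn read_byte_range(input: &mut &[u8]) -> Option<u8> {
    let head = input.get(0..1)?;
    let byte = head[0];
    *input = &input[1..];
    Some(byte)
}

/// Hypothetical reader: fetches the first byte through a position index.
/// `get(0)` goes through the `usize` impl, which checks only `index < len`.
fn read_byte_index(input: &mut &[u8]) -> Option<u8> {
    let byte = *input.get(0)?;
    *input = &input[1..];
    Some(byte)
}
```

Both functions return `None` on an empty slice and otherwise advance the cursor by one byte.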

yjhmelody commented 9 months ago

I wonder if the compiler will unroll the loop `for nth_arg in 0..arg_count`.

koute commented 9 months ago

In general if the compiler knows that a slice is e.g. at least 1 element long then doing slice[0] will not have a bounds check anymore.
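A small illustration of that point, using a hypothetical helper that is not part of the PR. Whether the checks are actually removed depends on the optimizer and build settings, so treat this as a sketch of the pattern rather than a guarantee.

```rust
/// Hypothetical helper: sums the first two bytes, or returns `None`
/// if the slice is too short.
fn sum_first_two(bytes: &[u8]) -> Option<u16> {
    if bytes.len() < 2 {
        return None;
    }
    // The length was checked above, so the optimizer can usually prove
    // that both indices are in bounds and omit the per-access checks.
    Some(u16::from(bytes[0]) + u16::from(bytes[1]))
}
```

No `unsafe` is needed here: the explicit length check gives the compiler the information it needs.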

Anyway, now we do have compilation time benchmarks (see tools/benchtool) so any potential optimization will have to show that it's actually faster to get merged.
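For a quick local sanity check before running the real benchmarks, a rough timing harness along these lines could be used. This is only a sketch built on the standard library; it is not how tools/benchtool works, and all names in it are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical harness: runs `f` `iterations` times and returns the
/// elapsed time plus a checksum so the work cannot be optimized away.
fn time_it<F: FnMut() -> u64>(mut f: F, iterations: u32) -> (Duration, u64) {
    let start = Instant::now();
    let mut checksum = 0u64;
    for _ in 0..iterations {
        checksum = checksum.wrapping_add(black_box(f()));
    }
    (start.elapsed(), checksum)
}

/// Sums the bytes using a position index for each access.
fn sum_by_index(data: &[u8]) -> u64 {
    let mut total = 0u64;
    for i in 0..data.len() {
        total += u64::from(data[i]);
    }
    total
}

/// Sums the bytes using a one-element range index for each access.
fn sum_by_range(data: &[u8]) -> u64 {
    let mut total = 0u64;
    for i in 0..data.len() {
        total += u64::from(data[i..i + 1][0]);
    }
    total
}
```

Comparing the two durations in a release build gives a first hint, but single-run timings like this are noisy and do not replace a proper benchmark.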

koute commented 9 months ago

Okay, so I'll close this for now. Feel free to reopen and/or make another PR if you can speed things up and can show the speedup in the benchmarks. (:

koute / polkavm

chore: speed up `read_byte` and `read_extern_fn_prototype` #67