radixdlt / radixdlt-scrypto

Scrypto is the asset-oriented smart contract programming language of the Radix network. It allows you to quickly build secure and composable dApps.
https://developers.radixdlt.com/

Slow scrypto_decode::<ScryptoValue> for wasm source code #1708

Open bbarwik opened 7 months ago

bbarwik commented 7 months ago

Hey,

I was experimenting with fuzzing but ran into a problem: it was slow. I spent some time tracking down the cause, and the problem is this code:

let value: ScryptoValue = scrypto_decode(&value).unwrap();

especially in system_struct_to_node_substates.

It is used with WASM code, which is usually huge and sometimes has over 1M elements. The value contains a Vec which is parsed with:

            ValueKind::Array => {
                let element_value_kind = decoder.read_value_kind()?;
                let length = decoder.read_size()?;
                let mut elements = Vec::with_capacity(if length <= 1024 { length } else { 1024 });
                for _ in 0..length {
                    elements.push(decoder.decode_deeper_body_with_value_kind(element_value_kind)?);
                }
                Ok(Value::Array {
                    element_value_kind,
                    elements,
                })
            }

So decoder.decode_deeper_body_with_value_kind(element_value_kind) is sometimes called as many as 1M times, which takes a lot of time, 50-100 ms. I tried to optimize it with:

            ValueKind::Array => {
                let element_value_kind = decoder.read_value_kind()?;
                match element_value_kind {
                    ValueKind::U8 => {
                        let length = decoder.read_size()?;
                        let slice = decoder.read_slice(length as usize)?;
                        Ok(Value::Array {
                            element_value_kind,
                            elements: slice.iter().map(|v| Value::U8 { value: *v }).collect(),
                        })
                    }

which makes it 4.5x faster, though it's not an elegant solution, and an elegant one is hard to come by. Value::U8 uses 48 bytes of memory, so allocating 1M * 48 bytes is probably where much of the time goes. I don't see how to make it faster than that; maybe you'll find a better idea.
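The memory overhead can be illustrated with a simplified stand-in enum (hypothetical; the real generic Value has many more variants and type parameters, which is why it reaches 48 bytes):

```rust
use std::mem;

// Hypothetical stand-in for a generic SBOR-style Value enum. The real
// scrypto Value is larger (48 bytes, per the measurement above), but the
// effect is the same: every element of a Vec<Value> occupies the full
// enum size, no matter how small the variant's payload is.
#[allow(dead_code)]
enum Value {
    U8 { value: u8 },
    U128 { value: u128 },
    Array { element_value_kind: u8, elements: Vec<Value> },
}

// Size paid per element when a byte array is decoded into Vec<Value>.
fn per_element_size() -> usize {
    mem::size_of::<Value>()
}

fn main() {
    let sz = per_element_size();
    println!("size_of::<Value>() = {sz} bytes");
    println!("1M byte elements need about {} bytes of Vec storage", 1_000_000 * sz);
}
```

So a 1 MB WASM blob decoded this way costs tens of megabytes of allocation just for the element vector, which matches the profile.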

I also include a log from perf showing this issue: [perf screenshot attached]

iamyulong commented 7 months ago

Thank you @bbarwik for the detailed analysis and report!

Yes, that u8 array decoding does seem to be very slow, and your attempted improvement is nice. We applied a similar optimization for "regular" Vec<u8> decoding.

To further reduce the memory footprint, we will likely need to split Value::Array into Value::U8Array and Value::NonU8Array. This would create a much more compact in-memory representation of bytes. It doesn't seem to be a trivial change, though, given how widely the Value model is used.
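A rough sketch of that variant split (names are illustrative, not the real sbor API) shows why the byte case gets so much cheaper:

```rust
// Hypothetical sketch of the proposed split: a dedicated byte-array variant
// stores bytes directly, instead of one full enum value per element.
#[allow(dead_code)]
enum CompactValue {
    U8 { value: u8 },
    // 1M bytes stored here cost ~1MB of heap...
    U8Array { elements: Vec<u8> },
    // ...instead of 1M * size_of::<CompactValue>() in the generic case.
    NonU8Array { element_value_kind: u8, elements: Vec<CompactValue> },
}

// Approximate heap footprint of n bytes under each representation:
// (compact byte storage, generic per-element storage).
fn heap_footprints(n: usize) -> (usize, usize) {
    let compact = n * std::mem::size_of::<u8>();
    let generic = n * std::mem::size_of::<CompactValue>();
    (compact, generic)
}
```

The non-trivial part is exactly what the comment says: every consumer matching on Value::Array would need to handle two variants.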

bbarwik commented 7 months ago

@iamyulong

Adding U8Array to the enum would be hard without breaking compatibility.

Also, I made a small mistake: I was benchmarking system_struct_to_node_substates, which was 4.5x faster. When I benchmarked only scrypto_decode, it was ~10x faster for 400k elements. I also tried an unsafe approach with manual memory allocation:

                        unsafe {
                            let align = mem::align_of::<Value<X, Y>>();
                            let val_size = mem::size_of::<Value<X, Y>>();
                            let layout = std::alloc::Layout::from_size_align(length * val_size, align).unwrap();
                            let raw_memory = std::alloc::alloc(layout) as *mut Value<X, Y>;
                            for i in 0..length as isize {
                                let byte_ptr = raw_memory.offset(i) as *mut u8;
                                ptr::write(byte_ptr, 0x06); // it's U8
                                ptr::write(byte_ptr.offset(1), slice[i as usize]);
                            }
                            Ok(Value::Array {
                                element_value_kind,
                                elements: Vec::from_raw_parts(raw_memory, length, length),
                            })
                        }

and it wasn't faster. So I think slice.iter().map(|v| Value::U8 { value: *v }).collect() is a good enough solution for that case; it just needs a nicer implementation, maybe even a macro to support everything from I8 to U128 without a huge match. The same should be done for Vec<> generally, not just Vec<u8>.
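The macro idea might look something like the following (a minimal sketch with a simplified Value enum and illustrative names, not the real sbor types): one fast path per fixed-width integer kind, expanded from a single definition.

```rust
// Hypothetical sketch: a macro expanding the byte-slice fast path to any
// fixed-width integer kind, avoiding a per-element decoder call.
#[allow(dead_code)]
#[derive(Debug)]
enum Value {
    I8(i8),
    U8(u8),
    U16(u16),
}

macro_rules! decode_int_array {
    ($variant:ident, $ty:ty, $raw:expr) => {{
        // Reinterpret the raw payload as little-endian fixed-width integers,
        // producing one Value per chunk in a single pass.
        $raw.chunks_exact(std::mem::size_of::<$ty>())
            .map(|c| Value::$variant(<$ty>::from_le_bytes(c.try_into().unwrap())))
            .collect::<Vec<Value>>()
    }};
}

fn decode_u16_array(raw: &[u8]) -> Vec<Value> {
    decode_int_array!(U16, u16, raw)
}

fn decode_i8_array(raw: &[u8]) -> Vec<Value> {
    decode_int_array!(I8, i8, raw)
}
```

The real implementation would also need to check that the slice length is an exact multiple of the element size and report a decode error otherwise.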

The other, smaller problem is encoding Value::Array; it takes around 1 ms for 400k elements.

            Value::Array {
                element_value_kind,
                elements,
            } => {
                encoder.write_value_kind(*element_value_kind)?;
                encoder.write_size(elements.len())?;
                for item in elements {
                    if item.get_value_kind() != *element_value_kind {
                        return Err(EncodeError::MismatchingArrayElementValueKind {
                            element_value_kind: element_value_kind.as_u8(),
                            actual_value_kind: item.get_value_kind().as_u8(),
                        });
                    }
                    encoder.encode_deeper_body(item)?;
                }
            }

I tested many solutions and this seems to be the best one; it's ~4x faster:

                    ValueKind::U8 => {
                        let buffer: &mut Vec<u8> = encoder.get_buffer();
                        buffer.reserve(elements.len() + 1024); // extra headroom, since this array is usually not the whole payload
                        for item in elements {
                            match item {
                                Value::U8 { value } => {
                                    buffer.push(*value);
                                }
                                _ => {
                                    return Err(EncodeError::MismatchingArrayElementValueKind {
                                        element_value_kind: element_value_kind.as_u8(),
                                        actual_value_kind: item.get_value_kind().as_u8(),
                                    });
                                }
                            }
                        }
                    }

With these two optimizations, system_struct_to_node_substates is a few times faster than the original for a simple package.

iamyulong commented 7 months ago

Sounds great! We will add this to the implementation in due course.

dhedey commented 1 month ago

Hi @bbarwik - thanks for raising this!

I've only just seen this - but you raise an excellent point, and it's one reason I've been pushing against using ScryptoValue in the engine at all (and certainly on critical paths with arbitrary payloads).

Instead, I created RawScryptoValue for these cases, which validates the payload but leaves it "raw" as bytes (without even copying the bytes if it doesn't need to). So I expect the function would get a very big speed-up just by swapping ScryptoValue for RawScryptoValue.
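The idea can be sketched roughly like this (hypothetical names and a stand-in validity check, not the real RawScryptoValue API): validate once, then keep the payload as borrowed bytes instead of materializing a tree of Value nodes.

```rust
use std::borrow::Cow;

// Hypothetical sketch of the "validate, but keep raw" idea. Cow allows
// either borrowing the caller's bytes (the common, zero-copy case) or
// owning them when a copy is unavoidable.
struct RawValue<'a> {
    bytes: Cow<'a, [u8]>,
}

impl<'a> RawValue<'a> {
    // Stand-in check; the real implementation would walk the full SBOR
    // structure without allocating any Value nodes.
    fn from_slice(bytes: &'a [u8]) -> Option<Self> {
        if bytes.is_empty() {
            return None; // reject an obviously invalid payload
        }
        Some(RawValue { bytes: Cow::Borrowed(bytes) })
    }

    // The original bytes are available without any copy having been made.
    fn as_slice(&self) -> &[u8] {
        &self.bytes
    }
}
```

The per-element allocation cost discussed above disappears entirely, because no Value tree is ever built.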

We'll also consider special-casing U8 decoding for a 4.5x speed-up (although it will need a slight tweak to get the depth correct - I've been planning a change to depth tracking for a while which will work well alongside your proposal).