enitrat commented 11 months ago

Description

Memory is one of the core elements of the EVM. Having it as optimized as possible would reduce the overall costs of all transactions ran through the EVM.

Here is my proposal: We can refactor the memory to be 32-bytes words based instead of the current 16-bytes words based implementation.

I started to implement the new version to run some benchmarks. Below are the related code snippets.

16-bytes words memory (current implementation)

    fn store(ref self: Memory, element: u256, offset: usize) {
        let x = testing::get_available_gas();
        gas::withdraw_gas().unwrap();
        let new_min_bytes_len = helpers::ceil_bytes_len_to_next_32_bytes_word(offset + 32);

        let new_bytes_len = if new_min_bytes_len > self.bytes_len {
            new_min_bytes_len
        } else {
            self.bytes_len
        };
        self.bytes_len = new_bytes_len;

        // Check alignment of offset to bytes16 chunks
        let (chunk_index, offset_in_chunk) = u32_safe_divmod(offset, u32_as_non_zero(16));

        if offset_in_chunk == 0 {
            // Offset is aligned. This is the simplest and most efficient case,
            // so we optimize for it.
            self.items.store_u256(element, chunk_index);
            return ();
        }

        // Offset is misaligned.
        // |   W0   |   W1   |   w2   |
        //     |  EL_H  |  EL_L  |
        // ^---^
        //   |-- mask = 256 ** offset_in_chunk

        let mask: u256 = helpers::pow256_rev(offset_in_chunk);
        let mask_c: u256 = utils::pow(256, 16).into() / mask;

        // Split the 2 input bytes16 chunks at offset_in_chunk.

        let (el_hh, el_hl) = u256_safe_div_rem(
            u256 { low: element.high, high: 0 }, u256_as_non_zero(mask_c)
        );

        let (el_lh, el_ll) = u256_safe_div_rem(
            u256 { low: element.low, high: 0 }, u256_as_non_zero(mask_c)
        );

        // Read the words at chunk_index, chunk_index + 2.
        let w0: u128 = self.items.get(chunk_index.into());
        let w2: u128 = self.items.get(chunk_index.into() + 2);

        // Compute the new words
        let w0_h: u256 = (w0.into() / mask);
        let w2_l: u256 = (w2.into() / mask);

        // We can convert them back to felt252 as we know they fit in one word.
        let new_w0: u128 = (w0_h.into() * mask + el_hh).try_into().unwrap();
        let new_w1: u128 = (el_hl.into() * mask + el_lh).try_into().unwrap();
        let new_w2: u128 = (el_ll.into() * mask + w2_l).try_into().unwrap();

        // Write the new words
        self.items.insert(chunk_index.into(), new_w0);
        self.items.insert(chunk_index.into() + 1, new_w1);
        self.items.insert(chunk_index.into() + 2, new_w2);
        (x - testing::get_available_gas()).print();
    }

32-bytes words memory (optimisation proposal)

    fn store(ref self: Memory, element: u256, offset: usize) {
        let x = testing::get_available_gas();
        gas::withdraw_gas().unwrap();
        let new_min_bytes_len = helpers::ceil_bytes_len_to_next_32_bytes_word(offset + 32);

        let new_bytes_len = if new_min_bytes_len > self.bytes_len {
            new_min_bytes_len
        } else {
            self.bytes_len
        };
        self.bytes_len = new_bytes_len;

        // Check alignment of offset to 32-bytes chunks
        let (chunk_index, offset_in_chunk) = u32_safe_divmod(offset, u32_as_non_zero(32));

        if offset_in_chunk == 0 {
            // Offset is aligned. This is the simplest and most efficient case,
            // so we optimize for it.
            self.items.insert(chunk_index.into(), NullableExtensionTrait::new(element));
            return ();
        }

        // Offset is misaligned.
        // |   W0     |   W1   |
        //     |     EL     |
        // ^---^
        //   |-- mask = 256 ** offset_in_chunk

        let mask: u256 = helpers::pow256_rev(offset_in_chunk);
        let mask_c: u256 = utils::pow(256, 32).into() / mask;

        // Split the input two chunks at offset_in_chunk.

        let (el_h, el_l) = u256_safe_div_rem(element, u256_as_non_zero(mask_c));

        // Read the words at chunk_index, chunk_index + 1.
        let w0: u256 = self.items.get(chunk_index.into()).deref_or_0();
        let w1: u256 = self.items.get(chunk_index.into() + 1).deref_or_0();

        // Compute the new words
        let w0: u256 = (w0.into() / mask);
        let w1: u256 = (w1.into() / mask);

        // We can convert them back to felt252 as we know they fit in one word.
        let new_w0: u256 = (w0.into() * mask + el_h);
        let new_w1: u256 = (el_l.into() * mask + w1);

        // Write the new words

        self.items.insert(chunk_index.into(), NullableExtensionTrait::new(new_w0));
        self.items.insert(chunk_index.into() + 1, NullableExtensionTrait::new(new_w1));
        (x - testing::get_available_gas()).print();
    }
}

Benchmarks

I ran tests on both implementations and computed the actual gas usage of both functions. Here is the output

❯ scarb test -f test_store_should_add_an_element_to_the_memory_with_offset
testing kakarot ...
running 2 tests
[DEBUG(memory_32_bytes_chunks)]                                 (raw: 0x36cfe

[DEBUG(memory_16_bytes_chunks)]                                 (raw: 0x38e78

This leads to a ~4% gas decrease for each store operation

Proposal

Replace the 16-bytes words implementation with the 32-bytes words one.

Eikix commented 9 months ago

Important update: we are gathering some bugs in the Kakarot v0 codebase, we need to make sure each issue and each PR in Kakarot-ssj is aware of the lists of known bugs. Look at this link everytime you take an issue and check your issue isn't targeted by a known bug.

Eikix commented 9 months ago

Important update: we are gathering some bugs in the Kakarot v0 codebase, we need to make sure each issue and each PR in Kakarot-ssj is aware of the lists of known bugs. Look at this tracking issue everytime you take an issue and check your issue isn't targeted by a known bug. Will add this reminder in many places to make sure we keep track of known bugs.

kkrt-labs / kakarot-ssj

dev: optimize memory by using 32-byte words #51

Description

16-bytes words memory (current implementation)

32-bytes words memory (optimisation proposal)

Benchmarks

Proposal