lambdaclass / cairo_native

A compiler to convert Cairo's intermediate representation "Sierra" code to MLIR.
https://lambdaclass.github.io/cairo_native/cairo_native
Apache License 2.0

bug: regressions introduced by latest main #831

Open enitrat opened 1 week ago

enitrat commented 1 week ago

main (commit hash to be confirmed) introduced a few regressions.

git clone https://github.com/kkrt-labs/ef-tests
# ensure the correct branch
git checkout dev/bump-native
make
make setup-kakarot
cargo test --test tests stStack --no-fail-fast --features "v1,native,ci" -- --nocapture

^ this will run the full suite for the stStack family. The first issue is also reproducible when running a specific test, e.g.:

cargo test test_sar_2_xor_255_256_d0g0v0_Cancun --features v1,native -- --nocapture

  1. Execution regressions

The execution is no longer the same. What used to pass, now fails with:

 WARN ef_testing::models::result: stStackTests::shallowStack_d10g0v0_Cancun reverted:
Transaction execution has failed:
0: Error in the called contract (contract address: 0x0000000000000000000000000000000000000000000000000000000000000003, class hash: 0x06153ccf69fd20f832c794df36e19135f0070d0576144f0b47f75a226e4be530, selector: 0x015d40a3d6ca2ac30f4031e42be28da9b056fef9bb7357ac5e85627ee876e5ad):
Error at pc=0:25:
Cairo traceback (most recent call last):
Unknown location (pc=0:731)
Unknown location (pc=0:677)
Unknown location (pc=0:291)
Unknown location (pc=0:314)

1: Error in the called contract (contract address: 0x04c15a3ff2822ca5b75a67a2ff0226b8b18a94c1c7d396b0073269c5b074d14f, class hash: 0x05b953b2f2ed1e6a4943dc934cc8a5fb12de801745a40857f4459bb375c0f04c, selector: 0x007ec457cd7ed1630225a8328f826a29a327b19486f6b2882b4176545ebdbe3d):
Native execution error: EOA: could not decode tx

thread 'stStackTests::stStackTests::test_shallowStack_d10g0v0_Cancun' panicked at crates/ef-testing/tests/stStackTests.rs:1379:24:
Error while running the test: Other(
gas used mismatch: expected 300000, got 0
nonce mismatch for 0xa94f5374fce5edbc8e2a8697c15331677e6ebf0b: expected 0x1, got 0x0
revert reason:
)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test stStackTests::stStackTests::test_shallowStack_d10g0v0_Cancun ... FAILED

^ This happens for practically all tests and points to a difference somewhere during execution, probably introduced by one of the latest commits in the last PR.

 WARN ef_testing::models::result: stStackTests::shallowStack_d53g0v0_Cancun reverted:
Transaction execution has failed:
0: Error in the called contract (contract address: 0x0000000000000000000000000000000000000000000000000000000000000003, class hash: 0x06153ccf69fd20f832c794df36e19135f0070d0576144f0b47f75a226e4be530, selector: 0x015d40a3d6ca2ac30f4031e42be28da9b056fef9bb7357ac5e85627ee876e5ad):
Error at pc=0:25:
Cairo traceback (most recent call last):
Unknown location (pc=0:731)
Unknown location (pc=0:677)
Unknown location (pc=0:291)
Unknown location (pc=0:314)

1: Error in the called contract (contract address: 0x04c15a3ff2822ca5b75a67a2ff0226b8b18a94c1c7d396b0073269c5b074d14f, class hash: 0x05b953b2f2ed1e6a4943dc934cc8a5fb12de801745a40857f4459bb375c0f04c, selector: 0x007ec457cd7ed1630225a8328f826a29a327b19486f6b2882b4176545ebdbe3d):
Native execution error: Native execution error: Invalid chain id

thread 'stStackTests::stStackTests::test_shallowStack_d54g0v0_Cancun' panicked at crates/ef-testing/tests/stStackTests.rs:1955:24:
Error while running the test: Other(
gas used mismatch: expected 300000, got 0
nonce mismatch for 0xa94f5374fce5edbc8e2a8697c15331677e6ebf0b: expected 0x1, got 0x0
revert reason:
)
test stStackTests::stStackTests::test_shallowStack_d54g0v0_Cancun ... FAILED
  2. A weird SIGTRAP issue that terminates the execution:
// ... logs above
error: test failed, to rerun pass `-p ef-testing --test tests`

Caused by:
  process didn't exit successfully: `/Users/msaug/kkrt-labs/ef-tests/target/debug/deps/tests-48dbd4207bfa156f stStack --nocapture` (signal: 5, SIGTRAP: trace/breakpoint trap)

^ This happens when running multiple tests at once, but I don't know exactly which one triggers it.
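
If it only shows up with several tests in flight, forcing the harness to a single thread should make it disappear. A quick check, reusing the suite command from above:

# run the stStack suite serially to rule out cross-test interference
cargo test --test tests stStack --no-fail-fast --features "v1,native,ci" -- --nocapture --test-threads=1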

enitrat commented 1 week ago

I ran another set of tests based on 1b8656c1581dc64243cf2dda16240da49aec5b96 and it seems the issue was already present at that commit; I will keep bisecting further back.
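
One way to pin down the offending commit is to bisect Native while using the single-test repro above as the probe; a sketch, with a placeholder for the known-good commit:

# in the cairo_native checkout
git bisect start
git bisect bad HEAD
git bisect good <known-good-commit>
# at each step: re-point ef-tests at this checkout, rebuild, run the failing test,
# then mark the result
cargo test test_sar_2_xor_255_256_d0g0v0_Cancun --features v1,native -- --nocapture
git bisect good   # or: git bisect bad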

azteca1998 commented 1 week ago

Is the write-after-free issue no longer reproducible? I can debug that one while you try to find when this one happened.

enitrat commented 1 week ago

Is the write-after-free issue no longer reproducible?

So far (on https://github.com/lambdaclass/cairo_native/commit/79a8dcffe1e562e80198425e7c9f3a328c783947) I haven't been able to repro. I'm not sure what conditions lead to it happening; perhaps it depends on parallel threads during test execution.
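
In case it helps the repro: on macOS, the system allocator's debugging variables can sometimes make this class of bug fire deterministically. A sketch, reusing the single-test command from above:

# fill allocated/freed memory with sentinel bytes and guard allocation edges
MallocScribble=1 MallocPreScribble=1 MallocGuardEdges=1 \
  cargo test test_sar_2_xor_255_256_d0g0v0_Cancun --features v1,native -- --nocapture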

enitrat commented 1 week ago

So, I can confirm that the regression was actually introduced in c866fbf8417cfcf522fd809348a483c5daef2e45; things work fine when using commit ae17dd370a7bbf6affeefb9fa6954965e8b52239.

edit: mistake on my end, it comes with https://github.com/lambdaclass/cairo_native/commit/79a8dcffe1e562e80198425e7c9f3a328c783947

azteca1998 commented 1 week ago

Strange. That commit just replaces an assert with logic to handle the asserted case. Since the test used to pass, the assert wasn't being triggered, so execution takes the branch that generates exactly the same instructions as before the commit.

There should be no difference in the generated code between before and after this commit.

enitrat commented 1 week ago

🤔 Let me retry; it could be a cache issue. In the meantime, I added repro guidelines based on latest main. It's a bit tedious because I need to update several dependency repos, and I might have lost myself in the process 😓

azteca1998 commented 1 week ago

I think the issue was already there in the commit that updates to LLVM 19... Either that or I don't know how to run your tests.

I had to comment out the lines that disable the stack tests in blockchain-tests-skip.yml. Otherwise all tests were ignored, and they did nothing when forced to run anyway.

I also tried running make runtime in native at the LLVM 19 update commit and in the current main just in case there was an update there, but with no luck.

enitrat commented 1 week ago

I think the issue was already there in the commit that updates to LLVM 19... Either that or I don't know how to run your tests.

I didn't have the issue in branch fix-clone-drop-for-enum-structs

I had to comment out the lines that disable the stack tests in blockchain-tests-skip.yml. Otherwise all tests were ignored, and they did nothing when forced to run anyway.

Oh, sorry, I committed this file 🤦

I also tried running make runtime in native at the LLVM 19 update commit and in the current main just in case there was an update there, but with no luck.

Native is running in the blockifier dependency, which on the branch above is pinned to https://github.com/lambdaclass/cairo_native/commit/79a8dcffe1e562e80198425e7c9f3a328c783947, so even if you update the runtime locally, the executor will be compiled against the Native version used in the "sequencer" dependency.

azteca1998 commented 1 week ago

I changed that dependency too, otherwise it wouldn't compile. Both blockifier and your crate pointed to my local folder.

enitrat commented 1 week ago

OK, I re-ran on c866fbf8417cfcf522fd809348a483c5daef2e45 and it works; I must've forgotten to clear the cached executor.
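
For reference, a minimal "start from scratch" sequence; the executor cache location is project-specific, so the path below is a hypothetical placeholder:

# drop all build artifacts, then drop the cached compiled executors
cargo clean
rm -rf <path-to-cached-executors>   # hypothetical; adjust to wherever ef-tests caches them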

enitrat commented 1 week ago

I changed that dependency too, otherwise it wouldn't compile. Both blockifier and your crate pointed to my local folder.

I'm surprised; I don't see any compilation error on my side. I edited the skip file so it works properly.

enitrat commented 1 week ago

Here's the LLDB trace for the write-after-free issue when running: cargo test --test tests stStack --no-fail-fast --features "v1,native,ci" (runs all tests in a single target and filters for stStack)
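
For reference, a trace like the one below can be captured by running the compiled test binary directly under lldb; the hash in the binary name is machine-specific:

lldb -- target/debug/deps/tests-<hash> stStack --nocapture
(lldb) run
(lldb) bt    # after the abort, print the backtrace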

Here we don't see cairo native being active: the initial call is made to a Cairo0 account, and is thus executed by the regular CairoVM.

The call flow is tx -> relayer (cairo0) -> account (cairo1) -> kakarot(cairo1) -> end

I guess it fails in the relayer step?

* thread #18, name = 'stStackTests::stStackTests::test_stackOverflowM1DUP_d10g0v0_Can', stop reason = signal SIGABRT
  * frame #0: 0x0000000197ed4764 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000197f0bc28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x0000000197e19ae8 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000197d3ae28 libsystem_malloc.dylib`malloc_vreport + 908
    frame #4: 0x0000000197d515d4 libsystem_malloc.dylib`malloc_zone_error + 104
    frame #5: 0x0000000197d5128c libsystem_malloc.dylib`free_list_checksum_botch + 40
    frame #6: 0x0000000197d32f5c libsystem_malloc.dylib`tiny_free_list_remove_ptr + 452
    frame #7: 0x0000000197d32458 libsystem_malloc.dylib`tiny_free_no_lock + 436
    frame #8: 0x0000000197d32120 libsystem_malloc.dylib`free_tiny + 496
    frame #9: 0x000000010b9a052c tests-0123119940fab6c0`__rdl_realloc [inlined] std::sys::pal::unix::alloc::_$LT$impl$u20$core..alloc..global..GlobalAlloc$u20$for$u20$std..alloc..System$GT$::dealloc::hc599de7cd06b14b4 at alloc.rs:48:9 [opt]
    frame #10: 0x000000010b9a0524 tests-0123119940fab6c0`__rdl_realloc [inlined] std::sys::pal::common::alloc::realloc_fallback::hf475d51777e5c5cb at alloc.rs:56:9 [opt]
    frame #11: 0x000000010b9a04dc tests-0123119940fab6c0`__rdl_realloc [inlined] std::sys::pal::unix::alloc::_$LT$impl$u20$core..alloc..global..GlobalAlloc$u20$for$u20$std..alloc..System$GT$::realloc::he5221d6b126ec4ab at alloc.rs:56:13 [opt]
    frame #12: 0x000000010b9a04dc tests-0123119940fab6c0`__rdl_realloc at alloc.rs:424:13 [opt]
    frame #13: 0x000000010b6d8104 tests-0123119940fab6c0`alloc::alloc::Global::grow_impl::h2451d937db69a051 [inlined] alloc::alloc::realloc::h0c8df755c611e652 at alloc.rs:138:14
    frame #14: 0x000000010b6d80e8 tests-0123119940fab6c0`alloc::alloc::Global::grow_impl::h2451d937db69a051(self=0x0000000136226e28, ptr=(pointer = ""), old_layout=Layout @ 0x000000031fff42f8, new_layout=Layout @ 0x000000031fff4308, zeroed=false) at alloc.rs:215:31
    frame #15: 0x000000010b6d8454 tests-0123119940fab6c0`_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::grow::hf3ec351522dd37f2(self=0x0000000136226e28, ptr=(pointer = ""), old_layout=Layout @ 0x000000031fff4550, new_layout=Layout @ 0x000000031fff4560) at alloc.rs:268:18
    frame #16: 0x000000010b6d5008 tests-0123119940fab6c0`alloc::raw_vec::finish_grow::h74ef05c11f3e31f1(new_layout=Result<core::alloc::layout::Layout, core::alloc::layout::LayoutError> @ 0x000000031fff4620, current_memory=Option<(core::ptr::non_null::NonNull<u8>, core::alloc::layout::Layout)> @ 0x000000031fff48a8, alloc=0x0000000136226e28) at raw_vec.rs:587:13
    frame #17: 0x000000010b1c428c tests-0123119940fab6c0`alloc::raw_vec::RawVec$LT$T$C$A$GT$::grow_amortized::h2e7b27970944a87d(self=0x0000000136226e18, len=6, additional=5) at raw_vec.rs:486:19
    frame #18: 0x000000010b1c165c tests-0123119940fab6c0`alloc::raw_vec::RawVec$LT$T$C$A$GT$::try_reserve::he07a584328cd9550(self=0x0000000136226e18, len=6, additional=5) at raw_vec.rs:372:13
    frame #19: 0x000000010b1700bc tests-0123119940fab6c0`alloc::vec::Vec$LT$T$C$A$GT$::try_reserve::hcf4171a006c4cb35(self=0x0000000136226e18, additional=5) at mod.rs:1039:9
    frame #20: 0x000000010b132290 tests-0123119940fab6c0`cairo_vm::vm::vm_memory::memory::Memory::insert::h72d4a8bef1e04c33(self=0x000000031fff68a0, key=(segment_index = 1, offset = 10), val=0x000000031fff5160) at memory.rs:209:13
    frame #21: 0x000000010b15a32c tests-0123119940fab6c0`cairo_vm::vm::vm_core::VirtualMachine::insert_deduced_operands::h069c2aaece09cf80(self=0x000000031fff6870, deduced_operands=(__0 = '\x01'), operands=0x000000031fff5160, operands_addresses=0x000000031fff5200) at vm_core.rs:395:13
    frame #22: 0x000000010b15a478 tests-0123119940fab6c0`cairo_vm::vm::vm_core::VirtualMachine::run_instruction::hb944f2c213bd0e88(self=0x000000031fff6870, instruction=0x000000011f8495a0) at vm_core.rs:407:9
    frame #23: 0x000000010b15af48 tests-0123119940fab6c0`cairo_vm::vm::vm_core::VirtualMachine::step_instruction::h5738b4ef0ffbbe0d(self=0x000000031fff6870) at vm_core.rs:527:17
    frame #24: 0x000000010b15b290 tests-0123119940fab6c0`cairo_vm::vm::vm_core::VirtualMachine::step::hb91319f27d5fe2de(self=0x000000031fff6870, hint_processor=&mut dyn cairo_vm::hint_processor::hint_processor_definition::HintProcessor @ 0x000000031fff5c60, exec_scopes=0x000000031fff6b38, hint_datas=&[alloc::boxed::Box<dyn core::any::Any, alloc::alloc::Global>] @ 0x000000031fff5c78, constants=0x000000031fff6af0) at vm_core.rs:567:9
    frame #25: 0x000000010b20c98c tests-0123119940fab6c0`cairo_vm::vm::runners::cairo_runner::CairoRunner::run_until_pc::h1501663d902ed11b(self=0x000000031fff6870, address=(segment_index = 9, offset = 0), hint_processor=&mut dyn cairo_vm::hint_processor::hint_processor_definition::HintProcessor @ 0x000000031fff5ee8) at cairo_runner.rs:655:13
    frame #26: 0x000000010b20e798 tests-0123119940fab6c0`cairo_vm::vm::runners::cairo_runner::CairoRunner::run_from_entrypoint::hd3b1809f65144c52(self=0x000000031fff6870, entrypoint=699, args=&[&cairo_vm::vm::runners::cairo_runner::CairoArg] @ 0x000000031fff6458, verify_secure=true, program_segment_size=Option<usize> @ 0x000000031fff6470, hint_processor=&mut dyn cairo_vm::hint_processor::hint_processor_definition::HintProcessor @ 0x000000031fff6480) at cairo_runner.rs:1111:9
    frame #27: 0x0000000101c984cc tests-0123119940fab6c0`blockifier::execution::deprecated_entry_point_execution::run_entry_point::hd0cd38b2c30e115b(runner=0x000000031fff6870, hint_processor=0x000000031fff6c20, entry_point_pc=699, args=Vec<cairo_vm::vm::runners::cairo_runner::CairoArg, alloc::alloc::Global> @ 0x000000031fff8130) at deprecated_entry_point_execution.rs:198:18
enitrat commented 1 week ago

LLDB trace when running: cargo test stStack --features v1,native (only run the stStack tests)

* thread #25, name = 'stStackTests::test_stackOverflowM1DUP_d13g0v0_Cancun', stop reason = signal SIGABRT
  * frame #0: 0x0000000197ed4764 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000197f0bc28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x0000000197e19ae8 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000197d3ae28 libsystem_malloc.dylib`malloc_vreport + 908
    frame #4: 0x0000000197d515d4 libsystem_malloc.dylib`malloc_zone_error + 104
    frame #5: 0x0000000197d51354 libsystem_malloc.dylib`_tiny_check_and_zero_inline_meta_from_freelist + 188
    frame #6: 0x0000000197d2ff70 libsystem_malloc.dylib`tiny_malloc_from_free_list + 640
    frame #7: 0x0000000197d2f670 libsystem_malloc.dylib`tiny_malloc_should_clear + 216
    frame #8: 0x0000000197d2e24c libsystem_malloc.dylib`szone_malloc_should_clear + 92
    frame #9: 0x00000001001fdd80 stStackTests-9af27425979037f3`alloc::alloc::alloc::h333164c5ca5691c6(layout=Layout @ 0x000000037001adb8) at alloc.rs:100:9
    frame #10: 0x00000001001fde98 stStackTests-9af27425979037f3`alloc::alloc::Global::alloc_impl::h51d1578827c2007b(self=0x000000037001afe7, layout=Layout @ 0x000000037001ae58, zeroed=false) at alloc.rs:183:73
    frame #11: 0x00000001001fe438 stStackTests-9af27425979037f3`_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::h5f5e0c10f2fbdaf9(self=0x000000037001afe7, layout=Layout @ 0x000000037001afa0) at alloc.rs:243:9
    frame #12: 0x00000001001cf620 stStackTests-9af27425979037f3`alloc::boxed::Box$LT$T$C$A$GT$::try_new_uninit_in::h95ed53467f54c743(alloc=Global @ 0x000000037001afe7) at boxed.rs:529:13
    frame #13: 0x00000001001cf140 stStackTests-9af27425979037f3`alloc::boxed::Box$LT$T$C$A$GT$::new_uninit_in::h3d30583466457255(alloc=Global @ 0x000000037001b0a7) at boxed.rs:491:15
    frame #14: 0x00000001001cc850 stStackTests-9af27425979037f3`alloc::collections::btree::node::LeafNode$LT$K$C$V$GT$::new::h1fcbe726819ce489(alloc=Global @ 0x000000037001b11e) at node.rs:83:28
    frame #15: 0x00000001001d5e20 stStackTests-9af27425979037f3`alloc::collections::btree::map::entry::VacantEntry$LT$K$C$V$C$A$GT$::insert::hf5380ac41efb76b9 [inlined] alloc::collections::btree::node::NodeRef$LT$alloc..collections..btree..node..marker..Owned$C$K$C$V$C$alloc..collections..btree..node..marker..Leaf$GT$::new_leaf::h7db0531265f53255 at node.rs:217:29
    frame #16: 0x00000001001d5e1c stStackTests-9af27425979037f3`alloc::collections::btree::map::entry::VacantEntry$LT$K$C$V$C$A$GT$::insert::hf5380ac41efb76b9(self=VacantEntry<ruint::Uint<256, 4>, ruint::Uint<256, 4>, alloc::alloc::Global> @ 0x000000037001b420, value=Uint<256, 4> @ 0x000000037001b460) at entry.rs:356:32
    frame #17: 0x00000001001fd9ec stStackTests-9af27425979037f3`alloc::collections::btree::map::BTreeMap$LT$K$C$V$C$A$GT$::insert::h9e0752f0049e0ed9(self=0x000000037001b500, key=<unavailable>, value=Uint<256, 4> @ 0x000000037001b610) at map.rs:991:17
    frame #18: 0x00000001001fb504 stStackTests-9af27425979037f3`_$LT$serde..de..impls..$LT$impl$u20$serde..de..Deserialize$u20$for$u20$alloc..collections..btree..map..BTreeMap$LT$K$C$V$GT$$GT$..deserialize..MapVisitor$LT$K$C$V$GT$$u20$as$u20$serde..de..Visitor$GT$::visit_map::hf24149fa3d417930((null)=(marker = core::marker::PhantomData<alloc::collections::btree::map::BTreeMap<ruint::Uint<256, 4>, ruint::Uint<256, 4>, alloc::alloc::Global> > @ 0x000000037001b687), map=MapAccess<serde_json::read::StrRead> @ 0x000000037001b4f0) at impls.rs:1554:29
    frame #19: 0x00000001001e1294 stStackTests-9af27425979037f3`_$LT$$RF$mut$u20$serde_json..de..Deserializer$LT$R$GT$$u20$as$u20$serde..de..Deserializer$GT$::deserialize_map::h1601009c3235db83(self=0x000000037001d2c0, visitor=(marker = core::marker::PhantomData<alloc::collections::btree::map::BTreeMap<ruint::Uint<256, 4>, ruint::Uint<256, 4>, alloc::alloc::Global> > @ 0x000000037001b727)) at de.rs:1795:31
    frame #20: 0x00000001001fdfb0 stStackTests-9af27425979037f3`serde::de::impls::_$LT$impl$u20$serde..de..Deserialize$u20$for$u20$alloc..collections..btree..map..BTreeMap$LT$K$C$V$GT$$GT$::deserialize::h0b063634e9c07089(deserializer=0x000000037001d2c0) at impls.rs:1562:17
    frame #21: 0x00000001001d1a58 stStackTests-9af27425979037f3`_$LT$core..marker..PhantomData$LT$T$GT$$u20$as$u20$serde..de..DeserializeSeed$GT$::deserialize::h26523c67ea91cdc8((null)=PhantomData<alloc::collections::btree::map::BTreeMap<ruint::Uint<256, 4>, ruint::Uint<256, 4>, alloc::alloc::Global>> @ 0x000000037001b8b7, deserializer=0x000000037001d2c0) at mod.rs:800:9
    frame #22: 0x00000001001da2d8 stStackTests-9af27425979037f3`_$LT$serde_json..de..MapAccess$LT$R$GT$$u20$as$u20$serde..de..MapAccess$GT$::next_value_seed::hae6bf5f3728cadfb(self=0x000000037001b9c8, seed=PhantomData<alloc::collections::btree::map::BTreeMap<ruint::Uint<256, 4>, ruint::Uint<256, 4>, alloc::alloc::Global>> @ 0x000000037001b917) at de.rs:2011:9
    frame #23: 0x00000001001d770c stStackTests-9af27425979037f3`serde::de::MapAccess::next_value::hfba31c5fe87f3614(self=0x000000037001b9c8) at mod.rs:1873:9
    frame #24: 0x00000001001f3890 stStackTests-9af27425979037f3`_$LT$ef_tests..models.._..$LT$impl$u20$serde..de..Deserialize$u20$for$u20$ef_tests..models..Account$GT$..deserialize..__Visitor$u20$as$u20$serde..de..Visitor$GT$::visit_map::h7e8f82f91c8fde30((null)=(marker = core::marker::PhantomData<ef_tests::models::Account> @ 0x000000037001c077, lifetime = core::marker::PhantomData<void *> @ 0x000000037001c077), __map=MapAccess<serde_json::read::StrRead> @ 0x000000037001b9c8) at models.rs:198:32

-> In this one, we don't even reach the step where a tx is sent 🤔 Could there be side effects from multithreading, with another test executing in parallel?

enitrat commented 1 week ago

I believe it was introduced in https://github.com/lambdaclass/cairo_native/pull/815/commits/4aa739beced8aae5894c8aa2dcaee29bb33589bb

I did a run on https://github.com/lambdaclass/cairo_native/pull/815/commits/483726eacfef8cac8ebd27f66f975c2ede2eb49f - fine

The next one on https://github.com/lambdaclass/cairo_native/pull/815/commits/47ffbc1b41fefc54df83c17a6eb8283441d67da0 - broken

azteca1998 commented 1 week ago

You're right. I implemented an if statement backwards. Could you try with #832?

enitrat commented 1 week ago

Fixed 👍

azteca1998 commented 1 week ago

What else was pending?

enitrat commented 1 week ago

Memory usage looks a lot better; perhaps a few issues persist?

[image: memory usage graph]

I am starting a run on CI and will report back in ~1hr

I think everything's fine now (secp256 point is on the blockifier side)

enitrat commented 1 week ago

Unfortunately it still fails our CI... I think it consumes too much memory https://github.com/kkrt-labs/kakarot-ssj/actions/runs/11182543855/job/31089141465

There's no explicit shutdown message so I believe we're just hitting a resource limit

enitrat commented 1 week ago

https://github.com/lambdaclass/cairo_native/issues/826#issuecomment-2391351065

The max memory usage previously hit in a run (before the snapshot-drop fixes) was ~13 GB; now it's ~25 GB, which might be too much for our runner. So we'll probably need to upgrade it, or find a way to decrease memory consumption in Native mode.
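
For tracking peak memory across runs, /usr/bin/time reports the maximum resident set size; a sketch (macOS flag shown; GNU time on Linux uses -v instead of -l):

/usr/bin/time -l cargo test --test tests stStack --features "v1,native,ci" 2>&1 | grep -i "maximum resident"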

azteca1998 commented 1 week ago

Wait, don't upgrade yet. I know for a fact we still have some memory leaks.

I'm working on a piece of code that'll help me find them. I'll let you know once I've fixed the ones I've found.

enitrat commented 1 week ago

thank you very much 🙏

azteca1998 commented 1 week ago

Could you try with #833? That one fixes all the memory leaks I could find running all our JIT tests.

enitrat commented 1 week ago
[image: memory usage graph]

Looks good now!

enitrat commented 1 week ago

Hm, actually I think there is a very specific issue in certain test cases that end up running forever: CI took 6 hrs and failed on the time limit (instead of finishing in ~50 min).

I'll investigate this manually later today

https://github.com/kkrt-labs/kakarot-ssj/actions/runs/11222856617/job/31196253181?pr=1011

enitrat commented 1 week ago

I just hit an issue when running locally:

* thread #6, name = 'stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun', stop reason = EXC_BAD_ACCESS (code=2, address=0x170633e80)
    frame #0: 0x000000017280212c 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::instructions::memory_operations::MemoryOperation::exec_pc(f179) + 18816
504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::instructions::memory_operations::MemoryOperation::exec_pc(f179):
->  0x17280212c <+18816>: str    x22, [x19, #0x20]
    0x172802130 <+18820>: ldr    x22, [sp, #0x1108]
    0x172802134 <+18824>: str    x22, [x19, #0x10]
    0x172802138 <+18828>: ldr    w22, [sp, #0x338]
(lldb) bt
* thread #6, name = 'stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun', stop reason = EXC_BAD_ACCESS (code=2, address=0x170633e80)
  * frame #0: 0x000000017280212c 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::instructions::memory_operations::MemoryOperation::exec_pc(f179) + 18816
    frame #1: 0x000000017248f9c8 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::execute_opcode(f135) + 381252
    frame #2: 0x00000001723ec248 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::execute_code(f107) + 9600
    frame #3: 0x00000001723949a4 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::process_message(f78) + 24288
    frame #4: 0x0000000172ae167c 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::call_helpers::CallHelpersImpl::generic_call(f255) + 47004
    frame #5: 0x00000001726d5b90 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::instructions::system_operations::SystemOperations::exec_staticcall(f148) + 103480
    frame #6: 0x00000001725bb4f8 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::execute_opcode(f135) + 1608820
    frame #7: 0x00000001723ec248 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::execute_code(f107) + 9600
    frame #8: 0x00000001723949a4 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::process_message(f78) + 24288
    frame #9: 0x0000000172ae167c 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::call_helpers::CallHelpersImpl::generic_call(f255) + 47004
    frame #10: 0x00000001726d5b90 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::instructions::system_operations::SystemOperations::exec_staticcall(f148) + 103480
    frame #11: 0x00000001725bb4f8 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::execute_opcode(f135) + 1608820
    frame #12: 0x00000001723ec248 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::execute_code(f107) + 9600
    frame #13: 0x00000001723949a4 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::process_message(f78) + 24288
    frame #14: 0x0000000172ae167c 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::call_helpers::CallHelpersImpl::generic_call(f255) + 47004
    frame #15: 0x0000000172787ca8 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::instructions::system_operations::SystemOperations::exec_call(f154) + 139700
    frame #16: 0x000000017259f078 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::execute_opcode(f135) + 1492980
    frame #17: 0x00000001723ec248 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::execute_code(f107) + 9600
    frame #18: 0x00000001723949a4 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::process_message(f78) + 24288
    frame #19: 0x0000000172345fc0 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::process_message_call(f62) + 5284
    frame #20: 0x00000001722fa60c 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$evm::interpreter::EVMImpl::process_transaction(f48) + 72040
    frame #21: 0x00000001722c18e0 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$contracts::kakarot_core::eth_rpc::EthRPC::<contracts::kakarot_core::kakarot::KakarotCore::ContractState, contracts::kakarot_core::kakarot::KakarotCore::_KakarotCoreState, contracts::kakarot_core::kakarot::KakarotCore::ContractStateDrop>::eth_send_transaction(f39) + 15028
    frame #22: 0x0000000172290f5c 504898746202199744673968940527932241844513528453156538068464172101734859290`impl$contracts::kakarot_core::eth_rpc::__wrapper__EthRPC__eth_send_transaction::<contracts::kakarot_core::kakarot::KakarotCore::ContractState, contracts::kakarot_core::kakarot::KakarotCore::_KakarotCoreState, contracts::kakarot_core::kakarot::KakarotCore::ContractStateDrop, contracts::kakarot_core::kakarot::KakarotCore::ContractStateEthRPC>(f23) + 4148
    frame #23: 0x00000001722921b0 504898746202199744673968940527932241844513528453156538068464172101734859290`_mlir_ciface_f23 + 84
    frame #24: 0x00000001007f2d94 stStaticCall-b270e27c5493ce0a`invoke_trampoline + 108
    frame #25: 0x00000001004c7d94 stStaticCall-b270e27c5493ce0a`blockifier::execution::native::utils::run_native_executor::hbc529b733575cb07(native_executor=0x00006000021cc010, function_id=0x0000000117015b00, call=<unavailable>, syscall_handler=<unavailable>) at utils.rs:45:28

I'll add instructions to repro from scratch (if I manage to).

enitrat commented 1 week ago

^ So actually the above happens after I increase the default stack size with export RUST_MIN_STACK=16777216. If I don't, it just stack-overflows. Same repro instructions as in this issue, but running cargo test test_static_ABAcalls0_d1g0v0_Cancun --features v1,native -- --nocapture
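
For context: each Rust test runs in its own thread, and RUST_MIN_STACK overrides that thread's default 2 MiB stack at spawn time, so raising it further is a cheap way to probe whether the failure is pure stack depth. The 32 MiB below is an arbitrary probe value, not a recommendation:

# double the 16 MiB used above; if the crash point moves, the failure scales with stack depth
RUST_MIN_STACK=33554432 cargo test test_static_ABAcalls0_d1g0v0_Cancun --features v1,native -- --nocapture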

azteca1998 commented 6 days ago

I'm having difficulties running that command. The compiler complains about reth-trie.

error[E0277]: the trait bound `reth_primitives::Account: From<revm::revm_primitives::AccountInfo>` is not satisfied
  --> /Users/esteve/.cargo/git/checkouts/reth-36d3ea1d1152b20c/75b7172/crates/trie/trie/src/state.rs:35:38
   |
35 |                 let hashed_account = account.info.clone().map(Into::into);
   |                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `From<revm::revm_primitives::AccountInfo>` is not implemented for `reth_primitives::Account`, which is required by `revm::revm_primitives::AccountInfo: Into<_>`
   |
   = help: the following other types implement trait `From<T>`:
             `reth_primitives::Account` implements `From<&GenesisAccount>`
             `reth_primitives::Account` implements `From<reth_primitives::revm_primitives::AccountInfo>`
   = note: required for `revm::revm_primitives::AccountInfo` to implement `Into<reth_primitives::Account>`

error[E0277]: the trait bound `reth_primitives::Account: From<revm::revm_primitives::AccountInfo>` is not satisfied
  --> /Users/esteve/.cargo/git/checkouts/reth-36d3ea1d1152b20c/75b7172/crates/trie/trie/src/state.rs:35:63
   |
35 |                 let hashed_account = account.info.clone().map(Into::into);
   |                                                               ^^^^^^^^^^ the trait `From<revm::revm_primitives::AccountInfo>` is not implemented for `reth_primitives::Account`, which is required by `revm::revm_primitives::AccountInfo: Into<_>`
   |
   = help: the following other types implement trait `From<T>`:
             `reth_primitives::Account` implements `From<&GenesisAccount>`
             `reth_primitives::Account` implements `From<reth_primitives::revm_primitives::AccountInfo>`
   = note: required for `revm::revm_primitives::AccountInfo` to implement `Into<reth_primitives::Account>`

error[E0277]: the trait bound `reth_primitives::Account: From<revm::revm_primitives::AccountInfo>` is not satisfied
  --> /Users/esteve/.cargo/git/checkouts/reth-36d3ea1d1152b20c/75b7172/crates/trie/trie/src/state.rs:62:86
   |
62 |                 let hashed_account = account.account.as_ref().map(|a| a.info.clone().into());
   |                                                                                      ^^^^ the trait `From<revm::revm_primitives::AccountInfo>` is not implemented for `reth_primitives::Account`, which is required by `revm::revm_primitives::AccountInfo: Into<_>`
   |
   = help: the following other types implement trait `From<T>`:
             `reth_primitives::Account` implements `From<&GenesisAccount>`
             `reth_primitives::Account` implements `From<reth_primitives::revm_primitives::AccountInfo>`
   = note: required for `revm::revm_primitives::AccountInfo` to implement `Into<reth_primitives::Account>`
help: consider removing this method call, as the receiver has type `revm::revm_primitives::AccountInfo` and `revm::revm_primitives::AccountInfo: From<revm::revm_primitives::AccountInfo>` trivially holds
   |
62 -                 let hashed_account = account.account.as_ref().map(|a| a.info.clone().into());
62 +                 let hashed_account = account.account.as_ref().map(|a| a.info.into());
   |
enitrat commented 6 days ago

@tcoratger can you help here?

enitrat commented 6 days ago

@azteca1998 I'm unable to repro your issue from a fresh clone.

I have:

❯ rustc --version
rustc 1.81.0 (eeb90cda1 2024-09-04)

Can you try a cargo clean, and avoid running cargo update?

I also synced the repo with the latest cairo-native main, so do a git pull.

edit: after syncing with the latest cairo-native main, I can't deserialize my v1 classes to run the test anymore; investigating what changed.

tcoratger commented 6 days ago

Yeah, probably a versioning conflict (maybe between revm and reth, which uses revm as an underlying dependency), because the conversion mentioned in your error message is definitely available in reth here: https://github.com/paradigmxyz/reth/blob/75b7172cf77eb4fd65fe1a6924f75066fb09fcd1/crates/primitives-traits/src/account.rs#L165

azteca1998 commented 2 days ago

I'm not sure it'll be fixed, but could you try with the latest main? We just merged a fix for a bug that may or may not be related to this issue (it's difficult to tell).

enitrat commented 1 day ago

Nothing changed :(

I am still stuck with this issue:

This used to be fine with 16 MiB before, IIRC.

Did #833 significantly increase stack usage?

azteca1998 commented 1 day ago

It shouldn't have changed the stack memory usage. It could be that it now runs further before crashing, and therefore needs a bit more stack.

I'll keep investigating then, thanks for checking it.

azteca1998 commented 1 day ago

Hello, I've finally managed to fix it. It took some modifications in blockifier to be able to compile and run it, but it ran.

It seems the issue (or the part that was causing the segfault) was related to the bug I fixed the other day, but I forgot to fix it in the syscall handler. @edg-l noticed and fixed it, so right now it doesn't segfault anymore, but still doesn't pass:

path: "../../build/v1/contracts_KakarotCore.contract_class.json"
path: "../../build/v1/contracts_AccountContract.contract_class.json"
path: "../../build/v1/contracts_UninitializedAccount.contract_class.json"
Got uninitialized account's json of length 48297
Got account's json of length 660024
Got kakarot's json of length 14061466
raw json length 48297
class def parsed
native contract
Creating new executor
thread 'stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun' panicked at crates/ef-testing/src/evm_sequencer/sequencer/mod.rs:247:121:
called `Result::unwrap()` on an `Err` value: "not a valid contract class"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
enitrat commented 1 day ago

@azteca1998 is your CAIRO_NATIVE_RUNTIME_LIBRARY available as an environment variable?

also: I don't think you're using the latest branch version (can you git pull?). On the latest branch you would see the reason why the executor cannot be created, and you wouldn't need blockifier changes, I believe (rev b4d9b48a0209ded413eb31438d73a8dfd7b1cd37)

edit: bumped native to latest main; the ef-tests rev is 6afd893c2c7b987a8eeeb90e302f982438f09a69. Checking if it works now.
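
For reference, pointing the build at a locally built runtime looks roughly like this; the path is an assumption, so adjust it to wherever make runtime left the archive:

# hypothetical path; make runtime builds the static runtime library
export CAIRO_NATIVE_RUNTIME_LIBRARY=/path/to/cairo_native/target/release/libcairo_native_runtime.a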

edg-l commented 1 day ago

@azteca1998 is your CAIRO_NATIVE_RUNTIME_LIBRARY available as an environment variable?

also: I don't think you're using the latest branch version (can you git pull?). On the latest branch you would see the reason why the executor cannot be created, and you wouldn't need blockifier changes, I believe (rev b4d9b48a0209ded413eb31438d73a8dfd7b1cd37)

edit: bumped native to latest main; the ef-tests rev is 6afd893c2c7b987a8eeeb90e302f982438f09a69. Checking if it works now.

Just want to note that native main doesn't yet include the fix in https://github.com/lambdaclass/cairo_native/pull/853, so you should try with that.

enitrat commented 1 day ago

Ah, OK. I'll probably wait for it to land on main, as I'm mainly focused on something else at the moment.

azteca1998 commented 1 day ago

I thought it was already merged, but it seems that wasn't the case. Anyway, one of the PRs merged today has fixed the segfault (it isn't failing anymore in main).

enitrat commented 4 hours ago

Checking today on 84ceaa3a0993b322130f6774885a61d911b218f2, no improvements

I'm wondering if it's a side effect of increasing RUST_MIN_STACK to a high value?

azteca1998 commented 3 hours ago

no improvements

What do you mean? It still segfaults for you?

enitrat commented 2 hours ago

Still getting a stack overflow with default/low stack sizes, or a signal: 11, SIGSEGV: invalid memory reference with an increased stack size (see https://github.com/lambdaclass/cairo_native/issues/831#issuecomment-2411925922)

azteca1998 commented 2 hours ago

Strange. Did you cargo clean and rebuild everything (including the runtime)? The stack size shouldn't affect whether it segfaults.
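
For completeness, a full rebuild that also refreshes the runtime, assuming separate cairo_native and ef-tests checkouts:

# in the cairo_native checkout: rebuild the runtime library
make runtime
# in the ef-tests checkout: drop stale artifacts and rerun the failing test
cargo clean
cargo test test_static_ABAcalls0_d1g0v0_Cancun --features v1,native -- --nocapture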