lambdaclass / cairo_native

A compiler to convert Cairo's intermediate representation "Sierra" code to MLIR.
https://lambdaclass.github.io/cairo_native/cairo_native
Apache License 2.0
121 stars 43 forks source link

bug: invalid memory reference #870

Open enitrat opened 1 month ago

enitrat commented 1 month ago

Where are you using Cairo Native Kakarot EF-Test suite. Running the 20k ethereum tests using blockifier + cairo native + kakarot zkevm

Version last main

Describe the bug

* thread #2, name = 'stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f5c6bf0)
  * frame #0: 0x00000001473975a0 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13513 + 36
    frame #1: 0x000000014736798c 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13487 + 13692
    frame #2: 0x0000000147331a8c 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13450 + 23400
    frame #3: 0x0000000147a1f600 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13635 + 40296b
...
...
    frame #103: 0x00000001472fc398 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13425 + 4520
    frame #104: 0x00000001472c4248 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13402 + 71760
    frame #105: 0x00000001472a284c 2170791277312818364151319127822404238116644699931524383412296455457210549381`___lldb_unnamed_symbol13382 + 14772
    frame #106: 0x0000000147291e88 2170791277312818364151319127822404238116644699931524383412296455457210549381`_mlir_ciface_f22 + 5372
    frame #107: 0x000000010076d66c stStaticCall-a00b9a1f5b0b2270`invoke_trampoline + 108
    frame #108: 0x00000001004a3dd8 stStaticCall-a00b9a1f5b0b2270`blockifier::execution::native::utils::run_native_executor::hd53e73f315696524(native_executor=0x00006000021c10f0, function_id=0x000000012c823040, call=<unavailable>, syscall_handler=<unavailable>) at utils.rs:45:28
    frame #109: 0x000000010046598c stStaticCall-a00b9a1f5b0b2270`blockifier::execution::native::entry_point_execution::execute_entry_point_call::h0820f83788dc89ed(call=CallEntryPoint @ 0x000000016ffdfc38, contract_class=NativeContractClassV1 @ 0x000000016ffdef38, state=&mut dyn blockifier::state::state_api::State @ 0x000000016ffdf6d8, resources=0x000000016fff0538, context=0x000000016ffef588) at entry_point_execution.rs:33:18
    frame #110: 0x000000010059f768 stStaticCall-a00b9a1f5b0b2270`blockifier::execution::execution_utils::execute_entry_point_call::hba155b0ecfc46347(call=CallEntryPoint @ 0x000000016ffe0938, contract_class=ContractClass @ 0x000000016ffdf7d0, state=&mut dyn blockifier::state::state_api::State @ 0x000000016ffe0158, resources=0x000000016fff0538, context=0x000000016ffef588) at execution_utils.rs:78:19
    frame #111: 0x000000010049e5d0 stStaticCall-a00b9a1f5b0b2270`blockifier::execution::entry_point::CallEntryPoint::execute::h8763f8070c7f5197(self=CallEntryPoint @ 0x000000016ffe19a0, state=&mut dyn blockifier::state::state_api::State @ 0x000000016ffe0a08, resources=0x000000016fff0538, context=0x000000016ffef588) at entry_point.rs:103:9

To Reproduce

Clone https://github.com/kkrt-labs/ef-tests Checkout dev/bump-native make && make setup-kakarot Make sure CAIRO_NATIVE_RUNTIME_LIBRARY is a defined environment variable run cargo test test_static_ABAcalls0_d1g0v0_Cancun --features v1,native -- --nocapture this should error with (signal: 11, SIGSEGV: invalid memory reference)

The backtrace is from lldb: lldb path/to/test platform settings -w /path/to/ef-tests/crates/ef-testing run

Desktop (please complete the following information):

edg-l commented 1 month ago

I tried to build this on linux x86_64 but i get a build error

error[E0412]: cannot find type `integer_t` in crate `libc`
   --> crates/ef-testing/src/models/result.rs:120:36
    |
120 |             / mem::size_of::<libc::integer_t>() as libc::c_uint;
    |                                    ^^^^^^^^^ help: a type alias with a similar name exists: `intptr_t`
    |
   ::: /home/edgar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libc-0.2.159/src/unix/mod.rs:21:1
    |
21  | pub type intptr_t = isize;
    | ----------------- similarly named type alias `intptr_t` defined here

warning: unused imports: `ACCOUNT_CONTRACT_CLASS`, `KAKAROT_CLASS`, and `UNINITIALIZED_ACCOUNT_CLASS`
  --> crates/ef-testing/src/evm_sequencer/sequencer/mod.rs:12:9
   |
12 |         ACCOUNT_CONTRACT_CLASS, ACCOUNT_CONTRACT_CLASS_HASH, BLOCK_GAS_LIMIT,
   |         ^^^^^^^^^^^^^^^^^^^^^^
13 |         ETH_FEE_TOKEN_ADDRESS, FEE_TOKEN_CLASS, FEE_TOKEN_CLASS_HASH, KAKAROT_ADDRESS,
14 |         KAKAROT_CLASS, KAKAROT_CLASS_HASH, KAKAROT_OWNER_ADDRESS, OPENZEPPELIN_ACCOUNT_CLASS,
   |         ^^^^^^^^^^^^^
15 |         OPENZEPPELIN_ACCOUNT_CLASS_HASH, RELAYER_ADDRESS, RELAYER_BALANCE, RELAYER_VERIFYING_KEY,
16 |         STRK_FEE_TOKEN_ADDRESS, UNINITIALIZED_ACCOUNT_CLASS, UNINITIALIZED_ACCOUNT_CLASS_HASH,
   |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(unused_imports)]` on by default

warning: unused import: `std::ptr`
  --> crates/ef-testing/src/models/result.rs:82:5
   |
82 | use std::ptr;
   |     ^^^^^^^^

For more information about this error, try `rustc --explain E0412`.
warning: `ef-testing` (lib) generated 2 warnings
error: could not compile `ef-testing` (lib) due to 1 previous error; 2 warnings emitted
warning: build failed, waiting for other jobs to finish...
enitrat commented 1 month ago

I tried to build this on linux x86_64 but i get a build error

Must be a platform-specific thing:

can you comment line 151 and the definition of foo line 108? I used this to get memory usage logs after each test to identify families that caused memory leaks.

azteca1998 commented 1 month ago

The make steup-kakarot step is failing for me. Is this expected?

ef-tests$ make setup-kakarot
rm -rf build/common
mkdir -p build/common
rm -rf build/v0
mkdir -p build/v0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8651  100  8651    0     0  25596      0 --:--:-- --:--:-- --:--:-- 25670
unzip -o kakarot-build.zip -d build/v0
Archive:  kakarot-build.zip
  inflating: build/v0/contracts_MockContractUpgradeableV0.compiled_contract_class.json
  inflating: build/v0/contracts_UninitializedAccount.contract_class.json
  inflating: build/v0/contracts_AccountContract.contract_class.json
  inflating: build/v0/contracts_UninitializedAccount.compiled_contract_class.json
  inflating: build/v0/contracts.starknet_artifacts.json
  inflating: build/v0/contracts_KakarotCore.compiled_contract_class.json
  inflating: build/v0/contracts_MockContractUpgradeableV1.contract_class.json
  inflating: build/v0/contracts_ERC20.compiled_contract_class.json
  inflating: build/v0/contracts_ERC20.contract_class.json
  inflating: build/v0/contracts_Cairo1HelpersFixture.compiled_contract_class.json
  inflating: build/v0/contracts_AccountContract.compiled_contract_class.json
  inflating: build/v0/contracts_KakarotCore.contract_class.json
  inflating: build/v0/contracts_MockContractUpgradeableV0.contract_class.json
  inflating: build/v0/contracts_Cairo1Helpers.contract_class.json
  inflating: build/v0/contracts.sierra.json
  inflating: build/v0/contracts_Cairo1Helpers.compiled_contract_class.json
  inflating: build/v0/contracts_Cairo1HelpersFixture.contract_class.json
  inflating: build/v0/contracts_MockContractUpgradeableV1.compiled_contract_class.json
mv build/v0/build/* build/v0
mv: rename build/v0/build/* to build/v0/*: No such file or directory
make: *** [setup-kakarot-v0] Error 1

This seems to cause the precompiles to not be available when I try to test it:

    Finished `test` profile [unoptimized + debuginfo] target(s) in 18.76s
     Running tests/stStaticCall.rs (target/debug/deps/stStaticCall-022d7fbb57a5b4fd)

running 1 test
thread 'stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun' panicked at crates/ef-testing/src/evm_sequencer/constants.rs:66:120:
Failed to load precompiles contract class: No such file or directory (os error 2)

Location:
    crates/ef-testing/src/evm_sequencer/constants.rs:22:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun ... FAILED

failures:

failures:
    stStaticCall::test_static_ABAcalls0_d1g0v0_Cancun

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 477 filtered out; finished in 0.10s

error: test failed, to rerun pass `-p ef-testing --test stStaticCall`
enitrat commented 1 month ago

Something broke with our new release yesterday. Will check and update

enitrat commented 3 weeks ago

Hi @azteca1998 @edg-l you should be able to repro again. I updated native to ab478323d6aee5e0424712bbde98de443b8cc72f

I also hid the ram usage prints under a macos target, so that you should be able to run it on other targets fine.

azteca1998 commented 5 days ago

Hello, sorry for the delay.

We've been working on various stuff in native which made it so that ef-tests no longer compiles using the main branch of the sequencer and native's main branches. Could you please update it? Some of the recent changes may be involved in the bug.

enitrat commented 4 days ago

Hi @azteca1998 i updated the ef-test repo to commit 28caad05e4a13f935acff83ea144df0b1bc24e2e with native 5e60089288c461eca98bf3dbe03cc882778ff677. The issue still occurs

azteca1998 commented 4 days ago

Ok, I think I found the problem. It's the same as #816.

If I run the test with lldb and compare stack pointers I get the following:

Stack pointer (first frame): 0x00007ffff7a1da80
Stack pointer (crash frame): 0x00007ffff780bce0
Crash pointer              : 0x00007ffff780d138

The first number is the value of rsp (stack pointer) when the program started. The second is its value just before the crash. The last number is already outside the allocated stack size, therefore crashing the process.

If I run the test with a larger stack (1GiB in my test, probably works with way less stack) the test seems to run without problems. You can make the tests run with a larger stack by using this environment variable: RUST_MIN_STACK=1073741824.

enitrat commented 3 days ago

Ok, great, at the time I opeoned the issue increasing the stack size did not have any effect, which no longer seems the be the case. However, the current run is taking more than 2hours, up from 50mins before.

https://github.com/kkrt-labs/kakarot-ssj/actions/runs/11947735661/job/33304216235?pr=1011

I will need to check why

azteca1998 commented 3 days ago

Could it be that the test that crashed was run at about the 50 minutes mark, therefore stopping all other tests, while now all tests have to be run, taking more than 2 hours?

enitrat commented 3 days ago

No, previously all 20k tests were passing in ~50mins (20mins compilation, 30mins runtime). I added some more debug info to pinpoint problem is exactly, will update you when I have results

enitrat commented 3 days ago

2024-11-21T13:08:33.3640356Z Test 'modexp_modsize0_returndatasize_d4g0v0_Cancun' completed in 2942.0 seconds 2024-11-21T13:08:33.3641902Z test stReturnDataTest::stReturnDataTest::test_modexp_modsize0_returndatasize_d4g0v0_Cancun ... ok

Some tests take 45 mins to 1.5hrs to complete a single, easy test. Do you have any idea how to debug why this is happening? I'm not sure what to look for.

test_modexp_modsize0_returndatasize_d4g0v0_Cancun

That specific instance has a lot of felt252 dict operations because it's very unoptimized. I'll add some prints directly in the cairo code to see what happens in realtime, but that's a first thing we could be looking at.

azteca1998 commented 3 days ago

Not sure if that's the case, especially since it didn't happen earlier, but dictionaries are implemented in Rust within the runtime. When running the JIT, the runtime used is the dependency of cairo-native, which for debug builds is built using the debug profile. Could you try running the test in release mode?

When running in AOT, the runtime used is the static library. If the tests are running in AOT, could you check if the linked library (the path in the env var) points to a release build?

enitrat commented 3 days ago

JIT is new right? I think i'm running in AOT because the integration in Native uses AotExecutor. What do you think I should be using for this specific usecase ?

I built the runtime running make runtime which builds in release mode, so it's a release build I believe.

azteca1998 commented 3 days ago

JIT is the original executor, AOT came later, but I think you're probably using AOT. If the runtime uses the release build then I'm not sure what could be causing the performance issues.

What do you think I should be using for this specific usecase ?

Both are fine. Internally, their interface is the same, so whatever works on one should work on the other. The sequencer is using AOT.

enitrat commented 3 days ago
...
Start: Executing Modexp Precompile
Current test duration: 803.0s
...
WARNING: Test 'modexp_modsize0_returndatasize_d4g0v0_Cancun' has been running for over 120 seconds
Current duration: 1516.0s
End: MPNAT::from_big_endian operation
Start: modpow operation
End: modpow operation
End: Executing Modexp Precompile

There seems to be an issue with the operations performed in MPNAT::from_big_endian, executed twice, taking ~ 10 mins so 5mins each.

Here's a link to the function https://github.com/kkrt-labs/kakarot-ssj/blob/c2e4ccbfcd8d6b8d79668f34daab97080d6769f9/crates/utils/src/crypto/modexp/mpnat.cairo#L38-L106

It involves a lot of dict operations. Do you want me to add more granularity?

azteca1998 commented 3 days ago

Can you provide the bytes argument? This way we'll be able to replicate it exactly.

enitrat commented 3 days ago

Can you provide the bytes argument? This way we'll be able to replicate it exactly.

Hm so actually in that specific case I checked and we're sending 999188 bytes to the function. It's quite a lot, I can't really send it over to you.

edit: I opened old logs of using native to find out that this test actually used to pass when using native ae17dd370a7bbf6affeefb9fa6954965e8b52239 🤔 so perhaps the big number of bytes is not that much of an issue and it should pass regardless?

see https://github.com/kkrt-labs/kakarot-ssj/actions/runs/11810716802/job/32903209209 from PR https://github.com/kkrt-labs/kakarot-ssj/pull/1021 (where the version of native is ae17dd370a7bbf6affeefb9fa6954965e8b52239): download the logs, open 2_ef-tests _ ef-tests.txt, search for modexp_modsize0_returndatasize_d4g0v0_Cancun and you will see it passed

enitrat commented 2 days ago

Just re-ran on my computer the stack using cairo-native ae17dd370a7bbf6affeefb9fa6954965e8b52239

WARNING: Test 'modexp_modsize0_returndatasize_d4g0v0_Cancun' has been running for over 120 seconds
Current duration: 834.0s
... some additional operations ..

 INFO ef_testing::models::result: stReturnDataTest::modexp_modsize0_returndatasize_d4g0v0_Cancun passed: TransactionResources { starknet_resources: StarknetResources { calldata_length: 155, state_changes_for_fee: StateChangesCount { n_storage_updates: 3, n_class_hash_updates: 0, n_compiled_class_hash_updates: 0, n_modified_contracts: 3 }, message_cost_info: MessageL1CostInfo { l2_to_l1_payload_lengths: [], message_segment_length: 0 }, l1_handler_payload_size: None, n_events: 1, signature_length: 2, code_size: 0, total_event_keys: 1, total_event_data_size: 3 }, vm_resources: ExecutionResources { n_steps: 7527, n_memory_holes: 53, builtin_instance_counter: {ecdsa: 1, range_check: 117, pedersen: 173} }, n_reverted_steps: 0 }
Current memory usage: Ok(4963549184) bytes

Test 'modexp_modsize0_returndatasize_d4g0v0_Cancun' completed in 865.0 seconds
test stReturnDataTest::test_modexp_modsize0_returndatasize_d4g0v0_Cancun ... ok

So that's like ~ 10 seconds for the actual test run. Compared to infinite time now 🤔

To summarize:

The test takes around 10 mins to run, and other tests also have extremely degraded performance.

Let me know how I could help you more.

Regarding the bytes passed to the function:

bytes.len(): bytes.len(): 999188
bytes slice 128: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bytes slice last 128: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I think it's an array of size 999188 whose first element is 1 and the rest is 0.

azteca1998 commented 2 days ago

It seems that between the previous Cairo native version and the current one we've added proper memory management (now we deallocate stuff properly). That may be part of the performance issue. Here are the commits I suspect will have more impact in the program's performance:

79a8dcf - Fix clone and drop implementations for enums, structs and snapshots.
0b73f1c - Fix `array_get`'s drop implementation.
cff7a11 - Update to 2.8.4, add release docs, version alpha.3
82c25b3 - Add malloc tracing and fix more memory leaks.

If you're going to test them, please keep in mind it may start crashing again since there have been some bug fixes after those commits. I recommend testing using a test that was known to pass but still has had a performance regression.

enitrat commented 2 days ago

Would there be another way to check what's taking so long? testing each one of these commits manually sounds tedious because I have to re-adapt the sequencer impl every time

azteca1998 commented 2 days ago

Yes, we're working (since about an hour ago) in a tool to profile programs a bit better. I'm not sure if it'll be useful at first because we're more interested in the time taken by cairo vs the syscall handler, but it should be adaptable to this use case later on.