lambdaclass / cairo_native

A compiler to convert Cairo's intermediate representation "Sierra" code to MLIR.
https://lambdaclass.github.io/cairo_native/cairo_native
Apache License 2.0
121 stars 43 forks source link

bug: thread has overflowed its stack #816

Open enitrat opened 2 months ago

enitrat commented 2 months ago

Repro: scarb 2.8.3 clone https://github.com/kkrt-labs/kakarot-ssj on main scarb build -p contracts - builds the required contract with some debug print logs

clone https://github.com/kkrt-labs/ef-tests ensure correct branch git checkout feat/cairo-native make make setup-kakarot copy the contracts built previously:cp ../kakarot-ssj/target/dev/contracts_* build/v1 cargo test test_baseFeeDiffPlaces_d34g0v0_Cancun --features v1,native -- --nocapture

output:

thread 'stEIP1559::test_baseFeeDiffPlaces_d34g0v0_Cancun' has overflowed its stack
fatal runtime error: stack overflow
azteca1998 commented 2 months ago

Could you check where is the stack overflowing? Depending on where it is, it could mean different things:

In the first case we should report it to LLVM and wait until it gets solved (or try to solve it ourselves, of course).

The latter case is just a limitation of the current architectures and there's not much more we can do other than just invoking the contract with a bigger stack, but you could end up with gigabytes of stack depending on the program.

Some lore about this: Sierra doesn't have the concept of loops; all loops are implemented as recursive functions. Every iteration of the recursive function will use part of the stack. We implemented a tail recursion to iteration conversion for tail-recursive functions so that we can run loop with more than a few hundred/thousand iterations, but this transformation only works for tail-recursive functions. Non-tail-recursive functions are still implemented normally. More information in src/metadata/tail_recursion.rs.

A workaround for tests which are known to fail with stack overflow can be to run the test in a separate thread with a custom size stack. Rust provides an API to do that.

enitrat commented 2 months ago

@azteca1998 im not super familiar with debugging rust panics like this: what's the usual process to find where the stack is overflowing?

enitrat commented 2 months ago

ok - more update on this:

I ran the test with export RUST_MIN_STACK=16777216 to increase the stack size. Now, it passes.

More details:

this specific test has a lot of recursions, running 1024 nested ethereum calls, and is quite heavy. Thus I think that's why we hit the default rust stack limits.

azteca1998 commented 2 months ago

Oh, I see. If I understood correctly it's not that there's a non-tail-recursive loop but rather that there's a lot of nested contract calls, right? We hadn't seen any crash because of that yet.

This is (more or less) the same case as a stack overflow due to non-tail-recursive loop; that is, we can't do much more than increase the stack size. There is one difference though: we can do something about it. Once rust stabilizes coroutines we could make the runner use the coroutine, which should have its own separate stack instead of the main one. A different approach would be to start them in threads but they're not as cheap.

enitrat commented 2 months ago

Oh, I see. If I understood correctly it's not that there's a non-tail-recursive loop but rather that there's a lot of nested contract calls, right? We hadn't seen any crash because of that yet.

Actually, there's only a single Starknet contract call to the Kakarot main contract. However, there are roughly 1024 EVM contract calls executed inside this test, which is perform a lot of operations. So in the end it's just a very heavy computation

azteca1998 commented 2 months ago

Then it might be that the VM's main loop is not tail recursive.

rodrigo-pino commented 2 months ago

I have a question, if the VM is more resource intensive than Native (thinking on the fact that segments can only grow), shouldn't VM execution crash as well?

azteca1998 commented 2 months ago

Not really. As you said, the VM uses segments which are just Vec<_>. Every .push(_) or (safe) operation will check the vector length and act accordingly.

This is not the case in native because we're using the CPU's stack, which is just a pointer. The program doesn't know how much stack it has remaining, and even if we know I'm not sure how we could manipulate the MLIR to check that and avoid crashes.

What could work is hooking into the SIGSEGV process signal and "guess" if it was a stack overflow or a different cause for the segfault, then try to recover the execution from there. I think I tried something like that once for a personal project, but that's definitely 100% undefined behavior which we do not want in native.