Open enitrat opened 1 month ago
Could you check where is the stack overflowing? Depending on where it is, it could mean different things:
In the first case we should report it to LLVM and wait until it gets solved (or try to solve it ourselves, of course).
The latter case is just a limitation of the current architectures and there's not much more we can do other than just invoking the contract with a bigger stack, but you could end up with gigabytes of stack depending on the program.
Some lore about this: Sierra doesn't have the concept of loops; all loops are implemented as recursive functions. Every iteration of the recursive function will use part of the stack. We implemented a tail recursion to iteration conversion for tail-recursive functions so that we can run loop with more than a few hundred/thousand iterations, but this transformation only works for tail-recursive functions. Non-tail-recursive functions are still implemented normally. More information in src/metadata/tail_recursion.rs
.
A workaround for tests which are known to fail with stack overflow can be to run the test in a separate thread with a custom size stack. Rust provides an API to do that.
@azteca1998 im not super familiar with debugging rust panics like this: what's the usual process to find where the stack is overflowing?
ok - more update on this:
I ran the test with export RUST_MIN_STACK=16777216
to increase the stack size.
Now, it passes.
More details:
this specific test has a lot of recursions, running 1024 nested ethereum calls, and is quite heavy. Thus I think that's why we hit the default rust stack limits.
Oh, I see. If I understood correctly it's not that there's a non-tail-recursive loop but rather that there's a lot of nested contract calls, right? We hadn't seen any crash because of that yet.
This is (more or less) the same case as a stack overflow due to non-tail-recursive loop; that is, we can't do much more than increase the stack size. There is one difference though: we can do something about it. Once rust stabilizes coroutines we could make the runner use the coroutine, which should have its own separate stack instead of the main one. A different approach would be to start them in threads but they're not as cheap.
Oh, I see. If I understood correctly it's not that there's a non-tail-recursive loop but rather that there's a lot of nested contract calls, right? We hadn't seen any crash because of that yet.
Actually, there's only a single Starknet contract call to the Kakarot main contract. However, there are roughly 1024 EVM contract calls executed inside this test, which is perform a lot of operations. So in the end it's just a very heavy computation
Then it might be that the VM's main loop is not tail recursive.
I have a question, if the VM is more resource intensive than Native (thinking on the fact that segments can only grow), shouldn't VM execution crash as well?
Not really. As you said, the VM uses segments which are just Vec<_>
. Every .push(_)
or (safe) operation will check the vector length and act accordingly.
This is not the case in native because we're using the CPU's stack, which is just a pointer. The program doesn't know how much stack it has remaining, and even if we know I'm not sure how we could manipulate the MLIR to check that and avoid crashes.
What could work is hooking into the SIGSEGV
process signal and "guess" if it was a stack overflow or a different cause for the segfault, then try to recover the execution from there. I think I tried something like that once for a personal project, but that's definitely 100% undefined behavior which we do not want in native.
Repro: scarb 2.8.3 clone https://github.com/kkrt-labs/kakarot-ssj on main
scarb build -p contracts
- builds the required contract with some debug print logsclone https://github.com/kkrt-labs/ef-tests ensure correct branch
git checkout feat/cairo-native
make
make setup-kakarot
copy the contracts built previously:cp ../kakarot-ssj/target/dev/contracts_* build/v1
cargo test test_baseFeeDiffPlaces_d34g0v0_Cancun --features v1,native -- --nocapture
output: