Open robert-zaremba opened 11 months ago
Thanks for reporting this issue @robert-zaremba
Can you please also provide the smart contact code that triggers this problem?
This is the part where it fails in this contract.
I can confirm this is a unintended change in the runtime behavior.
I don't know the exact details, yet. But here is a branch that shows that wasmer2 (what we had before the upgrade) and near-vm (what we use since the latest upgrade) behave differently.
https://github.com/jakmeier/nearcore/tree/reproduce_9393
Everything is contained in this commit: https://github.com/jakmeier/nearcore/commit/1874b9dd3a8e834343ab76a1f4f664fc98ccb7e0
command to run test:
cargo test --package near-vm-runner --lib -- tests::runtime_errors::test_i_am_human_arithmetic_trap --exact --nocapture
result:
thread 'tests::runtime_errors::test_i_am_human_arithmetic_trap' panicked at 'Inconsistent VM Output:
NearVm:
VMOutcome: balance 8000000000000000000002 storage_usage 12 return data None burnt gas 712380595914 used gas 712380595914
Err: WebAssembly trap: An arithmetic exception, e.g. divided by zero.
Wasmer2:
VMOutcome: balance 8000000000000000000002 storage_usage 12 return data None burnt gas 7543360465350 used gas 7543360465350
Err: Smart contract panicked: signature error: invalid signature
I will search further to find out the root cause...
@jakmeier would be great to assure that workspace-rs
uses exactly the same runtime.
Could it be that you were testing this on a arm-based mac? We don't use the same runtime in production and in developer mode on a mac because our production compiler backend doesn't support arm.
I have a PoC fix ready, see this commit: https://github.com/jakmeier/nearcore/commit/1c6863e57d10abf92ccb8a166fac9bde6ff6796a
The good news upfront: This is not a security critical problem, it can't be abused. There are just some unlucky contracts that will always fail to execute some branches.
The fix is in x86 code generation for gas accounting. The trap is hit here when we overflow the addition: https://github.com/near/nearcore/blob/4b878ecc484cf498f002ef281482765d493542d1/runtime/near-vm/compiler-singlepass/src/codegen_x64.rs#L396-L397
However, the input values were small enough that they shouldn't overflow. But the cost_location
(type Imm32(u32)
) was around 3Ggas, which is larger than i32::MAX
.
So, I think what happens is that when we write this immediate value into x86 code, it just writes it as a 32bit value. But for the addition, we specifically use 64bit addition, so the immediate value gets sign-extended into a 64-bit integer. That would make sense if it was a signed integer, but for for an unsigned integer it makes no sense at all, so we end up with a huge number with the 32 most significant bits all set to 1.
To test my theory I added a condition that emits 3 smaller adds if the immediate has the most significant bit set. This way, sign extending doesn't matter. And voilà, it no longer hits the trap but still computes the correct gas cost!
I hope there is a better way to fix it. Maybe some simple way to avoid sign extending the 32bit value into a 64bit operand? My x86 knowledge never was that great and is definitely rusty these days, so not sure what our options are, or even if my reasoning here is correct.
Let's discuss this next week @nagisa, @Ekleog-NEAR, @akhi3030
@robert-zaremba I'm really sorry that you are hitting this problem. We messed this one up, our testing should have caught it. Unfortunately, it didn't. And it was only reported to us when it was already in mainnet, so our agility is very much limited at this point.
We will come up with a solution that fixes it in mainnet as soon as safely possible. I hope you understand that rushing out a fix in our compiler-backend right as the weekend starts is not a very safe option and it would put all other users at risk of potential new bugs. The safe timline will need to be discussed with the team, which is OOO today except for myself. I will post an ETA here when we I have it.
Please let me know if you need help to come up with a workaround in your smart contract to avoid the problem. That's probably the fastest way to get the NDC onboarding flow going again, if you can't wait for the fix to land in mainnet.
I hope there is a better way to fix it. Maybe some simple way to avoid sign extending the 32bit value into a 64bit operand? My x86 knowledge never was that great and is definitely rusty these days, so not sure what our options are, or even if my reasoning here is correct.
Hm, so in finite-wasm the gas parameter is expected to be a 64 bit integer in the first place so the root cause is probably
Simply replacing u32::try_from
with i32::try_from
would be a sufficient fix (plus whatever casts are necessary to stuff the resulting value into Imm32
), I believe.
Strongly agreed with regards to not releasing fixes on friday, though, especially as this would require a protocol version bump.
Thanks @jakmeier for quickly handling the problem. I think every hashing can hit the trap at some point. We will be testing one workaround today, but it would be good to solve it in the runtime ASAP.
Simply replacing u32::try_from with i32::try_from would be a sufficient fix
@nagisa why downcasting u64
value into i32
value is better than into u32
value? This is counter intuitive. If the value is too big, then with i32
we end up with a negative value, which could be a source of more serious exploits (long running transactions, maybe even infinite transactions).
@nagisa Ahh, I see. Yes, your fix is much more reasonably than what I produced :D I just finished reviewing your PR, I think we can move forward with it asap.
@robert-zaremba Since this requires a protocol upgrade, it will take a couple of days at least. Maybe by the end of the week (Sunday), we could have it effective in mainnet. But that assumes release engineers can pick it up quickly and that testing runs through smoothly, plus that validators are quick to upgrade their binaries.
I think every hashing can hit the trap at some point. We will be testing one workaround today, but it would be good to solve it in the runtime ASAP.
Hashing in the host function (in ed25519_verify
, or even just direct calls to sha256
) cannot trigger this, the gas accounting works differently there.
why downcasting u64 value into i32 value is better than into u32 value?
This isn't C, this is Rust :) If the value is too big, it will return an error, which we handle properly. The old code tried to cast to an u32
which has a range of numbers where it technically succeeds (Rust returns Ok(_)
) but the generated code is invalid.
@jakmeier wasm doesn't support unsigned integers?
@jakmeier wasm doesn't support unsigned integers?
What WASM supports or doesn't support is completely irrelevant here. The conversion from u64
to u32
or i32
in Rust is independent of the target. Also, the target here is x86, since the code doing the conversion is the compiler itself, not something that happens inside the VM runtime. The error happens at runtime, yes, but only because the compiler generated bad code.
On a side note, I would say the error "WebAssembly trap"
is somewhat misleading, too. It's not a trap in the user provided WASM code. But actually in the x86 code that we inserted between the user's code. There is no line in the original WASM that we could point to where this is caused. In fact, I would say the jump to the exit code taken in this scenario can only ever happen if our runtime has bugs, so it should probably say something like "Gas accounting error" instead.
@robert-zaremba did you manage to get a workaround going?
Since this is not a vulnerability, the normal flow to deploy this patch would be to have it included in the next testnet release and have it run for a month there before we let it loose on mainnet. But this would put the ETA more like 5-6 weeks from now.
@jakmeier not really. I've checked:
ed25519_dalek
(the original lib - hits the trap)near_crypto
(both ed25519 and secp256k1).The compilation issue is related to the wasm support (random normally is not allowed in smart contracts):
error: the wasm*-unknown-unknown targets are not supported by default, you may need to enable the "js" feature. For more information see: https://docs.rs/getrandom/#webassembly-support
--> /home/robert/.cargo/registry/src/index.crates.io-6f17d22bba15001f/getrandom-0.2.10/src/lib.rs:285:9
|
285 | / compile_error!("the wasm*-unknown-unknown targets are not supported by \
286 | | default, you may need to enable the \"js\" feature. \
287 | | For more information see: \
288 | | https://docs.rs/getrandom/#webassembly-support");
| |________________________________________________________________________^
error[E0433]: failed to resolve: use of undeclared crate or module `imp`
--> /home/robert/.cargo/registry/src/index.crates.io-6f17d22bba15001f/getrandom-0.2.10/src/lib.rs:341:9
Note that 5-6 weeks is not acceptable for us :(
The bot automatically closed this. I'm reopening it till we find a solution for @robert-zaremba.
There was quite a bit of a Zulip discussion regarding a workaround for your particular case of hitting this problem. TLDR, until the SDK has a release that includes ed25519_verify, you can copy paste this into your code and get the exactly same functionality as the SDK will provide.
mod sys {
extern "C" {
pub fn ed25519_verify(
sig_len: u64,
sig_ptr: u64,
msg_len: u64,
msg_ptr: u64,
pub_key_len: u64,
pub_key_ptr: u64,
) -> u64;
}
}
pub fn ed25519_verify(signature: &[u8; 64], message: &[u8], public_key: &[u8; 32]) -> bool {
unsafe {
sys::ed25519_verify(
signature.len() as _,
signature.as_ptr() as _,
message.len() as _,
message.as_ptr() as _,
public_key.len() as _,
public_key.as_ptr() as _,
) == 1
}
}
Describe the bug
We experienced failures in the wasm runtiem, which can be related to the recent mainnet upgrade. The particular exception happens, in our case, during ed25519 signature verification. The part which panics is during the hash computation: https://github.com/dalek-cryptography/ed25519-dalek/blob/1.0.1/src/public.rs#L344 Exception:
It was reported that it was in the testnet for more than a week.
Bug history
In the summary, to repeat what is the latest knowledge of the issue:
Dates of the issue when they first appeared
Expected behavior
It worked in during the past months, and now it panics. It should just work.
Example transaction https://explorer.near.org/transactions/7LrhUybXfyD5CXHRahyNgPCW3qw8NWUuxQwXH87fxL2
Version (please complete the following information):