near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0
2.3k stars 601 forks source link

WebAssembly trap: An arithmetic exception after mainnet and testnet upgrade #9393

Open robert-zaremba opened 11 months ago

robert-zaremba commented 11 months ago

Describe the bug

We experienced failures in the wasm runtiem, which can be related to the recent mainnet upgrade. The particular exception happens, in our case, during ed25519 signature verification. The part which panics is during the hash computation: https://github.com/dalek-cryptography/ed25519-dalek/blob/1.0.1/src/public.rs#L344 Exception:

ExecutionError: 'WebAssembly trap: An arithmetic exception, e.g. divided by zero.'

It was reported that it was in the testnet for more than a week.

Bug history

In the summary, to repeat what is the latest knowledge of the issue:

Dates of the issue when they first appeared

Expected behavior

It worked in during the past months, and now it panics. It should just work.

Example transaction https://explorer.near.org/transactions/7LrhUybXfyD5CXHRahyNgPCW3qw8NWUuxQwXH87fxL2

Version (please complete the following information):

jakmeier commented 11 months ago

Thanks for reporting this issue @robert-zaremba

Can you please also provide the smart contact code that triggers this problem?

amityadav0 commented 11 months ago

https://github.com/near-ndc/i-am-human/blob/6fa2bc56ab77cb82b39139fb170ed56ad4f75e24/contracts/oracle/src/lib.rs#L138

This is the part where it fails in this contract.

jakmeier commented 11 months ago

I can confirm this is a unintended change in the runtime behavior.

I don't know the exact details, yet. But here is a branch that shows that wasmer2 (what we had before the upgrade) and near-vm (what we use since the latest upgrade) behave differently.

https://github.com/jakmeier/nearcore/tree/reproduce_9393

Everything is contained in this commit: https://github.com/jakmeier/nearcore/commit/1874b9dd3a8e834343ab76a1f4f664fc98ccb7e0

command to run test:

cargo test --package near-vm-runner --lib -- tests::runtime_errors::test_i_am_human_arithmetic_trap --exact --nocapture

result:

thread 'tests::runtime_errors::test_i_am_human_arithmetic_trap' panicked at 'Inconsistent VM Output:
NearVm:
VMOutcome: balance 8000000000000000000002 storage_usage 12 return data None burnt gas 712380595914 used gas 712380595914
Err: WebAssembly trap: An arithmetic exception, e.g. divided by zero.

Wasmer2:
VMOutcome: balance 8000000000000000000002 storage_usage 12 return data None burnt gas 7543360465350 used gas 7543360465350
Err: Smart contract panicked: signature error: invalid signature

I will search further to find out the root cause...

robert-zaremba commented 11 months ago

@jakmeier would be great to assure that workspace-rs uses exactly the same runtime.

jakmeier commented 11 months ago

Could it be that you were testing this on a arm-based mac? We don't use the same runtime in production and in developer mode on a mac because our production compiler backend doesn't support arm.

jakmeier commented 11 months ago

I have a PoC fix ready, see this commit: https://github.com/jakmeier/nearcore/commit/1c6863e57d10abf92ccb8a166fac9bde6ff6796a

The good news upfront: This is not a security critical problem, it can't be abused. There are just some unlucky contracts that will always fail to execute some branches.

The fix is in x86 code generation for gas accounting. The trap is hit here when we overflow the addition: https://github.com/near/nearcore/blob/4b878ecc484cf498f002ef281482765d493542d1/runtime/near-vm/compiler-singlepass/src/codegen_x64.rs#L396-L397

However, the input values were small enough that they shouldn't overflow. But the cost_location (type Imm32(u32)) was around 3Ggas, which is larger than i32::MAX.

So, I think what happens is that when we write this immediate value into x86 code, it just writes it as a 32bit value. But for the addition, we specifically use 64bit addition, so the immediate value gets sign-extended into a 64-bit integer. That would make sense if it was a signed integer, but for for an unsigned integer it makes no sense at all, so we end up with a huge number with the 32 most significant bits all set to 1.

To test my theory I added a condition that emits 3 smaller adds if the immediate has the most significant bit set. This way, sign extending doesn't matter. And voilà, it no longer hits the trap but still computes the correct gas cost!

I hope there is a better way to fix it. Maybe some simple way to avoid sign extending the 32bit value into a 64bit operand? My x86 knowledge never was that great and is definitely rusty these days, so not sure what our options are, or even if my reasoning here is correct.

Let's discuss this next week @nagisa, @Ekleog-NEAR, @akhi3030

jakmeier commented 11 months ago

@robert-zaremba I'm really sorry that you are hitting this problem. We messed this one up, our testing should have caught it. Unfortunately, it didn't. And it was only reported to us when it was already in mainnet, so our agility is very much limited at this point.

We will come up with a solution that fixes it in mainnet as soon as safely possible. I hope you understand that rushing out a fix in our compiler-backend right as the weekend starts is not a very safe option and it would put all other users at risk of potential new bugs. The safe timline will need to be discussed with the team, which is OOO today except for myself. I will post an ETA here when we I have it.

Please let me know if you need help to come up with a workaround in your smart contract to avoid the problem. That's probably the fastest way to get the NDC onboarding flow going again, if you can't wait for the fix to land in mainnet.

nagisa commented 11 months ago

I hope there is a better way to fix it. Maybe some simple way to avoid sign extending the 32bit value into a 64bit operand? My x86 knowledge never was that great and is definitely rusty these days, so not sure what our options are, or even if my reasoning here is correct.

Hm, so in finite-wasm the gas parameter is expected to be a 64 bit integer in the first place so the root cause is probably

https://github.com/near/nearcore/blob/4b878ecc484cf498f002ef281482765d493542d1/runtime/near-vm/compiler-singlepass/src/codegen_x64.rs#L351-L360

Simply replacing u32::try_from with i32::try_from would be a sufficient fix (plus whatever casts are necessary to stuff the resulting value into Imm32), I believe.

Strongly agreed with regards to not releasing fixes on friday, though, especially as this would require a protocol version bump.

robert-zaremba commented 11 months ago

Thanks @jakmeier for quickly handling the problem. I think every hashing can hit the trap at some point. We will be testing one workaround today, but it would be good to solve it in the runtime ASAP.

robert-zaremba commented 11 months ago

Simply replacing u32::try_from with i32::try_from would be a sufficient fix

@nagisa why downcasting u64 value into i32 value is better than into u32 value? This is counter intuitive. If the value is too big, then with i32 we end up with a negative value, which could be a source of more serious exploits (long running transactions, maybe even infinite transactions).

jakmeier commented 11 months ago

@nagisa Ahh, I see. Yes, your fix is much more reasonably than what I produced :D I just finished reviewing your PR, I think we can move forward with it asap.

@robert-zaremba Since this requires a protocol upgrade, it will take a couple of days at least. Maybe by the end of the week (Sunday), we could have it effective in mainnet. But that assumes release engineers can pick it up quickly and that testing runs through smoothly, plus that validators are quick to upgrade their binaries.

I think every hashing can hit the trap at some point. We will be testing one workaround today, but it would be good to solve it in the runtime ASAP.

Hashing in the host function (in ed25519_verify, or even just direct calls to sha256) cannot trigger this, the gas accounting works differently there.

why downcasting u64 value into i32 value is better than into u32 value?

This isn't C, this is Rust :) If the value is too big, it will return an error, which we handle properly. The old code tried to cast to an u32 which has a range of numbers where it technically succeeds (Rust returns Ok(_)) but the generated code is invalid.

robert-zaremba commented 11 months ago

@jakmeier wasm doesn't support unsigned integers?

jakmeier commented 11 months ago

@jakmeier wasm doesn't support unsigned integers?

What WASM supports or doesn't support is completely irrelevant here. The conversion from u64 to u32 or i32 in Rust is independent of the target. Also, the target here is x86, since the code doing the conversion is the compiler itself, not something that happens inside the VM runtime. The error happens at runtime, yes, but only because the compiler generated bad code.

On a side note, I would say the error "WebAssembly trap" is somewhat misleading, too. It's not a trap in the user provided WASM code. But actually in the x86 code that we inserted between the user's code. There is no line in the original WASM that we could point to where this is caused. In fact, I would say the jump to the exit code taken in this scenario can only ever happen if our runtime has bugs, so it should probably say something like "Gas accounting error" instead.

jakmeier commented 11 months ago

@robert-zaremba did you manage to get a workaround going?

Since this is not a vulnerability, the normal flow to deploy this patch would be to have it included in the next testnet release and have it run for a month there before we let it loose on mainnet. But this would put the ETA more like 5-6 weeks from now.

robert-zaremba commented 11 months ago

@jakmeier not really. I've checked:

The compilation issue is related to the wasm support (random normally is not allowed in smart contracts):

error: the wasm*-unknown-unknown targets are not supported by default, you may need to enable the "js" feature. For more information see: https://docs.rs/getrandom/#webassembly-support
   --> /home/robert/.cargo/registry/src/index.crates.io-6f17d22bba15001f/getrandom-0.2.10/src/lib.rs:285:9
    |
285 | /         compile_error!("the wasm*-unknown-unknown targets are not supported by \
286 | |                         default, you may need to enable the \"js\" feature. \
287 | |                         For more information see: \
288 | |                         https://docs.rs/getrandom/#webassembly-support");
    | |________________________________________________________________________^

error[E0433]: failed to resolve: use of undeclared crate or module `imp`
   --> /home/robert/.cargo/registry/src/index.crates.io-6f17d22bba15001f/getrandom-0.2.10/src/lib.rs:341:9
robert-zaremba commented 11 months ago

Note that 5-6 weeks is not acceptable for us :(

akhi3030 commented 11 months ago

The bot automatically closed this. I'm reopening it till we find a solution for @robert-zaremba.

jakmeier commented 11 months ago

There was quite a bit of a Zulip discussion regarding a workaround for your particular case of hitting this problem. TLDR, until the SDK has a release that includes ed25519_verify, you can copy paste this into your code and get the exactly same functionality as the SDK will provide.

mod sys {
  extern "C" {
      pub fn ed25519_verify(
          sig_len: u64,
          sig_ptr: u64,
          msg_len: u64,
          msg_ptr: u64,
          pub_key_len: u64,
          pub_key_ptr: u64,
      ) -> u64;
  }
}

pub fn ed25519_verify(signature: &[u8; 64], message: &[u8], public_key: &[u8; 32]) -> bool {
    unsafe {
        sys::ed25519_verify(
            signature.len() as _,
            signature.as_ptr() as _,
            message.len() as _,
            message.as_ptr() as _,
            public_key.len() as _,
            public_key.as_ptr() as _,
        ) == 1
    }
}