integritee-network / worker

Integritee off-chain worker and sidechain validateer
Apache License 2.0
89 stars 46 forks source link

Potential state mismatch caused by failed indirect trusted_call #1232

Open Kailai-Wang opened 1 year ago

Kailai-Wang commented 1 year ago

I'm trying to raise the discussion here and I'd appreciate any feedback :)

In our opinion we should always return Ok() for the trusted_call execution based on the current implementation, rationale:

// TODO(Kai@litentry): // If this function returns Err(), it will feed the executor with Ok(ExecutedOperation::failed()), // which will remove the failed op from its own top pool while preventing it from being included // in a sidechain block - see execute_trusted_call_on_stf. // // As a result, when other workers import sidechain blocks, they will treat the op as // "not yet executed" (before it's not recorded in the sidechain block) and try to execute it again from // its own top pool (if the op is added to the top pool upon e.g. parentchain block import). // // The execution will most likely fail again. However, the state could have been changed already by applying // the state diff from the imported sidechain block. This could cause an inconsistent/mismatching state, // for example, the nonce. See the nonce handling below: we increased the nonce no matter the STF is executed // successfully or not. // // This is probably the reason why the nonce-handling test in demo_shielding_unshielding.sh sometimes fails. // // for now we should always return Ok(()) for this function and propagate the exe, at least for // litentry STFs. I believe this is the right way to go, but it still needs more discussions.

clangenb commented 1 year ago

I think, in general, we should probably adhere to substrate's convention, which is obviously not yet the case:

Include a trusted operation in the sidechain block even if it failed, but add an extrinsic failed event to the system pallet on the sidechain and hence include it in the statediff.

If our design allows that, I would prefer that we return Err() for failed trusted operations, but the executor (or better a layer in between) catches that and handles the error case.

However, there is one misconception, we do not propagate the transaction pool yet, see https://github.com/integritee-network/worker/issues/616#issuecomment-1022090746. Hence, the nonce-handling issue in the demo_shielding_unshielding.sh has probably nothing to do with that.

Kailai-Wang commented 1 year ago

but the executor (or better a layer in between) catches that and handles the error case.

Yea if we manage to do that it would be great. The problem is the trustedCall is a superset of (potentially multiple) sgx-runtime-extrinsics, where some extrinsics succeed, some can fail. But statediff is applied on a trustedCall basis (If I understand it correctly).

we do not propagate the transaction pool yet

With "propagate" do you mean tx pool gossip? Like synchronise the tx pool between peers? But my understanding is:

Not sure if I miss something though

Kailai-Wang commented 1 year ago

I just saw the title change by @brenzi , which I think makes sense.

In fact, we probably need to differentiate the handling between indirect and direct calls:

So it would be great if we could have a unified solution for both cases.

clangenb commented 1 year ago

I am not convinced that the handling should be different for both cases. Once we gossip the trusted operation pool to fellow validateers they also need to know when a trusted operation has been tried to execute. So I hope that we can just do the same in both cases:

Kailai-Wang commented 1 year ago

I gave it a second thought, yes I believe in both cases the unsuccessfully executed top hash should be included in the block.

A follow-up question would be: should we remove it from the top pool? The current implementation is:

  1. execute STF
  2. remove failed top from pool
  3. compose the sidechain block using successful tops

However, imagine block composition eventually fails, then in step2: a. if we remove it from the pool, it won't be picked up next time as it's already left out b. if we keep it in the pool, it will be re-executed until a sidechain block is successfully produced

I kind of feel imperfect in either way, it's partially because STF execution is broader than mere sgx-runtime transactions, you can add pre/post steps before/after the runtime tx to e.g. communicate with the parachain.

Edit: I believe option a) is still better, which means if it fails to execute this time, then that's it: a failure is a failure. We shouldn't try to execute it repeatedly. If the user wants to retry, then another request is needed.

To summarise, the path should be: