Potential state mismatch caused by failed indirect trusted_call

Kailai-Wang commented 1 year ago

I'm trying to raise the discussion here and I'd appreciate any feedback :)

In our opinion we should always return Ok() for the trusted_call execution based on the current implementation, rationale:

// TODO(Kai@litentry): // If this function returns Err(), it will feed the executor with Ok(ExecutedOperation::failed()), // which will remove the failed op from its own top pool while preventing it from being included // in a sidechain block - see execute_trusted_call_on_stf. // // As a result, when other workers import sidechain blocks, they will treat the op as // "not yet executed" (before it's not recorded in the sidechain block) and try to execute it again from // its own top pool (if the op is added to the top pool upon e.g. parentchain block import). // // The execution will most likely fail again. However, the state could have been changed already by applying // the state diff from the imported sidechain block. This could cause an inconsistent/mismatching state, // for example, the nonce. See the nonce handling below: we increased the nonce no matter the STF is executed // successfully or not. // // This is probably the reason why the nonce-handling test in demo_shielding_unshielding.sh sometimes fails. // // for now we should always return Ok(()) for this function and propagate the exe, at least for // litentry STFs. I believe this is the right way to go, but it still needs more discussions.

clangenb commented 1 year ago

I think, in general, we should probably adhere to substrate's convention, which is obviously not yet the case:

Include a trusted operation in the sidechain block even if it failed, but add an extrinsic failed event to the system pallet on the sidechain and hence include it in the statediff.

If our design allows that, I would prefer that we return Err() for failed trusted operations, but the executor (or better a layer in between) catches that and handles the error case.

However, there is one misconception, we do not propagate the transaction pool yet, see https://github.com/integritee-network/worker/issues/616#issuecomment-1022090746. Hence, the nonce-handling issue in the demo_shielding_unshielding.sh has probably nothing to do with that.

Kailai-Wang commented 1 year ago

but the executor (or better a layer in between) catches that and handles the error case.

Yea if we manage to do that it would be great. The problem is the trustedCall is a superset of (potentially multiple) sgx-runtime-extrinsics, where some extrinsics succeed, some can fail. But statediff is applied on a trustedCall basis (If I understand it correctly).

we do not propagate the transaction pool yet

With "propagate" do you mean tx pool gossip? Like synchronise the tx pool between peers? But my understanding is:

only successfully executed top call will be added to the sidechain block: https://github.com/integritee-network/worker/blob/6c84d2c76c13be162afe2607d1a1ef5b01707c6f/core-primitives/stf-executor/src/executor.rs#L133
upon block import, if top execution was successful, top with the same hash will be removed from top pool: https://github.com/integritee-network/worker/blob/6c84d2c76c13be162afe2607d1a1ef5b01707c6f/sidechain/consensus/aura/src/block_importer.rs#L311-L312 It means top which returns Err() will not be added to the sidechain block. Still, the statediff regarding nonce will be included in the sidechain block (as we incremented it before sgx-runtime-extrinsic execution). Parentchain block importer is independently processed in each worker, so whenever another worker finds the same indirect call, and because the corresponding top was not removed, it will try to execute it again and find that the nonce is unexpected.

Not sure if I miss something though

Kailai-Wang commented 1 year ago

I just saw the title change by @brenzi , which I think makes sense.

In fact, we probably need to differentiate the handling between indirect and direct calls:

for direct calls, failed STF execution should be removed from the pool, as it makes less sense to include it in the sidechain block and broadcast it to other nodes. Also the STF call should return Err() for synchronous subscription to the state listener.
for indirect calls, the failed call should still be included in the sidechain block to allow other nodes to detect it and skip the top execution with the same hash. Currently this can only be done by returning Ok() in every case, which is not so ideal.

So it would be great if we could have a unified solution for both cases.

clangenb commented 1 year ago

I am not convinced that the handling should be different for both cases. Once we gossip the trusted operation pool to fellow validateers they also need to know when a trusted operation has been tried to execute. So I hope that we can just do the same in both cases:

Don't include the top in a sidechain block if it fails during the prevalidation check, which doesn't really happen yet. Failing to decrypt or decode it is the only reason for such a failure that I see currently.
Include it in the sidechain block if it fails during the actual execution of the trusted operation. (Note: this is not implemented yet, returning an error does not include it in the block, which actually results in the behaviour described in this issue if it was an indirect call).

Kailai-Wang commented 1 year ago

I gave it a second thought, yes I believe in both cases the unsuccessfully executed top hash should be included in the block.

A follow-up question would be: should we remove it from the top pool? The current implementation is:

execute STF
remove failed top from pool
compose the sidechain block using successful tops

However, imagine block composition eventually fails, then in step2: a. if we remove it from the pool, it won't be picked up next time as it's already left out b. if we keep it in the pool, it will be re-executed until a sidechain block is successfully produced

I kind of feel imperfect in either way, it's partially because STF execution is broader than mere sgx-runtime transactions, you can add pre/post steps before/after the runtime tx to e.g. communicate with the parachain.

Edit: I believe option a) is still better, which means if it fails to execute this time, then that's it: a failure is a failure. We shouldn't try to execute it repeatedly. If the user wants to retry, then another request is needed.

To summarise, the path should be:

we should return Err() if the STF execution fails
the failed top should be removed from the tool
however, the failed top hash needs to be included in the sidechain block

integritee-network / worker

Potential state mismatch caused by failed indirect trusted_call #1232