paritytech / parity-bridges-common

Collection of Useful Bridge Building Tools 🏗️
GNU General Public License v3.0
271 stars 130 forks source link

XCM export_message() report errors back to statemine (to release blocked funds) #2443

Closed EmmanuellNorbertTulbure closed 1 year ago

EmmanuellNorbertTulbure commented 1 year ago

To recover from bridge errors.

serban300 commented 1 year ago

Taking a look.

bkontur commented 1 year ago

maybe this looks interesting XcmQueryHandler https://github.com/paritytech/polkadot/pull/6900

serban300 commented 1 year ago

Looks like there are 2 error management mechanisms in XCM:

More details here: https://polkadot.network/blog/xcm-part-three-execution-and-error-management

Trying to understand more in-depth how they work, how we can use them and at which chains.

serban300 commented 1 year ago

From what I understand, the Asset Trap works like this: if an XCM message fails while there are some assets in the holding register, those assets will be saved and will be kept on chain.

After that they can be reclaimed using the ClaimAsset instruction. The ClaimAsset instruction removes them from the on-chain storage and puts them back in the holding register enabling the caller to perform other XCM operations on them (for example to move them into his account).

Trying to simulate this scenario over a bridge and see exactly how it works.

serban300 commented 1 year ago

maybe this looks interesting XcmQueryHandler paritytech/polkadot#6900

Thanks ! This seems related to Error reporting, which we might need as well. Will investigate it also. We might need a combination of Error reporting and Asset trap.

serban300 commented 1 year ago

Managed to test this locally on rockmine2.

  1. Executed on rockmine2 using the Alice dev account:
    RuntimeCall::PolkadotXcm(pallet_xcm::Call::<Runtime>::execute {
    message: Box::new(xcm::VersionedXcm::V3(Xcm(vec![
        Instruction::<RuntimeCall>::WithdrawAsset(
            vec![MultiAsset {
                id: AssetId::Concrete(MultiLocation {
                    parents: 1,
                    interior: Junctions::Here,
                }),
                fun: Fungibility::Fungible(1000000000000),
            }]
            .into(),
        ),
    ]))),
    max_weight: Weight::from_parts(10000000, 0),
    });

This withdraws 1 KSM from Alice's account in rockmine2. Since we don't do anything with this amount, it is present in the holding register after executing the xcm instruction and the asset trap mechanism will save it on-chain in the rockmine2 storage

  1. Executed on rockmine2 using the Alice dev account:
    RuntimeCall::PolkadotXcm(pallet_xcm::Call::<Runtime>::execute {
    message: Box::new(xcm::VersionedXcm::V3(Xcm(vec![
        Instruction::<RuntimeCall>::ClaimAsset {
            assets: vec![MultiAsset {
                id: AssetId::Concrete(MultiLocation {
                    parents: 1,
                    interior: Junctions::Here,
                }),
                fun: Fungibility::Fungible(1000000000000),
            }]
            .into(),
            ticket: MultiLocation { parents: 0, interior: Junctions::Here },
        },
        Instruction::<RuntimeCall>::DepositAsset {
            assets: MultiAssetFilter::Wild(WildMultiAsset::All),
            beneficiary: MultiLocation {
                parents: 0,
                interior: Junctions::X1(Junction::AccountId32 {
                    network: None,
                    id: hex_literal::hex!(
                        "d43593c715fdd31c61141abd04a99fd6822c8558854ccde39a5684e7a56da27d"
                    )
                    .into(),
                }),
            },
        },
    ]))),
    max_weight: Weight::from_parts(10000000, 0),
    });

This claims the 1 KSM that was trapped and transfers it back to Alice's account on rockmine2.

However there are 2 things to note:

  1. The trapped assets are stored in a hashed form on-chain. So we need to claim the entire asset. If 1000000000000 units were trapped we can't make a partial claim for 1000 for example, because it won't be recognized. So we need to know the exact amount that was trapped and this can be affected by fees that are payed on the way. We need to check if and how we can find out the exact trapped amount in case of a failed xcm transaction over the bridge.

  2. By mistake I happened to execute a ClaimAsset instruction followed by a DepositAsset to a wrong account. I got an xcm_error: FailedToTransactAsset("AccountIdConversionFailed"), but I think I lost the asset. At least it didn't seem to be trapped again and I couldn't find a way to claim it. I will check this further.

Other things still left to be done:

serban300 commented 1 year ago

There is a logged issue related to losing assets when DepositAsset fails: https://github.com/paritytech/polkadot-sdk/issues/912

The problem is that the assets are removed from the holding register and in case of an error they are not put back. And it doesn't happen only for DepositAsset.

A solution would be to use a custom AssetTransactor that accounts for these lost assets.

serban300 commented 1 year ago

Another thing to pay attention to is this condition here.

There is a "barrier" condition that is checked before processing the XCM message. If this check fails, the XCVM won't be even started, post_process() will not be called and the assets will be lost (they won't be trapped).

We need to pay attention how to craft the messages in order to avoid triggering this barrier.

serban300 commented 1 year ago

maybe this looks interesting XcmQueryHandler paritytech/polkadot#6900

Thanks ! This seems related to Error reporting, which we might need as well. Will investigate it also. We might need a combination of Error reporting and Asset trap.

Started to look into Error reporting. From what I understand both ReportError and QueryResponse can only report the status. The assets will still remain trapped on the chain where the execution error occured. And we wouldn't be able to do a combination of ClaimAssets + ReportError/QueryResponse in order to report back to the source chain and unlock the assets automatically, because the Asset Trap is executed after the error handler or the appendix. And probably we wouldn't know how many assets to claim anyway.

Another interesting thing is that the AssetTrap is configurable. Thinking if we could use that.

serban300 commented 1 year ago

Trying to summarize the entire flow, to have a better overall picture. This is how it looks like at the present time. It might change. Some items still need research. I will update this comment as the research progresses.

Let's take for example an asset transfer from Rockmine -> BridgeHub Rococo -> BridgeHubWococo -> Wockmint

  1. On Rockmine:

The user calls an extrinsic. This extrinsic reserves the assets, prepares the XCM message and sends the XCM message to BridgeHub Rococo. We don't execute any XCM instruction here. The message will be executed as it reaches the next parachains. The message contains a ReserveAssetDeposited instruction that will be executed on Wockmint, which will put the assets in the holding register and a DepositAsset instruction that will move them in the user's account on Wockmint.

Possible failures that would involve the assets: BridgeXcmSender::validate() and BridgeXcmSender::deliver() are called after reserving the assets. Maybe some other small methods. Anyway, nothing related to the actual XCM message. The message will be executed as it reaches the next parachains.

Mitigations: I think we can call BridgeXcmSender::validate() before reserving the assets, and maybe some other calls. But BridgeXcmSender::deliver() needs to be executed after, no matter what. Anyway, we should handle the errors in the code and unlock the assets.

  1. On BridgeHubRococo:

Here the XCM message is received and we execute the ExportMessage instruction, which should send the message over the bridge to BridgeHubWococo, encoded as a blob.

Possible failures that would involve the assets: Here, the ReserveAssetDeposited is not called yet, so any failure would lead to losing the assets. They won't even be caught by the Asset Trap, because they are not in the holding register. Failures could involve the execution of ExportMessage, or any bridge failures.

Mitigations: TODO, STILL RESEARCHING. Here we might be able to send a QueryResponse back to Rockmine if the XCM fails.

  1. On BridgeHubWococo:

Here the XCM message is received as an encoded blob, the bridge logic decodes it and sends it to Wockmint. No XCM instruction is executed.

Possible failures that would involve the assets: Here, no XCM instruction is executed, so any failure would lead to losing the assets. They won't even be caught by the Asset Trap, because they are not in the holding register. Mainly there can be problems decoding the blob, or sending the message to Wockmint, but I think the chances to get an error here are slim.

Mitigations: TODO, STILL RESEARCHING

  1. On Wockmint:

Here we finally execute the last part of the XCM message: ReserveAssetDeposited + DepositAsset.

Possible failures that would involve the assets: Failures can be related to XCM message problems or the configuration of the pallet_xcm here. Maybe fees withdrawal, bad origin, etc. If any error occurs after ReserveAssetDeposited, the assets will be caught by the Asset Trap and an Event will be stored related to that. The event will contain the exact amount of assets that were traped.

Mitigations: Rockmint's account (or whatever the owner of the message was) will be able to execute a ClaimAsset instruction and reclaim the assets. Also a QueryResponse instruction could be sent back to Rockmine. The question is if we could automate the asset recovery here. TODO, STILL RESEARCHING

Apart from all these we have to be mindful of the XCM barrier on each runtime where XCM instructions are executed: https://github.com/paritytech/parity-bridges-common/issues/2443

serban300 commented 1 year ago

As per the guidance from the XCM team:

serban300 commented 1 year ago

Considering that the bridge message queue satisfies all the XCM design assumptions (tracked in paritytech/polkadot#2038), I can think of a single issue here: halting the bridge. If the bridge is halted, the in-flight messages might get lost. Thinking of solutions.

EmmanuellNorbertTulbure commented 1 year ago

@serban300 document the conclusion

serban300 commented 1 year ago

The conclusion is the one above: https://github.com/paritytech/parity-bridges-common/issues/2443#issuecomment-1692857792

Closing the issue