Report message dispatch status to source chain

serban300 commented 1 year ago

Related to paritytech/parity-bridges-common#2443

I think we might need to report the message dispatch status from target chain to source chain.

For example for the flow Rockmine -> BridgeHub Rococo -> BridgeHubWococo -> Wockmint we might have an XCM message that reached BridgeHubWococo, but the dispatch failed. In this case the assets won't be trapped anywhere (more details in this comment) . I think it would be good if we could report the failed dispatch to BridgeHub Rococo so that BridgeHub Rococo could then send a QueryResponse back to Rockmine to somehow unlock the assets there.

@svyatonik do you think this would make sense ?

svyatonik commented 1 year ago

A couple of questions (sorry - I'm yet to read everything in paritytech/parity-bridges-common#2443 carefully): 1) what exactly needs to be delivered back to the source chain? Is it just 1 bit of information (that we have had before) or is it some more complicated structure? If that's just one bit, we can revert removal and if that's something more complicated, then I suggest using regular messaging to report the error - i.e. on_fail: queue-error-report-message-to-the-same-lane-in-reverse-direction 2) very dumb question, just idea. When I read about XCM some time ago, I thought that this QueryResponse is sent back automatically (if ReportError is in the XCM program). IIUC on BridgeHubWococo the problem is that we are not executing anything (i.e. are not using XCM executor). But can we e.g. use XCM execute that will call the XcmBlobMessageDispatch itself. Like we already do on the source parachain - it executes some XCM that sends message (using XCMP) to sibling parachain. IIUC we can try to do similar thing here. Do you think it is possible?

serban300 commented 1 year ago

what exactly needs to be delivered back to the source chain? Is it just 1 bit of information (that we have had before) or is it some more complicated structure? If that's just one bit, we can revert removal and if that's something more complicated, then I suggest using regular messaging to report the error - i.e. on_fail: queue-error-report-message-to-the-same-lane-in-reverse-direction

Yes, I think we would only need 1 bit to report success/failure.

very dumb question, just idea. When I read about XCM some time ago, I thought that this QueryResponse is sent back automatically (if ReportError is in the XCM program). IIUC on BridgeHubWococo the problem is that we are not executing anything (i.e. are not using XCM executor). But can we e.g. use XCM execute that will call the XcmBlobMessageDispatch itself. Like we already do on the source parachain - it executes some XCM that sends message (using XCMP) to sibling parachain. IIUC we can try to do similar thing here. Do you think it is possible?

In theory QueryResponse would be sent back automatically if ReportError is in the XCM program. But from what I understand it would be sent automatically, only if the message came from a sibling parachain, because it uses send. I think we would need export to send it over the bridge, or some custom logic. We could probably call execute on the XCM message, but still the fact that ReportError uses send is a problem and probably we would need more customization for this. Also if we send it back on the bridge we might run in the same issue, namely that the dispatch might fail on the other side and the message might get lost.

svyatonik commented 1 year ago

That's a bummer :/ I thought it is all covered by XCM when I've removed that functionality. But it is kinda strange that this stuff (ReportError + QueryResponse) only works within local consensus - maybe we could ask @KiChjang if that's by design or it isn't just implemented yet? Or maybe you have idea why this isn't implemented, @serban300 ?

That said, I'm not against reverting dispatch-result-removal, if you think we need it. However, please note that this will affect the weight of receive_message_delivery_proof call, so please don't do anything complicated directly from there. Just iterate over results bitvec and save every failure to some other queue, where it'll be processed from the on_idle/on_initialize or something like that. Also: we need to impl refund in this case (like we'll be accounting worst case, which is reads_writes(1, 1) or something like that, but at least some calls will be successful => zero weight).

KiChjang commented 1 year ago

This doesn't seem like it's a huge issue for me -- you can simply implement SendXcm on a struct that sends the QueryResponse over a bridge, prepending ExportMessage as necessary, and assigning it as an XcmSender to the XCM executor config. In other words, this is more of an implementation issue than a design problem.

svyatonik commented 1 year ago

Thanks! Then I wonder why do we need a separate ExportMessage instruction at all (we could just use send instead?) + why two traits (SendXcm and ExportXcm) instead of just one (if we could send message to other consensus using SendXcm). But that's probably because I haven't explored XCM in details.

But I guess if it is fine to use send to send message to other consensus system, then probably that's a way to go @serban300 ? Or is it much more complex than relaying dispatch results using bridge-specific stuff?

serban300 commented 1 year ago

I hope I'm not missing something. I'm not very familiar with XCM, so take the following with a grain of salt, but:

In my opinion the biggest issue with ReportError is that as far as I understand the delivery is not guaranteed, and

we can't implement a retry mechanism in the XCM message
we can't do ReportError inside ReportError, and even if we could, it would be really hard to follow.
not sure if we can / would make sense to implement a retry mechanism in the XCM configs (XCM sender/MessageExporter) somehow

If we only have a communication between 2 chains let's say A and B and A wants to send some assets to B, probably ReportError is ok. If something fails on A, the assets will be trapped on A. If something fails on B, the assets will be trapped on B, and also ReportError will send a QueryResponse back, which might arrive or not, but still the assets will be trapped on B. As long as the assets are trapped somewhere we are ok-ish even if the claiming process is not automated..

But if we consider a longer distance scenario things get complicated. In our case Rockmine -> BridgeHub Rococo -> BridgeHubWococo -> Wockmint for example, if something fails on BridgeHubWococo, even if we execute the message, the assets won't be trapped there (because ReserveAssetDeposited is not executed there anyway, it should be executed on Wockmint) and the QueryResponse might not get back to Rockmine in which case we might lose the message and the assets completely.

Instead if we relay the dispatch result to BridgeHub Rococo, at least we have a stored proof that the message failed. The assets are still not trapped, but maybe as long as we hold in the storage the fact that the message failed, we can retry to send QueryResponses to Rockmine until one succeeds ? Not sure if this is possible either, but seems safer to have something in the storage at least that can be linked with the original message.

Also relaying the dispatch status would be more efficient. We would only need to send 1 bit back with the receive_message_delivery_proof which we send anyway, and we would avoid the customizations needed to make ReportError work over the bridge.

But not sure if this is the best solution. Still thinking of alternatives. Couldn't find something trivial yet. Also maybe I'm missing something. But anyway, @svyatonik could you also point me to the commit with the dispatch-result-removal please, just in case we need to revert it.

serban300 commented 1 year ago

if something fails on BridgeHubWococo, even if we execute the message, the assets won't be trapped there (because ReserveAssetDeposited is not executed there anyway, it should be executed on Wockmint) and the QueryResponse might not get back to Rockmine in which case we might lose the message and the assets completely.

Also probably it's highly unlikely, but we could get a decoding error on BridgeHubWococo, in which case we wouldn't be able to read the message at all, and even if it was a ReportError instruction inside, we wouldn't be able to execute it. But we would be able to send back dispatch_status: failed.

svyatonik commented 1 year ago

You maybe right - I have no much expertise in XCM as well. I just wanted to avoid reinventing the wheel here. Because I was expecting that everything XCM need from bridges is to deliver blobs (as it was requested last year) (c). Nothing about delivering "dispatch result" or anything like that. So if we need to get dispatch result bits back - let's do it.

That's the PR that has removed relaying dispatch result: https://github.com/paritytech/parity-bridges-common/pull/1660 You'll probably also need callbacks that are invoked when messages are sent (unlikely - you may do the same from our XcmBlobHaulerAdapter) and confirmed: https://github.com/paritytech/parity-bridges-common/pull/1649

KiChjang commented 1 year ago

In my opinion the biggest issue with ReportError is that as far as I understand the delivery is not guaranteed

This is by design actually. XCM is designed to be Asynchronous, and so none of the XCM components assume that you can expect a timely delivery. The reason behind this design decision is exactly as you've discovered -- the complexity of designing a timely message delivery system really high, and likely not generalizable either, so by default XCM assumes asynchronicity. If you want to implement a retry mechanism, then you'll have to implement it on top of XCM.

paritytech / parity-bridges-common

Report message dispatch status to source chain #2022