lightninglabs / taproot-assets

A layer 1 daemon for the Taproot Assets Protocol specification, written in Go (golang)
MIT License

tapfreighter/[bug]: universe proof transfer not resumed after restart #1009

Open

Roasbeef opened this issue 1 month ago

Roasbeef commented 1 month ago

Background

After we confirm a transfer and update proofs (only for the file-based store), we'll attempt to upload any proofs (if needed) to a universe: https://github.com/lightninglabs/taproot-assets/blob/fca9d11f027cdb5b81a36c4a8999c6b010ac0c44/tapfreighter/chain_porter.go#L1165-L1190

The issue is that this runs in a goroutine, with the state step function returning immediately. Upon restart, we won't attempt to publish the proofs again. This matters even when there are no proofs to upload, because we only call ConfirmParcelDelivery once all the proofs have been transferred: https://github.com/lightninglabs/taproot-assets/blob/main/tapfreighter/chain_porter.go#L795-L808.
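To make the failure mode concrete, here's a minimal, self-contained sketch of that pattern (the types and function are simplified stand-ins, not the actual chain_porter code): the step function hands delivery off to a goroutine and advances the state immediately, so a restart before the goroutine finishes leaves nothing on disk to retry.

```go
package main

import (
	"fmt"
	"time"
)

// SendState is a simplified stand-in for the porter's state enum.
type SendState int

const (
	SendStateTransferProofs SendState = iota
	SendStateComplete
)

// stepTransferProofs illustrates the problematic shape: proof delivery
// happens in a goroutine, while the step function returns immediately.
// If the process restarts before the goroutine finishes, nothing on
// disk says the delivery is still outstanding, so it is never retried.
func stepTransferProofs(deliver func() error) SendState {
	go func() {
		// Hypothetical delivery call; errors here are only logged,
		// and the state below has already advanced.
		if err := deliver(); err != nil {
			fmt.Println("proof delivery failed:", err)
		}
	}()

	// Returned before delivery has actually happened.
	return SendStateComplete
}

func main() {
	state := stepTransferProofs(func() error {
		time.Sleep(100 * time.Millisecond) // simulate a slow courier
		return nil
	})
	fmt.Println("state advanced to:", state) // prints before delivery ends
	time.Sleep(200 * time.Millisecond)
}
```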

The function name is slightly overloaded, but ConfirmParcelDelivery is what actually finalizes the transfer by applying the pending outputs (making new assets with the relevant details): https://github.com/lightninglabs/taproot-assets/blob/fca9d11f027cdb5b81a36c4a8999c6b010ac0c44/tapdb/assets_store.go#L2816-L2842.

Note that we do have the transfer log and can resume from that, but this doesn't allow us to also execute the callback that finalizes the transfer.

One other relevant detail: transferReceiverProof sets the state to SendStateComplete only after it has completed: https://github.com/lightninglabs/taproot-assets/blob/fca9d11f027cdb5b81a36c4a8999c6b010ac0c44/tapfreighter/chain_porter.go#L814. However, there's a line before that call which prematurely updates the state: https://github.com/lightninglabs/taproot-assets/blob/fca9d11f027cdb5b81a36c4a8999c6b010ac0c44/tapfreighter/chain_porter.go#L1169-L1172
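A minimal sketch of the intended ordering, with the premature update removed (the deliver parameter is a hypothetical stand-in for the real delivery call): the state only advances once delivery has actually finished, so a restart after a failure retries the step.

```go
package main

import (
	"errors"
	"fmt"
)

type SendState int

const (
	SendStateTransferProofs SendState = iota
	SendStateComplete
)

// transferReceiverProof, sketched: on failure we stay in the transfer
// state so a restart re-executes this step; only a successful delivery
// flips the state to SendStateComplete.
func transferReceiverProof(deliver func() error) (SendState, error) {
	if err := deliver(); err != nil {
		return SendStateTransferProofs, err
	}
	return SendStateComplete, nil
}

func main() {
	state, err := transferReceiverProof(func() error { return nil })
	fmt.Println(state == SendStateComplete, err)

	state, err = transferReceiverProof(func() error { return errors.New("courier down") })
	fmt.Println(state == SendStateComplete, err)
}
```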

Steps to reproduce

Update an itest to initiate a transfer and confirm it, but then cause the proof transfer to fail.

Expected behavior

On restart, the daemon should continue to transmit the proofs. We should also fix the logic above to call ConfirmParcelDelivery as soon as we get the confirmation notification.

We should also remove the premature update to SendStateComplete. This ensures that the step gets re-executed on startup until the proofs are actually sent.

With the above, we also need to read the parcels from disk on startup, to relaunch the state machine for any parcels that weren't finalized.
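A rough sketch of what that startup pass could look like, with a hypothetical Parcel record and store standing in for the real transfer log:

```go
package main

import "fmt"

// Parcel is a minimal stand-in for an on-disk pending transfer record.
type Parcel struct {
	ID              int
	ProofsDelivered bool
}

// resumePendingParcels sketches the startup pass described above: read
// unfinalized parcels from disk and re-enter the state machine so proof
// delivery is retried until it actually succeeds. The store and field
// names are hypothetical.
func resumePendingParcels(store []Parcel, deliver func(Parcel) error) {
	for _, p := range store {
		if p.ProofsDelivered {
			continue // already finalized, nothing to do
		}
		if err := deliver(p); err != nil {
			fmt.Printf("parcel %d: delivery will be retried: %v\n", p.ID, err)
		}
	}
}

func main() {
	parcels := []Parcel{{ID: 1, ProofsDelivered: true}, {ID: 2}}
	resumePendingParcels(parcels, func(p Parcel) error {
		fmt.Printf("re-delivering proofs for parcel %d\n", p.ID)
		return nil
	})
}
```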

Actual behavior

On restart, tapd won't reattempt the proof transfer. However, the proof files on disk are properly updated.

Roasbeef commented 1 month ago

Right after posting this, I realized that we do resume the parcels here: https://github.com/lightninglabs/taproot-assets/blob/main/tapfreighter/chain_porter.go#L164-L177.

So the restart logic is fine, but we still shouldn't let the inability to deliver a proof stop us from calling ConfirmParcelDelivery. In an observed case, a user had a wrong/invalid host set for a proof courier, so a connection couldn't be made and the proofs could never be delivered.

ffranr commented 1 month ago

> So the restart logic is fine, but we still shouldn't let the inability to deliver a proof stop us from calling ConfirmParcelDelivery. In an observed case, a user had a wrong/invalid host set for a proof courier, so a connection couldn't be made and the proofs could never be delivered.

@Roasbeef Here's what I think we should do:

To help ensure reliable proof courier services, the sender should verify successful ping responses from all target proof courier services before attempting the transfer. If a proof cannot be delivered to the proof courier, but the courier did respond to a ping, the existing backoff procedure should be utilized to retry delivery. Should we run out of backoff attempts, the proof must be synced to a public universe server before calling ConfirmParcelDelivery. If the proof is not synced to a public universe server, ConfirmParcelDelivery should not be called, and proof delivery to the target proof courier should be retried later.

This process aims to avoid the scenario where the proof is not delivered to the proof courier, not synced with a public universe server, and the parcel is marked as complete. In that scenario the receiver may never receive their proof.
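Sketched in code, under the assumption of hypothetical ping/deliver/sync/confirm hooks (none of these are the actual courier APIs), the policy looks roughly like this:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// deliverWithFallback sketches the proposed policy. Every parameter is
// a hypothetical stand-in: ping/deliver for the courier interaction,
// syncToUni for a public universe sync, and confirm for
// ConfirmParcelDelivery.
func deliverWithFallback(ping, deliver, syncToUni, confirm func() error, maxAttempts int) error {
	// 1. Verify the courier responds before attempting the transfer.
	if err := ping(); err != nil {
		return fmt.Errorf("courier unreachable, not sending: %w", err)
	}

	// 2. The courier is up, so retry delivery with backoff.
	backoff := 100 * time.Millisecond // short for the sketch
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := deliver(); err == nil {
			return confirm()
		}
		time.Sleep(backoff)
		backoff *= 2
	}

	// 3. Out of backoff attempts: only finalize if the proof is safe
	// on a public universe server; otherwise leave the parcel pending
	// and retry courier delivery later.
	if err := syncToUni(); err != nil {
		return errors.New("proof neither delivered nor synced; retry later")
	}
	return confirm()
}

func main() {
	ok := func() error { return nil }
	courierDown := func() error { return errors.New("connection refused") }

	// Ping succeeds, delivery keeps failing, the universe sync saves us.
	err := deliverWithFallback(ok, courierDown, ok, ok, 2)
	fmt.Println("result:", err)
}
```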

What do you think?

jharveyb commented 1 month ago

Looking at transferReceiverProof more closely, it looks like we don't record the transfer until proof courier upload succeeds? That seems backwards.

The transfer happened regardless of any upload to a courier service. Proofs could also be sent to the receiver without a courier. Seems like we should re-order things so that the ConfirmParcelDelivery DB call happens before the proof courier interaction.

If that SendStateComplete is happening early, I don't see where we record that courier delivery actually succeeded. That could be a separate column in the transfers table IMO.
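For illustration only, such a column could look something like the following migration (wrapped in Go here; the table and column names are assumptions, not the actual tapdb schema):

```go
package main

import "fmt"

// Hypothetical sketch of the extra transfers-table column suggested
// above, expressed as a migration the daemon might apply.
const addProofDeliveryColumn = `
ALTER TABLE asset_transfers
    ADD COLUMN proof_delivery_complete BOOLEAN NOT NULL DEFAULT FALSE;
`

func main() {
	fmt.Print(addProofDeliveryColumn)
}
```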

The restart logic seems to only handle transfers unconfirmed on-chain, not those pending delivery via courier.

A neater solution may be a background loop to deliver undelivered receiver transfer proofs when sent a signal by the chain porter. If we write transfer data before attempting delivery, we can always find undelivered transfer proofs on restart.

Roasbeef commented 1 month ago

> A neater solution may be a background loop to deliver undelivered receiver transfer proofs when sent a signal by the chain porter. If we write transfer data before attempting delivery, we can always find undelivered transfer proofs on restart.

Yeah I think this is the way to go. We should decouple the proof transfer from confirming on disk, which updates all the relevant records and allows users to send the assets once again. With the way things are now, we write the proof courier addr to disk with the transfer outputs. This means that once set, it can't be updated again. Instead, we should effectively split this into two states, where we update everything on disk and then continue to transmit in the background. We can do this with another state.
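A self-contained sketch of that shape, with a signal channel standing in for the chain porter's notification and hypothetical pendingProofs/deliver hooks. Transfer data is assumed to already be committed to disk before the loop runs, so the startup kick picks up anything left over from before a restart.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// deliveryKicker runs a background loop that (re)attempts proof
// delivery whenever it is signaled, or once on startup. All types here
// are hypothetical simplifications.
type deliveryKicker struct {
	kick chan struct{}
	wg   sync.WaitGroup
}

func newDeliveryKicker(pendingProofs func() []int, deliver func(int) error) *deliveryKicker {
	d := &deliveryKicker{kick: make(chan struct{}, 1)}
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()
		for range d.kick {
			// Re-read the undelivered set from disk on every kick.
			for _, id := range pendingProofs() {
				if err := deliver(id); err != nil {
					fmt.Printf("proof %d still pending: %v\n", id, err)
				}
			}
		}
	}()
	// On startup, kick once so proofs left over from before a restart
	// are picked up without waiting for a new transfer.
	d.kick <- struct{}{}
	return d
}

// Kick signals the loop, without blocking if a run is already queued.
func (d *deliveryKicker) Kick() {
	select {
	case d.kick <- struct{}{}:
	default:
	}
}

func main() {
	d := newDeliveryKicker(
		func() []int { return []int{7} }, // undelivered proofs read from disk
		func(id int) error { fmt.Printf("delivering proof %d\n", id); return nil },
	)
	d.Kick()
	time.Sleep(100 * time.Millisecond)
	close(d.kick)
	d.wg.Wait()
}
```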

dstadulis commented 4 weeks ago

A new table column will be added soon.

jharveyb commented 3 weeks ago

@ffranr this is the ordering that still seems incorrect:

https://github.com/lightninglabs/taproot-assets/issues/1009#issuecomment-2225938093

ConfirmParcelDelivery (which should maybe be renamed) updates some parts of the transfer, re-anchors passive assets, and marks the anchor TX as confirmed.

IMO all of that should happen before trying to push proofs to receivers via courier. With the new DB column added, we can better represent these states, like confirmed on-chain but undelivered.

In the current ordering, if pushing via courier fails for any receiver, ConfirmParcelDelivery won't be called.

TBH the logic of ConfirmParcelDelivery could be moved up to StoreProofs, so TransferProofs is only concerned with proof pushing and updating that single transfer column.
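One way to picture that split, with hypothetical field names alongside the function names from the thread:

```go
package main

import "fmt"

// A sketch of the split suggested above: StoreProofs finalizes all the
// DB state (what ConfirmParcelDelivery does today), while
// TransferProofs only pushes proofs and flips a single delivery flag.
// All names besides ConfirmParcelDelivery/StoreProofs/TransferProofs
// are assumptions.
type transfer struct {
	confirmed       bool // anchor TX confirmed, pending outputs applied
	proofsDelivered bool // courier delivery done (the new column)
}

func storeProofs(t *transfer) {
	// Finalize on-chain state first; courier failures can no longer
	// block this step.
	t.confirmed = true
}

func transferProofs(t *transfer, push func() error) error {
	if err := push(); err != nil {
		return err // proofsDelivered stays false; retried later
	}
	t.proofsDelivered = true
	return nil
}

func main() {
	t := &transfer{}
	storeProofs(t)
	_ = transferProofs(t, func() error { return nil })
	fmt.Printf("%+v\n", *t)
}
```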

jharveyb commented 1 week ago

Reopening as the current itest tests proof courier restart but not tapd restart.

ffranr commented 6 days ago

> Reopening as the current itest tests proof courier restart but not tapd restart.

@jharveyb I don't think that's true, see:

https://github.com/lightninglabs/taproot-assets/blob/420f2467a6770a9c6042159dcd4c3ce194db25c2/itest/send_test.go#L378-L384

jharveyb commented 6 days ago

Not that we don't restart tapd in general, but that we don't cover the specific sequence in question.

I'm thinking of this specific TODO:

https://github.com/lightninglabs/taproot-assets/blob/420f2467a6770a9c6042159dcd4c3ce194db25c2/itest/send_test.go#L1192