Roasbeef opened 4 months ago
Right after posting this, I realize that we do resume the parcels here: https://github.com/lightninglabs/taproot-assets/blob/main/tapfreighter/chain_porter.go#L164-L177.
So the restart logic is fine, but we still shouldn't let an inability to deliver a proof stop us from calling `ConfirmParcelDelivery`. In an observed case, a user had the wrong/invalid host set for a proof courier, so a connection couldn't be made and the proofs could never be delivered.
@Roasbeef Here's what I think we should do:
To help ensure reliable proof courier services, the sender should verify successful ping responses from all target proof courier services before attempting the transfer. If a proof cannot be delivered to the proof courier, but the courier did respond to a ping, the existing backoff procedure should be utilized to retry delivery. Should we run out of backoff attempts, the proof must be synced to a public universe server before calling `ConfirmParcelDelivery`. If the proof is not synced to a public universe server, `ConfirmParcelDelivery` should not be called, and proof delivery to the target proof courier should be retried later.
This process aims to avoid the scenario where the proof is not delivered to the proof courier, not synced with a public universe server, and the parcel is marked as complete. In that scenario the receiver may never receive their proof.
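To make the proposed ordering concrete, here's a minimal Go sketch of that flow. All names here (`Courier`, `UniverseSyncer`, `deliverWithFallback`, `maxAttempts`) are hypothetical placeholders, not the actual tapd API:

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// Courier and UniverseSyncer are hypothetical stand-ins for the proof
// courier and universe clients, not the actual tapd interfaces.
type Courier interface {
	Ping(ctx context.Context) error
	Deliver(ctx context.Context, proofBlob []byte) error
}

type UniverseSyncer interface {
	SyncProof(ctx context.Context, proofBlob []byte) error
}

const maxAttempts = 5

// deliverWithFallback sketches the proposed ordering: ping before the
// transfer starts, retry delivery with backoff, and only fall back to
// a public universe sync once the backoff budget is exhausted.
func deliverWithFallback(ctx context.Context, courier Courier,
	uni UniverseSyncer, proofBlob []byte) error {

	// Refuse to start if the courier is unreachable (e.g. a
	// wrong/invalid host), rather than failing after broadcast.
	if err := courier.Ping(ctx); err != nil {
		return fmt.Errorf("courier unreachable: %w", err)
	}

	// Reuse the existing backoff procedure to retry delivery.
	backoff := time.Second
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := courier.Deliver(ctx, proofBlob); err == nil {
			// Safe to call ConfirmParcelDelivery now.
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}

	// Out of attempts: the proof must land somewhere the receiver
	// can fetch it before the parcel is marked delivered.
	if err := uni.SyncProof(ctx, proofBlob); err != nil {
		return fmt.Errorf("proof neither delivered nor synced: %w", err)
	}
	return nil
}
```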
What do you think?
Looking at `transferReceiverProof` more closely, it looks like we don't record the transfer until the proof courier upload succeeds? That seems backwards.
The transfer happened regardless of the upload to a courier service. Proofs could also be sent to the receiver without a courier. Seems like we should re-order the DB call `ConfirmParcelDelivery` to be before the proof courier interaction.
If that `SendStateComplete` is happening early, I don't see where we record that courier delivery actually succeeded. That could be a separate column in the `transfers` table IMO.
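A rough sketch of what that reordering could look like, with a hypothetical `MarkProofDelivered` standing in for the proposed `transfers` column update (none of these interfaces are the real tapdb/tapfreighter types):

```go
package sketch

import "context"

// Store and CourierClient are hypothetical interfaces standing in for
// the tapdb store and the proof courier; the real tapd types differ.
type Store interface {
	// ConfirmParcelDelivery applies the pending outputs and marks
	// the anchor TX as confirmed.
	ConfirmParcelDelivery(ctx context.Context) error

	// MarkProofDelivered flips the proposed per-transfer delivery
	// column in the transfers table.
	MarkProofDelivered(ctx context.Context) error
}

type CourierClient interface {
	Deliver(ctx context.Context, proofBlob []byte) error
}

// finalizeThenDeliver shows the suggested ordering: record the
// transfer on disk first, then attempt courier delivery, tracking
// delivery success in its own column instead of gating the DB confirm
// on the upload.
func finalizeThenDeliver(ctx context.Context, db Store,
	courier CourierClient, proofBlob []byte) error {

	// The transfer happened on-chain regardless of the courier
	// upload, so confirm it on disk before any courier interaction.
	if err := db.ConfirmParcelDelivery(ctx); err != nil {
		return err
	}

	// Courier delivery can now fail (and be retried later) without
	// leaving the transfer itself unrecorded.
	if err := courier.Deliver(ctx, proofBlob); err != nil {
		return err
	}

	return db.MarkProofDelivered(ctx)
}
```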
The restart logic seems to only handle transfers unconfirmed on-chain, not those pending delivery via courier.
A neater solution may be a background loop to deliver undelivered receiver transfer proofs when sent a signal by the chain porter. If we write transfer data before attempting delivery, we can always find undelivered transfer proofs on restart.
Yeah, I think this is the way to go. We should decouple the proof transfer from confirming on disk, which updates all the relevant records and allows users to send the assets once again. The way things are, we write the proof courier addr to disk with the transfer outputs, which means that once set, it can't be updated again. Instead, we should effectively split this into two states, where we update everything on disk and then continue to transmit in the background. We can do this with another state.
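A minimal sketch of such a background loop, assuming hypothetical `ProofDB`/`ProofCourier` interfaces and a wake channel signalled by the chain porter; on restart, the same scan naturally picks up anything written to disk before delivery:

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// ProofDB and ProofCourier are hypothetical; the real chain porter
// types in tapfreighter differ.
type ProofDB interface {
	// PendingReceiverProofs returns transfer proofs that were
	// written to disk but never delivered.
	PendingReceiverProofs(ctx context.Context) ([][]byte, error)
	MarkDelivered(ctx context.Context, proofBlob []byte) error
}

type ProofCourier interface {
	Deliver(ctx context.Context, proofBlob []byte) error
}

// deliveryLoop wakes on a signal from the chain porter (or on a
// periodic timer after restart) and retries every undelivered
// receiver proof found on disk.
func deliveryLoop(ctx context.Context, db ProofDB, courier ProofCourier,
	wake <-chan struct{}) {

	for {
		select {
		case <-wake:
		case <-time.After(time.Minute): // periodic safety net
		case <-ctx.Done():
			return
		}

		proofs, err := db.PendingReceiverProofs(ctx)
		if err != nil {
			log.Printf("unable to list pending proofs: %v", err)
			continue
		}

		for _, p := range proofs {
			if err := courier.Deliver(ctx, p); err != nil {
				// Leave it pending; we'll retry on the
				// next wake-up or restart.
				continue
			}
			_ = db.MarkDelivered(ctx, p)
		}
	}
}
```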
New column: the `transfers` table will be updated soon.
@ffranr this is the ordering that still seems incorrect:
https://github.com/lightninglabs/taproot-assets/issues/1009#issuecomment-2225938093
`ConfirmParcelDelivery` (which should maybe be renamed) updates some parts of the transfer, re-anchors passive assets, and marks the anchor TX as confirmed.
IMO all of that should happen before trying to push proofs to receivers via courier. With the new DB column added, we can better represent these states, like confirmed on-chain but undelivered.
In the current ordering, if pushing via courier fails for any receiver, `ConfirmParcelDelivery` won't be called.
TBH the logic of `ConfirmParcelDelivery` could be moved up to `StoreProofs`, so `TransferProofs` is only concerned with proof pushing and updating that single transfer column.
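As a sketch of that split, assuming simplified stand-ins for the real send states and parcel type (all names hypothetical):

```go
package sketch

import "context"

// SendState is a simplified stand-in for tapfreighter's send-state
// enum; the values below only sketch the proposed split.
type SendState uint8

const (
	// SendStateStoreProofs would now also run the finalization that
	// currently lives in ConfirmParcelDelivery: applying pending
	// outputs, re-anchoring passive assets, and marking the anchor
	// TX as confirmed.
	SendStateStoreProofs SendState = iota

	// SendStateTransferProofs is then only concerned with pushing
	// proofs to couriers and updating the single delivery column.
	SendStateTransferProofs

	SendStateComplete
)

// parcel is a hypothetical stand-in; the stub methods below mark
// where the real finalization and proof-push logic would run.
type parcel struct {
	state SendState
}

func (p *parcel) finalizeTransfer(ctx context.Context) error { return nil }
func (p *parcel) pushProofs(ctx context.Context) error       { return nil }

// advance sketches the resulting ordering: finalization happens before
// any courier interaction, so a courier failure can no longer block
// the transfer from being recorded.
func advance(ctx context.Context, p *parcel) (SendState, error) {
	switch p.state {
	case SendStateStoreProofs:
		if err := p.finalizeTransfer(ctx); err != nil {
			return p.state, err
		}
		return SendStateTransferProofs, nil

	case SendStateTransferProofs:
		if err := p.pushProofs(ctx); err != nil {
			// Retried later; the transfer itself is
			// already finalized on disk.
			return p.state, err
		}
		return SendStateComplete, nil
	}

	return p.state, nil
}
```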
Reopening as the current itest tests proof courier restart but not tapd restart.
@jharveyb I don't think that's true, see:
Not that we don't restart tapd in general, but that we don't cover the sequence of:
I'm thinking of this specific TODO:
**Background**
After we confirm a transfer and update proofs (only for the file-based store), we'll attempt to upload any proofs (if needed) to a universe: https://github.com/lightninglabs/taproot-assets/blob/fca9d11f027cdb5b81a36c4a8999c6b010ac0c44/tapfreighter/chain_porter.go#L1165-L1190
The issue here is that this is run in a goroutine, with the state step function returning immediately. Upon restart, we won't attempt to publish the proofs again. Even if we don't have any proofs, this is important, as only once we transfer all the proofs do we call `ConfirmParcelDelivery`: https://github.com/lightninglabs/taproot-assets/blob/main/tapfreighter/chain_porter.go#L795-L808

The function name is slightly overloaded, but `ConfirmParcelDelivery` is what will actually finalize the transfer by applying the pending outputs (make new assets w/ the relevant details): https://github.com/lightninglabs/taproot-assets/blob/fca9d11f027cdb5b81a36c4a8999c6b010ac0c44/tapdb/assets_store.go#L2816-L2842

Note that we do have the transfer log and can resume from that, but this doesn't allow us to also execute the callback that finalizes the transfer.
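For illustration, a sketch of the startup scan this would enable, under the assumption that we persist both on-chain confirmation and finalization status (types and names here are hypothetical simplifications):

```go
package sketch

import (
	"context"
	"log"
)

// PendingParcel and ParcelStore are hypothetical simplifications of
// tapfreighter's actual types.
type PendingParcel struct {
	// Confirmed is true once the anchor TX confirmed on-chain.
	Confirmed bool

	// Finalized is true once ConfirmParcelDelivery has applied the
	// pending outputs.
	Finalized bool
}

type ParcelStore interface {
	QueryPendingParcels(ctx context.Context) ([]*PendingParcel, error)
}

// resumePendingParcels sketches the startup path: instead of only
// resuming parcels that are unconfirmed on-chain, also relaunch the
// state machine for confirmed-but-unfinalized ones so the proof
// transfer and ConfirmParcelDelivery get re-executed.
func resumePendingParcels(ctx context.Context, store ParcelStore,
	relaunch func(*PendingParcel)) error {

	parcels, err := store.QueryPendingParcels(ctx)
	if err != nil {
		return err
	}

	for _, p := range parcels {
		if p.Confirmed && p.Finalized {
			// Nothing left to do for this parcel.
			continue
		}

		log.Printf("resuming parcel (confirmed=%v, finalized=%v)",
			p.Confirmed, p.Finalized)
		relaunch(p)
	}

	return nil
}
```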
One other relevant detail is that `transferReceiverProof` sets the state to `SendStateComplete`: https://github.com/lightninglabs/taproot-assets/blob/fca9d11f027cdb5b81a36c4a8999c6b010ac0c44/tapfreighter/chain_porter.go#L814, but only after it has completed. There's a line before the call that prematurely updates this state: https://github.com/lightninglabs/taproot-assets/blob/fca9d11f027cdb5b81a36c4a8999c6b010ac0c44/tapfreighter/chain_porter.go#L1169-L1172

**Steps to reproduce**
Update an itest to initiate a transfer and confirm it on-chain, but then cause the proof transfer to fail.
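A sketch of the shape such an itest could take, with trivial stubs standing in for the real itest harness helpers (all names hypothetical):

```go
package itest

import (
	"errors"
	"testing"
)

// harness is a trivial stub standing in for the real itest harness;
// only the shape of the reproduction matters here.
type harness struct {
	courierDown bool
}

func (h *harness) sendAssets() error      { return nil }
func (h *harness) mineBlocks(n int) error { return nil }
func (h *harness) restartTapd() error     { return nil }

func (h *harness) proofDelivered() (bool, error) {
	if h.courierDown {
		return false, errors.New("courier unreachable")
	}
	return true, nil
}

// TestProofTransferFailureRestart sketches the reproduction: confirm a
// transfer while the courier is down, restart tapd, then bring the
// courier back and require that delivery eventually resumes.
func TestProofTransferFailureRestart(t *testing.T) {
	h := &harness{courierDown: true}

	if err := h.sendAssets(); err != nil {
		t.Fatalf("send: %v", err)
	}
	if err := h.mineBlocks(1); err != nil {
		t.Fatalf("mine: %v", err)
	}

	// Proof transfer should fail while the courier is down.
	if ok, _ := h.proofDelivered(); ok {
		t.Fatal("proof unexpectedly delivered")
	}

	// Restart tapd; with the fix, the pending proof transfer should
	// be picked up again once the courier is reachable.
	if err := h.restartTapd(); err != nil {
		t.Fatalf("restart: %v", err)
	}
	h.courierDown = false

	if ok, err := h.proofDelivered(); err != nil || !ok {
		t.Fatalf("proof not re-delivered after restart: %v", err)
	}
}
```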
**Expected behavior**
It should continue to transmit the proofs, but we should also fix the logic above to call `ConfirmParcelDelivery` as soon as we get the confirmation notification.

We should also remove the premature update of `SendStateComplete`. This'll ensure that the case gets re-executed on startup until the proofs are actually sent.

With the above, we also need to read the parcels from disk to relaunch the state machine for parcels that weren't finalized.
**Actual behavior**
On restart, tapd won't reattempt the proof transfer. However, the proof files on disk are properly updated.