movementlabsxyz / movement

The Movement Network is a Move-based L2 on Ethereum.

Apache License 2.0


[Bridge] [Relayer] [Finality] Relay Events in finalised blocks only #838

Open franck44 opened 1 week ago

franck44 commented 1 week ago

Problem

Our bridge design relays events from Ethereum mainnet to our L2. These events are collected from logs, and logs are stored in blocks. A block is permanent (irreversible) once it is finalised, which takes on average ~15 minutes.

[!WARNING] If we relay events that are in non-finalised blocks, we bear the risk of a re-org. A re-org results in recent (non-finalised) blocks being "removed" from the canonical chain and "replaced" by other blocks. Relaying events from non-finalised blocks may result in attacks, e.g. with the following steps:

on L1, user1 sends a transaction $t$ to lock some assets in our bridge; $t$ is successful and it is included in a block $b$,

the relayer notices the corresponding emitted event in the L1 logs of the non-finalised block $b$ and relays it to the L2 to complete the transfer,

we mint some assets on the L2 and send them to user1 (their account on L2),

the L1 block $b$ is invalidated because of a re-org; in the new block $b'$ that $t$ is included in, the transaction $t$ fails (no assets are transferred to the bridge on L1). Potential attack: This results in user1 receiving some assets on L2 without locking the corresponding amount in our bridge.

You may be interested in this post for a primer on Ethereum events. For re-orgs, Alchemy has an explainer.

We may dismiss the previous scenario arguing that re-orgs are infrequent. However, since EIP-4844, re-orgs are more frequent according to this analysis.

[!CAUTION] It looks like re-orgs can occur at a rate of $0.3\%$, i.e. 3 blocks per 1000 blocks. A block is produced every ~12 seconds, so 1000 blocks correspond to $12 \times 1000 = 12000$ seconds, which is 3h 20mins (200 minutes). With 3 re-orged blocks in that window, we may have a re-org on average roughly every 1h 7mins ($200 / 3 \approx 67$ minutes).
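The arithmetic above can be reproduced with a quick back-of-the-envelope sketch. The 0.3% rate and the 12-second block time are the figures quoted in this comment; the function name is purely illustrative.

```python
def mean_minutes_between_reorgs(reorg_rate: float, slot_seconds: float = 12.0) -> float:
    """Average wall-clock minutes between re-orged blocks.

    reorg_rate is the fraction of blocks that get re-orged (0.003 for 0.3%).
    """
    if not 0 < reorg_rate <= 1:
        raise ValueError("reorg_rate must be in (0, 1]")
    blocks_between_reorgs = 1 / reorg_rate
    return blocks_between_reorgs * slot_seconds / 60


if __name__ == "__main__":
    print(f"{mean_minutes_between_reorgs(0.003):.1f} minutes")
```

This is only an average: it says nothing about how re-orgs cluster in time or how deep they are.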

Unfortunately there are no clear results on how deep a re-org is, i.e. how many non-finalised blocks it can impact.

[!NOTE] The same issue exists when bridging from L2 to L1, and we need to make sure that the L2 blocks we relay events from are finalised.

Proposal

The proposal is to protect against the attack described above.

[!IMPORTANT] In order to mitigate the previous attack, we need to relay events that are in finalised blocks only.
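As a minimal sketch of what "finalised blocks only" could mean for the relayer, the filter below keeps only events at or below the finalised height. The `BridgeEvent` shape and the function name are assumptions for illustration, not the relayer's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BridgeEvent:
    # Hypothetical shape of a bridge event read from the chain logs.
    transfer_id: str
    block_number: int


def relayable_events(events: list[BridgeEvent], finalized_block_number: int) -> list[BridgeEvent]:
    """Keep only the events whose block is at or below the finalised height."""
    return [e for e in events if e.block_number <= finalized_block_number]
```

A height check alone is not sufficient if the events were read before finalisation: a log observed earlier at the same height may belong to a block that was later re-orged, so the events would need to be read (or re-read) from the finalised range of the canonical chain.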

Implementation

It looks like the current implementation of the relayer relies on block confirmations to relay events. A block $b$ is k-confirmed if k blocks have been produced and appended to $b$, i.e. there is a chain of k blocks that are children of $b$.
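The definition of k-confirmed can be written down as a small check. This is a sketch using the convention stated above (k blocks appended on top of $b$); the function names are illustrative.

```python
def confirmations(block_number: int, head_number: int) -> int:
    """Number of blocks appended after `block_number` on the current chain."""
    return max(head_number - block_number, 0)


def is_k_confirmed(block_number: int, head_number: int, k: int) -> bool:
    """True if at least k blocks have been built on top of the block."""
    return confirmations(block_number, head_number) >= k
```

Some tooling counts the containing block itself as the first confirmation, so the off-by-one convention is worth checking against whatever the relayer uses. Note also that a confirmation count is measured against the current head only and gives no finality guarantee.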

Some exchanges rely on a confirmation of 20 or 30. Etherscan displays the real status of blocks, "unfinalised/unfinalised(safe)/finalised".

If there is an API to query the status of a block, we may use it to verify that the blocks we relay events from are finalised.

Validation

The implementation impacts the relayer and some new tests may be needed to validate the changes.

franck44 commented 1 week ago

@musitdev

Here is a summary of our discussion on this issue:

we relay events from finalised blocks only on L1 and L2
this results in ~12 to 15mins for a bridge transfer L1 -> L2, and probably a similar time (depending on the policy we adopt) from L2 -> L1.

[!IMPORTANT] So perhaps it needs some update in the UI so that the user can be informed on how long they have to wait.

To implement this solution we need to retrieve the status of a block (finalised or not).

on the Eth side:

The call eth_getBlockByNumber with the finalized tag returns the last finalized block.

on the Mvnt side:

we open a REST API port that only returns finalized blocks, exposing the same API. We then just have to request blocks in the finalized state.
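For the Ethereum side, the request can be sketched as a plain JSON-RPC body. The method name and the finalized block tag are the ones mentioned above; the helper names are illustrative and the HTTP transport is left out. The Movement-side REST endpoint is not specified in this thread, so it is not sketched here.

```python
import json


def finalized_block_request(request_id: int = 1) -> str:
    """JSON-RPC body asking an Ethereum node for the latest finalised block."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getBlockByNumber",
        # "finalized" is a block tag; False asks for transaction hashes only.
        "params": ["finalized", False],
    })


def finalized_block_number(response_body: str) -> int:
    """Extract the block number (hex-encoded in the response) as an int."""
    result = json.loads(response_body)["result"]
    if result is None:
        raise ValueError("node returned no finalised block")
    return int(result["number"], 16)
```

The relayer would then only process logs from blocks up to the returned number.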

The locations of the changes in the code base are:

Primata commented 1 week ago

Agree with that. Don't think that considering Safe blocks as an option though. This is going to delay us by quite a bit in terms of development and it completely changes the expected behavior, so business should weigh in.

Safe blocks are roughly 25 blocks behind the current block (about 5 minutes).

Finalized blocks are roughly 65 blocks behind the current block (about 13 minutes).
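The minute figures follow from the ~12-second block time. A quick sketch, where the block distances are the approximate ones quoted above and the names are illustrative:

```python
SLOT_SECONDS = 12  # approximate Ethereum block time used in this thread


def wait_minutes(blocks_behind_head: int) -> float:
    """Approximate wait, in minutes, for a block to be this far behind the head."""
    return blocks_behind_head * SLOT_SECONDS / 60


if __name__ == "__main__":
    for label, depth in [("1 confirmation", 1), ("3 confirmations", 3),
                         ("safe (~25 blocks)", 25), ("finalized (~65 blocks)", 65)]:
        print(f"{label}: ~{wait_minutes(depth):.1f} min")
```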

Using either safe blocks or finalized blocks won't make much of a difference in terms of UX.

If we cannot implement this in time, I believe we should at least increase block.confirmations from 1 to 3.

apenzk commented 1 week ago

> Don't think that considering Safe blocks as an option though. This is going to delay us by quite a bit in terms of development and it completely changes the expected behavior and business should weight in.

If we can implement block.confirmations=3, we can do block.confirmations=32. This does not return the finalization level, but it would be approximately correct.

In blockchain, security has priority, so the focus should be on safety (a high confirmation number initially), optimising UX progressively (decreasing the confirmation number thereafter) if we feel that it is safe to do so, rather than the other way around. After all, we are approaching fees the same way (start conservative and then improve).

tbh this is Ethereum. Everyone is aware that finality on Eth takes O(10min).

Primata commented 1 week ago

@apenzk The issue with 32 blocks is that we have to make significant changes to the UI and UX, i.e. to how transactions are managed in the frontend. The current design considers a single transaction in a single go, which you can then complete from the transaction history. We might need some new design components and end up behind schedule. Changing to 3 confirmations keeps the same UI/UX. That's where I'm worried we might be late. I agree that it has to be done though.

andygolay commented 1 week ago

From here https://etherscan.io/blocks_forked

It looks like a block is forked, that is, dropped from the blockchain, on average about 10 - 25 times per day.

I set the view to 100 rows and looked at several pages. There were only reorgs of depth 1. Double block reorgs may happen sometimes, but they don't appear to happen often.

So if we know we can expect that many blocks dropped per day, then we can impose certain limits for the vast majority of bridge users, to minimize risk and loss while prioritizing good UX along with security.

We could also have a "white glove" or more private bridge designed for larger amounts, that requires the full 32 blocks.

The attack described in this issue would only result in loss of funds to the bridge. If the bridge:

limits the size of transfers
limits the number of transfers per block
rate-limits users
starts the bridge with some extra fee to cover reorg losses, and gradually lowers the fee as Movement Foundation feels comfortable doing so
requires 3 confirmations to cover the risk of 1 to 2 block reorgs

then I think Movement Foundation could safely and predictably cover losses, even in the rare event of a many-block reorg, and we could still have a fast bridge.
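To make the "capped loss" argument concrete, here is a rough worst-case model. It assumes every transfer relayed from a re-orged block is lost, and that the per-block limits listed above are enforced; all parameter names are hypothetical.

```python
def worst_case_loss(reorg_depth: int, confirmations_required: int,
                    max_transfers_per_block: int, max_transfer_value: float) -> float:
    """Upper bound on bridge loss for a single re-org under per-block limits.

    Only blocks that were already confirmed (and hence relayed) when the
    re-org hit are exposed: reorg_depth - confirmations_required of them.
    """
    exposed_blocks = max(reorg_depth - confirmations_required, 0)
    return exposed_blocks * max_transfers_per_block * max_transfer_value
```

With 3 confirmations, a re-org of depth 1 or 2 exposes nothing under this model, while a depth-5 re-org exposes 2 blocks' worth of transfers. The model does not say how likely deep re-orgs are, which is the open question in this thread.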

Optimizing for good UX is uncommon in the crypto world, and as "Movement" I do feel that for brand alignment, products should try to work fast and smoothly. So when considering security first, I believe we can use economic and probability-oriented measures such as those listed above, to also achieve a fast, enjoyable bridging UX.

There may be other attacks that should be considered in the event of a reorg. For example:

When bridging from L2 to L1, the block containing a lockBridgeTransfer or completeBridgeTransfer transaction is dropped. Solution: a service is set up to monitor for dropped blocks and bridge transactions within them. If a faulty call is found, it is matched against an initiate_bridge_transfer call on L2. After verifying the initiator is correct, they are refunded automatically on L2.
If a refund transaction is in a dropped block, the service monitoring for reorgs will notice and re-attempt the refund. This could be manually approved by an admin.
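A monitoring service along these lines could detect dropped blocks by comparing the block hashes it recorded with the current canonical ones. This is a sketch only: `canonical_hash_at` is a hypothetical lookup that would query a node in practice, and matching against initiate_bridge_transfer calls and refunds are out of scope.

```python
from typing import Callable, Optional


def dropped_blocks(seen: dict[int, str],
                   canonical_hash_at: Callable[[int], Optional[str]]) -> list[int]:
    """Return block numbers whose recorded hash no longer matches the canonical chain.

    seen maps block number -> block hash recorded when bridge transactions
    in that block were observed.
    """
    return sorted(n for n, h in seen.items() if canonical_hash_at(n) != h)
```

For example, with recorded hashes for blocks 10 to 12 and a canonical chain where block 11 was replaced, the function returns `[11]`, and the bridge transactions recorded for that block would be the ones to re-check.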

One factor to consider is whether there would be a token supply increase in the case of the user receiving their tokens on L2 without locking on L1. If the assets are minted on L2, which is the current model, then we would need to consider whether to burn corresponding assets on L1 or some other solution, to balance the supply according to whatever tokenomic model is in place.

apenzk commented 1 week ago

> So if we know we can expect that many blocks dropped per day, then ....

The chance of blocks being dropped cannot be known in advance. We cannot predict whether many blocks or none will be dropped.

> I set the view to 100 rows and looked at several pages. There were only reorgs of depth 1. Double block reorgs may happen sometimes, but they don't appear to happen often.

We cannot rely on heuristics like this. If you have a report or study, please refer to that; otherwise this kind of guesstimate sounds super unsafe.

> The attack described in this issue would only result in loss of funds to the bridge. If the bridge:
>
> limits the size of transfers

I would not recommend this. It just incentivizes splitting bridge transfers into smaller portions.

> We could also have a "white glove" or more private bridge designed for larger amounts, that requires the full 32 blocks.

👍 Could we start with the "larger amount" being 0?

> rate-limits users

This just incentivizes creating new accounts.

> starts the bridge with some extra fee to cover reorg losses, and gradually lower the fee as Movement Foundation feels comfortable doing so

Bridge losses would not come predictably or regularly. You could have no loss for 2 months and then ...

> Movement Foundation could safely and predictably cover losses,

You cannot safely nor predictably cover such losses.

> Optimizing for good UX is uncommon in the crypto world, and as "Movement" I do feel that for brand alignment, products should try to work fast and smoothly.

Yes, but please not at the cost of safety.

> There may be other attacks that should be considered in the event of a reorg

Yes, that is a good point. The relayer could check whether "completes" or "locks" were successful.

> One factor to consider is whether there would be a token supply increase in the case of the user receiving their tokens on L2 without locking on L1. If the assets are minted on L2, which is the current model, then we would need to consider whether to burn corresponding assets on L1 or some other solution, to balance the supply according to whatever tokenomic model is in place.

There is this MIP, which proposes a security fund in case of catastrophic failures. But as a reminder, this is for catastrophic events. Double spends should never occur. The security fund is proposed with the desire that it NOT be used at all.

andygolay commented 1 week ago

If there's going to be an insistence on 32 blocks (over 6 minutes on average) per Eth transaction, for this bridge design, then that makes the bridge pretty much unusable in the context of our current UI. No one will wait 6 minutes for their L2 wallet to pop up and complete on L2. It's already some friction when it's fast, but if it's slow it just won't be used.

So in that case I would favor some attestor-based model or using LayerZero. I agree with @0xPrimata that it would be good for the business side to weigh in with company priorities for whether and how we want to continue rolling out this HTLC bridge model.

For testnet, I don't see why it would be so risky to try it with 3 confirmations, and again, if there's a max amount of value bridged per block, then I do think the amount of loss can be capped in a way that is manageable by the Movement Foundation. But again it depends on company priorities so I think it would be best to get input from business on it.

Realistically, there's no such thing as perfect safety; there's just risk management. I think it should be up to the Movement Foundation to determine what risk profile they're willing to tolerate.

franck44 commented 1 week ago

> If there's going to be an insistence on 32 blocks (over 6 minutes on average) per Eth transaction, for this bridge design, then that makes the bridge pretty much unusable in the context of our current UI. No one will wait 6 minutes for their L2 wallet to pop up and complete on L2. It's already some friction when it's fast, but if it's slow it just won't be used.

So IMHO, UX should not prevail and guide the backend design. The backend should be secure. On Optimism, it takes 1-3mins to bridge to Opt and a week to bridge back. On Arbitrum, it takes 15-30mins to bridge to L2 and a week to bridge back. On zkSync Era, bridging is ~15mins (time to finalisation on Eth). On Linea it takes ~20mins.

[!IMPORTANT] If it takes ~15mins to bridge to Mvnt, we are on par with the main chains.

> Realistically, there's no such thing as perfect safety; there's just risk management. I think it should be up to the Movement Foundation to determine what risk profile they're willing to tolerate.

Yes that's true, and our role is to provide data so that an informed decision can be made.

One strong point for using finalisation as a criterion for relaying is that it is stable. If we use confirmations (12, 32, 65), the security guarantees may change over time, depending on Ethereum upgrades. For example, the introduction of blobs brought more frequent re-orgs.

[!IMPORTANT] Finalisation is a stable criterion that is safe and does not change over time (the time to finalisation can change and become shorter in the future, that's what is expected).

If we want to be a safe chain (like Arbitrum, zkSync Era, Linea) it seems natural to opt for finalisation on L1. Otherwise, we can take a risk, but hopefully this risk is low (we need to quantify it).

andygolay commented 1 week ago

Correct me if I'm wrong but it looks like from a quick read, Optimism may be only using 1 conf? (See their op-batcher code: https://docs.optimism.io/builders/chain-operators/tutorials/create-l2-rollup). And I read elsewhere that they only use 1 conf... but haven't found conclusive proof.

apenzk commented 1 week ago

Looking at https://docs.optimism.io/builders/chain-operators/configuration/proposer my first question would be: what is this number of confirmations for? That link says it's the time for validators to react to the proposer transaction. So I'm not sure that it relates to the bridge.

(Also, in that link the default is 10.)

Here is more on the batcher: https://specs.optimism.io/protocol/batcher.html

andygolay commented 1 week ago

> Looking at https://docs.optimism.io/builders/chain-operators/configuration/proposer my first question would be what is this the number of confirmations for? In this link it says its the time for validators to react on the proposer transaction. Which is fine but also would be unrelated to the bridge.
>
> (In that link the default is 10)

Okay, 10 probably makes more sense if it's 1 - 3 minutes. I can try to dig deeper into their code if I get time.

Something to consider about the above bridges @franck44 mentions:

Those are not HTLC bridges. They do not require the user to sign a transaction on the L2, if I understand correctly. (I've only given each bridge a cursory glance so please do correct me if I'm wrong about that.)

If we were to have a design where the user is not required to sign on L2 to receive their funds, then I think it would be more reasonable to have many more confirmations tolerated by users.

Regarding the comment "UX should not prevail and guide the backend design": if there is a definition of UX under which it doesn't seem important to prioritize UX, then that should be formalized. Movement's messaging has espoused optimizing for UX, meaning user experience. And rightly so. From my understanding of Movement's priorities, the user experience must always be top priority, with security being included as part of user experience. I will defer to @rolandoesparza to help facilitate priorities in that regard.

From what I can tell, my point still stands: if we limit the value of assets transferred in each block, then losses could be a manageable cost of doing business for Movement Foundation. Fees can be adjusted so that Movement Foundation is profitable regardless of refunds.

I'm in the process of trying to get historical Ethereum mainnet reorg data to establish economic models to iterate on.

andygolay commented 4 days ago

Regarding how this impacts the UI, if we were to require say 32 confirmations, I guess one solution could be to just show a user's transfers and their states (pending, completed, refunded, etc) associated with each connected L1 and L2 wallet. So for example in the L1 -> L2 direction, instead of sitting and waiting for the L2 wallet to pop up, they can leave and come back to see whether they're ready to sign the "complete" transaction on L2. They can then click the "Complete" button to prompt the wallet to pop up and sign. Tagging @rolandoesparza @vpallegar as that might be a decent short-term UI fix.
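The transfer states mentioned above could be modelled as a small state machine that the UI reads from. This is a sketch only: the READY_TO_COMPLETE state and the transition table are assumptions, not the frontend's actual model.

```python
from enum import Enum


class TransferState(Enum):
    PENDING = "pending"                # waiting for L1 confirmations or finality
    READY_TO_COMPLETE = "ready"        # hypothetical: user may now sign "complete" on L2
    COMPLETED = "completed"
    REFUNDED = "refunded"


# Allowed transitions (assumed for illustration).
ALLOWED = {
    TransferState.PENDING: {TransferState.READY_TO_COMPLETE, TransferState.REFUNDED},
    TransferState.READY_TO_COMPLETE: {TransferState.COMPLETED, TransferState.REFUNDED},
    TransferState.COMPLETED: set(),
    TransferState.REFUNDED: set(),
}


def advance(current: TransferState, new: TransferState) -> TransferState:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

The "Complete" button would only be enabled for transfers in the ready state, which is what lets the user leave and come back later.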

This still doesn't solve the need to sponsor transactions on L2, though. And with Movement Foundation paying for refunds, (1) that could get very costly for Movement Foundation and (2) there's a risk of users forgetting to come back to the computer in a timely manner to finish the transfer. They can't complete on L2 from a device other than the one they initiated the transfer with, because the pre-image is stored locally. Maybe that could change through some use of accounts, e.g. users could log into their Movement account and have more of a multi-device experience, but that functionality is not built yet, nor is the mechanism to sponsor transactions.

If we could do 3 confirmations, then I think it would make sense to require users to sit at the screen and wait for their L2 wallet to pop up. But with 32 confirmations, there would need to be more of an async UI experience.

On a related note, I think the Simplified Bridge Design https://github.com/movementlabsxyz/MIP/pull/58/ is worth serious consideration because it would satisfy the finality asks in this issue, and it would remove the need for refunds and the need for users to have funds on L2 to cover the completion transaction fees.

franck44 commented 2 days ago

To clarify: the simplified bridge design is a standard lock/mint bridge design, it is not new.

The previous confusion originated from RFC-40 (Atomic bridge) which used an atomic swap to design a bridge transaction.

In an atomic swap, there are two users, one on chain A, and one on chain B. They want to swap assets atomically and they don't trust each other. That's why there are:

two transactions, one on chain A and one on chain B
a timeout to allow the users to accept or reject the deal (swap).

You can use the atomic swap mechanism to implement a bridge transaction but a bridge transaction is fundamentally different: there is one user with two accounts, one on chain A and one on chain B. The user cannot accept the deal on chain A (lock their asset) and reject the deal on chain B (mint the equivalent representation of the asset on chain B).
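A lock/mint bridge can be pictured as simple accounting: the supply minted on chain B must never exceed what is locked on chain A. The toy model below is a sketch of that invariant only, with illustrative names; it is not the design from the MIP.

```python
class LockMintBridge:
    """Toy accounting model: minted supply may never exceed locked assets."""

    def __init__(self) -> None:
        self.locked_on_a = 0
        self.minted_on_b = 0

    def lock(self, amount: int) -> None:
        """Record assets locked by the user on chain A."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.locked_on_a += amount

    def mint(self, amount: int) -> None:
        """Mint the representation on chain B, only against locked assets."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self.minted_on_b + amount > self.locked_on_a:
            raise ValueError("mint would exceed locked assets")
        self.minted_on_b += amount
```

Relaying a lock event from a block that is later re-orged is exactly the case where this invariant would break, since the mint would stand while the lock disappears.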

I raised this point a few times: "a bridge is not a swap".