movementlabsxyz / movement

The Movement Network is a Move-based L2 on Ethereum.
Apache License 2.0
51 stars 48 forks source link

`PENDING` Race condition #463

Open 0xmovses opened 2 weeks ago

0xmovses commented 2 weeks ago

Summary

#[tokio::test]
async fn test_bridge_poll_stuck() {
    // Initialize the mock blockchain service
    let mut blockchain_service = MockBlockchainService::build();

    // Simulate initiating a bridge transfer
    blockchain_service
        .initiator_contract
        .with_next_bridge_transfer_id("transfer_id")
        .initiate_bridge_transfer(
            InitiatorAddress::from("initiator"),
            RecipientAddress::from("recipient"),
            HashLock("hash_lock"),
            TimeLock(100),
            Amount(1000),
        )
        .await
        .expect("initiate_bridge_transfer failed");

    // Simulate locking assets on the counterparty side
    blockchain_service
        .counterparty_contract
        .lock_bridge_transfer_assets(
            BridgeTransferId("transfer_id2"), 
            HashLock("hash_lock2"),
            TimeLock(100),
            InitiatorAddress::from("initiator2"),
            RecipientAddress::from("recipient2"),
            Amount(1000),
        )
        .await
        .expect("lock_bridge_transfer_assets failed");

    // Create a context for polling
    let mut cx = Context::from_waker(futures::task::noop_waker_ref());

    // Poll for the next event
    // This should return a Ready state since we've added events to both initiator and counterparty
    let event = blockchain_service.poll_next_event(&mut cx);

    // Assert that the event is not Pending
    // If this assertion fails, it indicates that the system is stuck in a Pending state
    // even though there are events in the queue.
    // The system might get stuck in a Pending state due to how initiator and counterparty
    // events are processed simultaneously.
    // https://github.com/movementlabsxyz/movement/blob/dd8c8a7a5a946e8071550bf3192176aa99b39fe8/protocol-units/bridge/shared/src/blockchain_service.rs#L45-L58
    assert_ne!(
        event,
        Poll::Pending
    );

    // Print the event for debugging purposes
    println!("{:?}", event);

    // Add another event to the queue to simulate a new bridge transfer to see if another event
    // detection works
    blockchain_service
        .initiator_contract
        .with_next_bridge_transfer_id("transfer_id3")
        .initiate_bridge_transfer(
            InitiatorAddress::from("initiator3"),
            RecipientAddress::from("recipient3"),
            HashLock("hash_lock3"),
            TimeLock(100),
            Amount(1000),
        )
        .await
        .expect("initiate_bridge_transfer failed");

    // Create a new context for polling
    let mut cx = Context::from_waker(futures::task::noop_waker_ref());

    // Poll for the next event again
    let event = blockchain_service.poll_next_event(&mut cx);

    // Assert that the event is not Pending
    // If this assertion fails, it indicates that the system is stuck in a Pending state
    // even though there are events in the queue.
    assert_ne!(
        event,
        Poll::Pending
    );

    // Print the event for debugging purposes
    println!("{:?}", event);
}

Possible Solutions

Event Queuing

Idea: Queue up the events from the initiator and counterparty separately. When an event is detected, enqueue it and process the queue in order, ensuring that events are processed one at a time. Implementation: Maintain separate queues for initiator and counterparty events, and process each queue in the order events were received.

let mut event_queue: VecDeque<Event> = VecDeque::new();

if let Poll::Ready(event) = blockchain_service.poll_next_event_from_initiator(&mut cx) {
    event_queue.push_back(event);
}

if let Poll::Ready(event) = blockchain_service.poll_next_event_from_counterparty(&mut cx) {
    event_queue.push_back(event);
}

if let Some(event) = event_queue.pop_front() {
    return Poll::Ready(event);
}

Poll::Pending

Atomic State Management

Idea: Use atomic state management to keep track of the state of event processing. For example, you can use an atomic flag to ensure that only one event (from either the initiator or counterparty) is processed at any given time. Implementation: An atomic flag can be used to track whether the system is currently processing an event. If one event is being processed, other events are temporarily ignored until the first one is completed.

use std::sync::atomic::{AtomicBool, Ordering};

static IS_PROCESSING: AtomicBool = AtomicBool::new(false);

if !IS_PROCESSING.compare_and_swap(false, true, Ordering::SeqCst) {
    match blockchain_service.poll_next_event(&mut cx) {
        Poll::Ready(event) => {
            IS_PROCESSING.store(false, Ordering::SeqCst);
            return Poll::Ready(event);
        }
        Poll::Pending => {
            IS_PROCESSING.store(false, Ordering::SeqCst);
        }
    }
}
musitdev commented 1 week ago

First, I'll explain how I understand this code to ensure we're aligned on what needs to be done.

The function poll_next_event in the BlockchainService trait is integrated into a stream of ContractEvent events. Each time the poll_next function is called, it invokes the two poll methods from CounterpartyContractMonitoring and InitiatorContractMonitoring. This implementation returns when either of the two methods returns. The possible cases are:

The proposed solutions don’t address this issue. I don’t fully understand the second solution, but like the first, it only returns one event.

I don't see the issue with returning Poll::Pending. In the test test_bridge_poll_stuck(), we initialize both sources with events. There's no reason to receive a Poll::Pending if the polling method works correctly. If Poll::Pending is returned, it suggests that the polling method of one or both services isn't working as expected because they aren’t returning their events. The test also doesn't verify that all events from both sources are received. Since both sources contain events, we should get all of them. Because the code doesn't handle the (Ready, Ready) case, there's a strong chance that it will fail.

0xmovses commented 1 week ago

Yes, you're right.

We should handle the (Ready, Ready) case though as that will be a likely case.

musitdev commented 1 week ago

To do that, there are 2 possibilities, I think:

I don't have a real opinion. It seems the second do fewer changes, but I'm not sure as I don't know all the code that process event. Tell which you prefer.

0xmovses commented 1 week ago

Second choice 🙏