storacha / w3filecoin-infra

⛴️ Filecoin Pipeline for web3.storage

Redesign w3filecoin data stack #25

Closed vasco-santos closed 1 year ago

vasco-santos commented 1 year ago

We can't rely on the current design, given that past assumptions about how an aggregate's size is calculated were not correct. We need to compute the commP of commPs (over pieces ordered by CAR size) to verify the aggregate.
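For illustration, the selection/ordering side of that computation might look like the sketch below. `computeAggregateCommP` is a hypothetical stand-in for the actual piece-aggregation primitive (e.g. something from `@web3-storage/data-segment`), not a real API:

```ts
interface PieceInfo {
  pieceLink: string // commP of the CAR file
  pieceSize: number // piece size in bytes
}

// Hypothetical primitive standing in for the real piece-aggregation code:
// computes the commP of commPs for an ordered list of pieces.
declare function computeAggregateCommP(pieces: PieceInfo[]): string

// Pick pieces for an aggregate and compute its commP. Order matters: the
// aggregate commP depends on the piece layout, hence sorting by size first.
function buildAggregate(pieces: PieceInfo[], maxSize: number) {
  const ordered = [...pieces].sort((a, b) => b.pieceSize - a.pieceSize)
  const selected: PieceInfo[] = []
  let total = 0
  for (const piece of ordered) {
    if (total + piece.pieceSize > maxSize) continue
    selected.push(piece)
    total += piece.pieceSize
  }
  return { link: computeAggregateCommP(selected), pieces: selected, size: total }
}
```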

1. Priority queue for cargo with DynamoDB state machine based on indexes

There are two tables within the DB Stack acting as a Priority Queue:

Schema

```ts
type Cargo = {
  // CAR file CID
  link: Link
  // Filecoin Piece CID - commP of CAR file
  pieceLink: Link
  // Filecoin Piece Size
  pieceSize: number
  // State of the cargo in the pipeline
  stat: CARGO_STAT
  // Priority in the queue - for now likely same as queuedAt
  priority: string
  // Timestamp
  queuedAt: string
  // TODO: Maybe timestamps for other stats?
  // Filecoin Aggregate CID - commP of commPs
  aggregateLink?: Link
  // Failed to add into aggregate code
  aggregateFailedCode?: string

  // INDEXES
  // primaryIndex: { partitionKey: link },
  // globalIndexes: {
  //   indexStat: {
  //     partitionKey: 'stat',
  //     sortKey: 'priority',
  //     projection: 'all' // TODO: see https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Projection.html
  //   },
  //   indexAggregate: {
  //     partitionKey: 'aggregateLink',
  //     sortKey: 'pieceSize',
  //     projection: 'keys_only' // TODO: see https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Projection.html
  //   }
  //   // TODO: Maybe we need an index to query by pieceLink
  // }
}

type Ferry = {
  // Filecoin Aggregate CID - commP of commPs
  link: Link
  // Aggregate size in bytes - TODO: maybe nice to have for metrics
  size: number
  // State of the ferry in the pipeline
  stat: FERRY_STAT
  // Priority in the queue - for now likely same as queuedAt
  priority: string
  // Timestamp
  queuedAt: string

  // INDEXES
  // primaryIndex: { partitionKey: link },
  // globalIndexes: {
  //   indexStat: {
  //     partitionKey: 'stat',
  //     sortKey: 'priority',
  //     projection: 'keys_only' // TODO: see https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Projection.html
  //   }
  // }
}

// CID
type Link = string

// State of Cargo state machine
type CARGO_STAT = 'QUEUED' | 'OFFERING' | 'SUCCEED' | 'FAILED'

// State of Ferry state machine
type FERRY_STAT = 'QUEUED' | 'ARRANGING' | 'SUCCEED' | 'FAILED'
```
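For reference, the cargo table above could be declared roughly as below with the SST `Table` construct (assuming SST v2; the construct ID, handler path, and stream/consumer wiring are illustrative):

```ts
import { Table, StackContext } from 'sst/constructs'

export function dbStack({ stack }: StackContext) {
  // Only key/index attributes need declared types in DynamoDB
  const cargoTable = new Table(stack, 'cargo', {
    fields: {
      link: 'string',
      stat: 'string',
      priority: 'string',
      aggregateLink: 'string',
      pieceSize: 'number',
    },
    primaryIndex: { partitionKey: 'link' },
    globalIndexes: {
      indexStat: { partitionKey: 'stat', sortKey: 'priority', projection: 'all' },
      indexAggregate: { partitionKey: 'aggregateLink', sortKey: 'pieceSize', projection: 'keys_only' },
    },
    stream: true,
    consumers: {
      // triggers the aggregation attempt described in the Flow below
      attemptFerryLoad: 'packages/functions/src/attempt-ferry-load.handler',
    },
  })
  return { cargoTable }
}
```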

State Machine

Cargo State Machine might have the following state changes:
* `QUEUED` -> `OFFERING` - when a cargo item is associated with an aggregate to offer for storage
* `OFFERING` -> `SUCCEED` - end state, as the cargo is now available in Storage Providers
* `OFFERING` -> `FAILED` - cargo could not make it to a Storage Provider because this specific cargo failed (e.g. wrong commP, or it could not be fetched)
* `OFFERING` -> `QUEUED` - cargo could not make it to a Storage Provider because other cargo in the same aggregate failed, with no issue reported for this specific cargo. Therefore, it can be queued for inclusion in another aggregate
* `FAILED` -> `SUCCEED` - cargo previously failed, but the reason behind the failure has since been resolved

Ferry State Machine might have the following state changes (both transition sets are sketched in code below):
* `QUEUED` -> `ARRANGING` - when a given ferry was included in an `aggregate/offer` invocation to the Storage Broker
* `ARRANGING` -> `SUCCEED` - when the `aggregate/offer` for the ferry succeeded
* `ARRANGING` -> `FAILED` - when the `aggregate/offer` for the ferry failed
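A compact way to encode both transition sets (a minimal sketch; the guard helper is illustrative, not part of the proposal):

```ts
type CARGO_STAT = 'QUEUED' | 'OFFERING' | 'SUCCEED' | 'FAILED'
type FERRY_STAT = 'QUEUED' | 'ARRANGING' | 'SUCCEED' | 'FAILED'

// Allowed transitions, one entry per arrow in the lists above
const CARGO_TRANSITIONS: Record<CARGO_STAT, CARGO_STAT[]> = {
  QUEUED: ['OFFERING'],
  OFFERING: ['SUCCEED', 'FAILED', 'QUEUED'],
  FAILED: ['SUCCEED'],
  SUCCEED: [],
}

const FERRY_TRANSITIONS: Record<FERRY_STAT, FERRY_STAT[]> = {
  QUEUED: ['ARRANGING'],
  ARRANGING: ['SUCCEED', 'FAILED'],
  SUCCEED: [],
  FAILED: [],
}

// Reject writes that would perform a transition not listed above
function canTransition<S extends string>(allowed: Record<S, S[]>, from: S, to: S): boolean {
  return allowed[from].includes(to)
}

canTransition(CARGO_TRANSITIONS, 'QUEUED', 'OFFERING') // => true
canTransition(FERRY_TRANSITIONS, 'QUEUED', 'SUCCEED')  // => false
```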

Flow

  1. CAR files get inserted into the cargo table once both the R2 write AND the commP write events happen (Consumer stack context)
  2. The cargo table stream consumer attemptFerryLoad lambda is triggered once 100 inserts OR 15 minutes pass (or maybe a CRON job?). The lambda:
    1. queries the DB for a page of stat QUEUED via the indexStat index
    2. sorts the page results by size and attempts to create an aggregate of a compatible size from them. If the size is not enough, it fetches more pages until it either has enough cargo or stops until the next call.
    3. performs a DB transaction that updates cargo items' stat to OFFERING and sets aggregateLink, AND writes an entry to the ferry table with the aggregate information (with constraints to guarantee the previous state is the same). TODO: figure out batching transaction size limitations (see the sketch after this list)
  3. The ferry table stream consumer invokeAggregateOffer lambda is triggered once an INSERT operation happens in the table
    1. invokes aggregate/offer
    2. mutates stat to ARRANGING (in case of failure it will be retried, which is fine given the first operation is idempotent)
  4. A CRON keeps triggering a lambda function to check for receipts for ferries with stat ARRANGING
    1. once a receipt is available, stat is mutated to either SUCCEED or FAILED. If FAILED, the cargo's aggregateFailedCode should also be updated.
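On the transaction-size TODO in step 2.3: DynamoDB's TransactWriteItems caps a single transaction at 100 items, so aggregates with more cargo than that need chunked transactions (and lose all-or-nothing semantics across chunks). A minimal sketch of the step, with illustrative table names:

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, TransactWriteCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// Step 2.3 as a single transaction: create the ferry entry and flip each
// cargo item from QUEUED to OFFERING, failing the whole batch if any item
// is no longer in the expected state.
async function offerCargo(cargoLinks: string[], aggregate: { link: string, size: number }) {
  const now = new Date().toISOString()
  await client.send(new TransactWriteCommand({
    TransactItems: [
      {
        Put: {
          TableName: 'ferry', // illustrative table name
          Item: { link: aggregate.link, size: aggregate.size, stat: 'QUEUED', priority: now, queuedAt: now },
          ConditionExpression: 'attribute_not_exists(link)',
        },
      },
      ...cargoLinks.map((link) => ({
        Update: {
          TableName: 'cargo', // illustrative table name
          Key: { link },
          UpdateExpression: 'SET #stat = :offering, aggregateLink = :aggregate',
          // guarantee the previous state is still QUEUED
          ConditionExpression: '#stat = :queued',
          ExpressionAttributeNames: { '#stat': 'stat' },
          ExpressionAttributeValues: { ':offering': 'OFFERING', ':queued': 'QUEUED', ':aggregate': aggregate.link },
        },
      })),
    ],
  }))
}
```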

What are we missing?

Notes

2. Priority Queue based on a table per State

There are a few tables within the DB Stack acting as a State Machine Queue:

Schema

```ts
interface CargoQueued {
  // CAR file CID
  link: Link
  // Filecoin Piece CID - commP of CAR file
  pieceLink: Link
  // Filecoin Piece Size
  pieceSize: number
  // Priority in the queue - for now likely same as insertedAt
  priority: string
  // Timestamp
  insertedAt: string

  // INDEXES
  // primaryIndex: { partitionKey: link, sortKey: priority },
  // globalIndexes: {
  //   indexPiece: {
  //     partitionKey: 'pieceLink',
  //     sortKey: 'priority',
  //     projection: 'all' // TODO: see https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Projection.html
  //   }
  // }

  // Note that there are no strong guarantees of unique CAR so that we
  // can add failed Cargo again
}

interface CargoOffered extends CargoQueued {
  // Filecoin Aggregate CID - commP of commPs
  aggregateLink: Link

  // INDEXES
  // globalIndexes: {
  //   indexAggregate: {
  //     partitionKey: 'aggregateLink',
  //     sortKey: 'pieceSize',
  //     projection: 'keys_only' // TODO: see https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Projection.html
  //   }
  // }

  // Note that there are no strong guarantees of unique CAR so that we
  // can add failed Cargo again
}

interface CargoProcessed extends CargoOffered {
  // Failed to add into aggregate code
  failedCode?: string
}

interface FerryQueued {
  // Filecoin Aggregate CID - commP of commPs
  link: Link
  // Aggregate size in bytes - TODO: maybe nice to have for metrics
  size: number
  // Priority in the queue - for now likely same as queuedAt
  priority: string
  // Timestamp
  queuedAt: string

  // INDEXES
  // primaryIndex: { partitionKey: link },
}

interface FerryArranged extends FerryQueued {
  // Ferry was stored in SP
  succeed: boolean
}
```
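In this design a state change is a move between tables rather than an update in place. A minimal sketch with illustrative table names, assuming the AWS SDK v3 document client:

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, TransactWriteCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// A CargoQueued -> CargoOffered state change: delete from one table and
// put into the other in a single transaction.
async function moveCargoToOffered(
  cargo: { link: string, pieceLink: string, pieceSize: number, priority: string, insertedAt: string },
  aggregateLink: string
) {
  await client.send(new TransactWriteCommand({
    TransactItems: [
      // primaryIndex of cargo-queued is { partitionKey: link, sortKey: priority }
      { Delete: { TableName: 'cargo-queued', Key: { link: cargo.link, priority: cargo.priority } } },
      { Put: { TableName: 'cargo-offered', Item: { ...cargo, aggregateLink } } },
    ],
  }))
}
```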

Flow

TODO

Notes

3. Priority Queue based on a table per State with Reads from Aggregated Queue from option 1

Notes

4. Maybe SQL is our best bet…

Schema

```sql
-- Enum types first, since the tables below use them

-- State of Cargo state machine
CREATE TYPE CARGO_STAT AS ENUM
(
  'QUEUED',
  'OFFERING',
  'SUCCEED',
  'FAILED'
);

-- State of Ferry state machine
CREATE TYPE FERRY_STAT AS ENUM
(
  'QUEUED',
  'ARRANGING',
  'SUCCEED',
  'FAILED'
);

-- ferry must be created before cargo, which references it
CREATE TABLE ferry
(
  -- Filecoin Aggregate CID - commP of commPs
  link TEXT PRIMARY KEY,
  -- Aggregate size in bytes - TODO: maybe nice to have for metrics
  size BIGINT NOT NULL,
  -- State of the ferry in the pipeline
  stat FERRY_STAT NOT NULL,
  -- Priority in the queue - for now likely same as queuedAt
  priority TEXT NOT NULL,
  -- Timestamp
  queuedAt TIMESTAMP WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL
);

CREATE TABLE cargo
(
  -- CAR file CID
  link TEXT PRIMARY KEY, -- perhaps pieceLink should be the primary key
  -- Filecoin Piece CID - commP of CAR file
  pieceLink TEXT NOT NULL,
  -- Filecoin Piece Size
  pieceSize BIGINT NOT NULL,
  -- State of the cargo in the pipeline
  stat CARGO_STAT NOT NULL,
  -- Priority in the queue - for now likely same as queuedAt
  priority TEXT NOT NULL,
  -- Timestamp
  queuedAt TIMESTAMP WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL,
  -- TODO: Maybe timestamps for other stats?
  -- Filecoin Aggregate CID - commP of commPs
  aggregateLink TEXT REFERENCES ferry(link),
  -- Failed to add into aggregate code
  aggregateFailedCode TEXT
);

CREATE INDEX cargo_stat_idx ON cargo (stat);
```

State Machine

Basically the same as in suggestion 1.

Flow

  1. CAR files get inserted into the cargo table once both the R2 write AND the commP write events happen (Consumer stack context)
  2. A CRON job triggers a lambda function over time. The lambda:
    1. queries the cargo table for a page of stat QUEUED
    2. sorts the page results by size and attempts to create an aggregate of a compatible size from them. If the size is not enough, it fetches more pages until it either has enough cargo or stops until the next call.
    3. performs a DB transaction updating stat to OFFERING and setting aggregateLink, AND inserts an entry into the ferry table with the aggregate information (it is required to guarantee the previous state is the same and that no concurrent job added anything to another aggregate in the meantime - see the sketch after this list)
  3. A CRON job triggers a lambda function over time. The lambda:
    1. queries the ferry table for an entry of stat QUEUED
    2. invokes aggregate/offer to spade-proxy (must be idempotent!!)
    3. mutates stat to ARRANGING in case of partial failure in the second write (the first was the offer invocation)
  4. A CRON keeps triggering a lambda function to check for receipts for ferries with stat ARRANGING
    1. once a receipt is available, stat is mutated to either SUCCEED or FAILED. If FAILED, the cargo's aggregateFailedCode should also be updated.
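A minimal sketch of flow step 2 using node-postgres, with `buildAggregate` standing in for the sorting/packing logic of step 2.2; `FOR UPDATE SKIP LOCKED` is one way to keep concurrent jobs from picking the same cargo:

```ts
import { Client } from 'pg'

interface QueuedCargo {
  link: string
  piecelink: string
  piecesize: number
}

// Grab a page of QUEUED cargo, build an aggregate, and move everything
// in one transaction so concurrent jobs cannot interleave.
async function aggregateQueuedCargo(
  client: Client,
  buildAggregate: (cargo: QueuedCargo[]) => { link: string, size: number, cargoLinks: string[] }
) {
  try {
    await client.query('BEGIN')
    // 2.1. page of QUEUED cargo, locked so concurrent jobs skip these rows
    const { rows } = await client.query<QueuedCargo>(
      `SELECT link, pieceLink, pieceSize FROM cargo
       WHERE stat = 'QUEUED'
       ORDER BY priority
       LIMIT 1000
       FOR UPDATE SKIP LOCKED`
    )
    // 2.2. sort by size and assemble an aggregate of compatible size
    const aggregate = buildAggregate(rows)
    if (aggregate.cargoLinks.length === 0) {
      await client.query('ROLLBACK')
      return
    }
    // 2.3. insert the ferry and flip the selected cargo to OFFERING
    await client.query(
      `INSERT INTO ferry (link, size, stat, priority) VALUES ($1, $2, 'QUEUED', $3)`,
      [aggregate.link, aggregate.size, new Date().toISOString()]
    )
    await client.query(
      `UPDATE cargo SET stat = 'OFFERING', aggregateLink = $1 WHERE link = ANY($2)`,
      [aggregate.link, aggregate.cargoLinks]
    )
    await client.query('COMMIT')
  } catch (err) {
    await client.query('ROLLBACK')
    throw err
  }
}
```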

Notes

5. Maybe SQL is our best bet with Table per State

It is possible to implement idea 2 quite easily thanks to the extra query capabilities that SQL has. There are still a few drawbacks worth flagging.

Conclusions

Based on all the above ideas and their drawbacks, we need to make a decision.

Not being tied to Dynamo and being able to change easily can be good for the future. But what if we are looking at this the wrong way in terms of replicating state?

In particular, the write-only table option would be great, but the added complexity of not having an out-of-the-box guarantee of unique items in the cargo table is a bigger drawback.
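For contrast, the single-table options get that uniqueness guarantee from a conditional put on the primary key; a minimal sketch with an illustrative table name:

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// A second insert of the same CAR is rejected outright with a
// ConditionalCheckFailedException.
async function insertCargoOnce(item: { link: string, pieceLink: string, pieceSize: number }) {
  const now = new Date().toISOString()
  await client.send(new PutCommand({
    TableName: 'cargo',
    Item: { ...item, stat: 'QUEUED', priority: now, queuedAt: now },
    ConditionExpression: 'attribute_not_exists(link)',
  }))
}
```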

Option 1 would work fine if we make sure to re-run the commP of commPs in Flow step 3 before submitting the offer. This would require invalidating bad state that was partially written in the CRON if, for some reason, only part of the TransactWrite batches made it.

Option 4 requires more time-based operations (CRON jobs) and SQL usage, but looks like the option with the fewest gotchas.

Reading https://dynobase.dev/dynamodb-vs-aurora/, it looks like the costs should not be much different, and therefore relying on Aurora is probably our best option.

vasco-santos commented 1 year ago

Closed with https://github.com/web3-storage/w3filecoin/pull/26