This RFC issue consists of two parts. The first is a high-level design overview of SwornDisk, which explains its "why, how, what". The second is an implementation review of SwornDisk-Occlum, which explains the code structure and details of this SGX version.
Design Overview
SwornDisk: A Log-Structured Secure Block Device for TEEs
Objectives: confidentiality, integrity, freshness, anonymity, consistency, and (flush) atomicity
Motivation
Existing solutions for protecting on-disk data for TEEs (eCryptfs, fscrypt, dm-crypt, SGX-PFS) are far from satisfactory in terms of both security and performance.
SGX-PFS in particular has both a performance issue (slow random writes due to 2 × H write amplification) and a security vulnerability (unanticipated snapshot attacks, CVE-2022-27499, our website).
Unanticipated snapshot attack: The adversary can capture and replay transient on-disk states (caused by cache eviction in the TEE) that users are unaware of.
Background knowledge
In-place updates (Merkle Hash Tree, MHT-based approach) vs. out-of-place updates (log-structured approach)
Random writes are slower than sequential writes on HDDs and SSDs
Write amplification: 2 × H (H is the height of the MHT) vs. 1 + ϵ (ϵ ≪ 1)
Log-Structured Merge tree (LSM-tree)
A leveled, ordered, disk-oriented index structure for KV stores. The core idea is to use append-only (sequential) writes to suit write-intensive workloads and to avoid the fragmented writes of B-trees.
Data is held in memory in a MemTable and persisted on disk in Sorted String Table (SST) files.
Read performance suffers as a result; LSM-trees use Bloom filters and compaction to minimize the degradation.
Use cases: Bigtable, HBase, LevelDB, RocksDB
Workflow: KV pair → MemTable → Sorted String Table → Minor compaction to L0 → Major compaction to Li
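The workflow above can be sketched as a toy in-memory LSM-tree. This is only an illustration of the general technique, not SwornDisk's index: the class name `ToyLSMTree`, the `memtable_limit` parameter, and the two-level layout are assumptions made for the example, and Bloom filters are omitted.

```python
import bisect


class ToyLSMTree:
    """Toy LSM-tree: a mutable MemTable plus immutable sorted runs (SSTs)."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}   # newest writes, held in memory
        self.level0 = []     # list of sorted runs, oldest first
        self.level1 = []     # one merged sorted run

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the MemTable into a sorted, immutable run in L0."""
        if self.memtable:
            self.level0.append(sorted(self.memtable.items()))
            self.memtable = {}

    def major_compaction(self):
        """Merge all L0 runs into L1; newer values shadow older ones."""
        merged = dict(self.level1)
        for run in self.level0:  # oldest to newest
            merged.update(run)
        self.level1 = sorted(merged.items())
        self.level0 = []

    @staticmethod
    def _search(run, key):
        # (key,) sorts just before any (key, value) pair with the same key.
        i = bisect.bisect_left(run, (key,))
        if i < len(run) and run[i][0] == key:
            return run[i][1]
        return None

    def get(self, key):
        """Look in the MemTable first, then in runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.level0):
            value = self._search(run, key)
            if value is not None:
                return value
        return self._search(self.level1, key)
```

Every write lands in the MemTable and reaches disk only as part of a sorted run, which is why writes stay sequential while reads may have to probe several runs.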
Architecture
SwornDisk performs out-of-place data updates. Inside the TEE, it keeps the mapping between the block address that users query (LBA) and the block address where the data is eventually persisted (HBA).
It introduces a tailor-made LSM-tree to index confidential data and uses an MHT only to protect the index itself, which is much smaller than the data. Cascading MHT updates are avoided because all on-disk index content is immutable.
A journal subsystem summarizes on-disk updates to ensure crash consistency and atomicity.
This design minimizes write amplification: each write generates one data block, one or more index records (due to compaction), and one journal record.
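To make the 2 × H vs. 1 + ϵ comparison concrete, the following sketch computes both figures. The record sizes and the number of times an index record is rewritten by compaction are hypothetical values chosen for illustration; they are not measurements of SwornDisk.

```python
def mht_write_amplification(tree_height):
    """In-place MHT-based update: 2 x H block writes per user block."""
    return 2 * tree_height


def log_structured_write_amplification(block_size, index_record_size,
                                       journal_record_size, index_rewrites=1):
    """1 + epsilon: one data block plus small index and journal records.

    index_rewrites models how often compaction rewrites the index record
    (an assumed parameter, not a figure from the design).
    """
    extra_bytes = index_rewrites * index_record_size + journal_record_size
    return 1 + extra_bytes / block_size


# Hypothetical sizes: 4 KiB blocks, 64-byte index and journal records,
# each index record rewritten three times by compaction.
in_place = mht_write_amplification(tree_height=4)
out_of_place = log_structured_write_amplification(4096, 64, 64, index_rewrites=3)
```

With these assumed sizes the in-place figure is 8 and the log-structured figure is 1.0625, so ϵ stays small as long as the records are much smaller than a block.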
Block I/O operations
read()
params: start address LBA, a number of block buffers
Retrieve the HBAs, encryption keys, and MACs of these blocks from the secure index (LSM-tree)
Read the encrypted data blocks from the HBAs and decrypt them
Return the plaintext data to the user after verification
write()
params: start address LBA, a number of block buffers
Save the data in the segment buffer and notify the user of completion immediately
When the segment buffer becomes full or a flush request is received:
Encrypt each block with a random key, calculate its MAC, and persist the segment to the allocated disk location
Insert the newly generated index records into the LSM-tree (persisted to the index region) and persist the new journal records to the journal region
flush()
params: none
Flush the new data in the temporary segment buffers to the physical disk
Write journal records to ensure consistency and atomicity
trim()
params: start address LBA, end address LBA
Similar to write(), except that no new data is written; only the index is updated to discard the specified data blocks
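The four operations above can be sketched as a toy block device. This is a minimal illustration under stated assumptions, not the SwornDisk implementation: `ToySecureDisk` and all of its attribute names are hypothetical, a plain dictionary stands in for the LSM-tree index, the journal is omitted, `trim()` treats the end address as exclusive, and the SHA-256 keystream is only a stand-in for real authenticated encryption.

```python
import hashlib
import hmac
import os

BLOCK_SIZE = 16  # tiny blocks keep the example readable


def _keystream_xor(key, data):
    """Toy stream cipher (SHA-256 in counter mode); NOT a real AEAD."""
    out = bytearray()
    for counter in range(0, len(data), 32):
        pad = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[counter:counter + 32], pad))
    return bytes(out)


class ToySecureDisk:
    def __init__(self):
        self.disk = {}            # untrusted storage: HBA -> ciphertext
        self.index = {}           # trusted index: LBA -> (HBA, key, MAC)
        self.segment_buffer = {}  # pending writes: LBA -> plaintext
        self.next_hba = 0         # log head; blocks are only ever appended

    def write(self, lba, blocks):
        """Buffer the blocks and return immediately."""
        for offset, block in enumerate(blocks):
            if len(block) != BLOCK_SIZE:
                raise ValueError("block has the wrong size")
            self.segment_buffer[lba + offset] = block

    def flush(self):
        """Encrypt, MAC, and persist buffered blocks out of place."""
        for lba, plaintext in sorted(self.segment_buffer.items()):
            key = os.urandom(16)  # fresh random key per block
            ciphertext = _keystream_xor(key, plaintext)
            mac = hmac.new(key, ciphertext, hashlib.sha256).digest()
            hba, self.next_hba = self.next_hba, self.next_hba + 1
            self.disk[hba] = ciphertext
            self.index[lba] = (hba, key, mac)
        self.segment_buffer.clear()

    def read(self, lba, count):
        """Look up each block, verify its MAC, and decrypt it."""
        result = []
        for addr in range(lba, lba + count):
            if addr in self.segment_buffer:  # written but not yet flushed
                result.append(self.segment_buffer[addr])
                continue
            hba, key, mac = self.index[addr]  # KeyError if never written
            ciphertext = self.disk[hba]
            expected = hmac.new(key, ciphertext, hashlib.sha256).digest()
            if not hmac.compare_digest(mac, expected):
                raise ValueError(f"integrity check failed for LBA {addr}")
            result.append(_keystream_xor(key, ciphertext))
        return result

    def trim(self, start_lba, end_lba):
        """Drop index entries only; no data is written."""
        for addr in range(start_lba, end_lba):
            self.index.pop(addr, None)
            self.segment_buffer.pop(addr, None)
```

Rewriting an LBA never overwrites its old HBA; the index simply points to a new location, which is what leaves stale blocks behind for garbage collection.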
Garbage Collection (segment cleaning)
SwornDisk's log-structured design lets newer and older versions of data coexist on disk, so whenever new data is written, the older data must be invalidated for GC to reclaim later.
Before every write, SwornDisk retrieves the older index records and invalidates the corresponding HBAs (in the DST).
A periodic GC worker chooses a victim segment, migrates its still-valid blocks, and frees the segment.
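A segment-cleaning pass along these lines might look as follows. The function names, the dictionary-based DST and RIT, and the greedy "fewest valid blocks" victim policy are all assumptions made for this sketch; the text above does not specify how the victim is chosen.

```python
def invalidate(dst, hba, blocks_per_segment):
    """Mark the block at `hba` as stale in the Data Segment Table."""
    segment, slot = divmod(hba, blocks_per_segment)
    dst[segment][slot] = False


def pick_victim(dst):
    """Greedy policy (an assumption): the segment with the fewest valid blocks."""
    valid_counts = {segment: sum(bitmap) for segment, bitmap in dst.items()}
    return min(valid_counts, key=valid_counts.get)


def collect(dst, rit, victim, blocks_per_segment, rewrite):
    """Migrate the still-valid blocks out of `victim`, then free it.

    `rewrite(lba, old_hba)` is a caller-supplied callback that re-appends
    the block to the log; the RIT (HBA -> LBA) tells us which LBA it holds.
    """
    for slot, valid in enumerate(dst[victim]):
        hba = victim * blocks_per_segment + slot
        if valid:
            rewrite(rit[hba], hba)
        rit.pop(hba, None)  # the old location no longer maps to anything
    del dst[victim]         # the segment is free again


# Two segments of four blocks each; three blocks of segment 1 are stale.
dst = {0: [True] * 4, 1: [True] * 4}
rit = {hba: 100 + hba for hba in range(8)}
for stale_hba in (4, 5, 7):
    invalidate(dst, stale_hba, 4)
moved = []
victim = pick_victim(dst)
collect(dst, rit, victim, 4, lambda lba, hba: moved.append((lba, hba)))
```

Here only one block has to be migrated before the whole segment is reclaimed, which is why invalidating old blocks at write time pays off.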
Index region
Disk-oriented secure LSM-tree (dsLSM-tree): Organizes the disk content directly on a raw disk, without the help of a file system.
Block Index Table (BIT): Replaces the traditional SST. A BIT integrates an MHT with a B+ tree. Each node is fixed-size and protected with authenticated encryption.
Leaf nodes: Array of data records [ LBA → (HBA, Key, MAC) ]
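The idea of protecting an array of index records with a hash tree can be sketched as follows. This is a simplified stand-in, not the BIT node format: it uses a binary Merkle tree with SHA-256 rather than a B+ tree with authenticated encryption, and the record encoding and function names are assumptions.

```python
import hashlib


def leaf_hash(record):
    """Hash one index record: (LBA, HBA, key, MAC)."""
    lba, hba, key, mac = record
    data = lba.to_bytes(8, "big") + hba.to_bytes(8, "big") + key + mac
    return hashlib.sha256(b"\x00" + data).digest()  # 0x00 marks a leaf


def merkle_root(records):
    """Fold the leaf hashes pairwise until a single root hash remains."""
    level = [leaf_hash(record) for record in records]
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Because a BIT is immutable once written, its root has to be computed only once; any later change to a record on disk would produce a different root and be detected.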
Journal region
The journal contains a series of records that summarize each on-disk update of the secure data log and the secure index.
Each record contains cryptographic information about the corresponding on-disk update.
Journal blocks (each composed of multiple records) are chained together: each block embeds the MAC of the previous one.
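The MAC chaining of journal blocks can be illustrated with a short sketch. The block layout (a dictionary with `prev_mac`, `records`, and `mac`), the HMAC-SHA256 construction, and the all-zero MAC for the first block are assumptions for this example; records are concatenated without length prefixes, which a real format would not do.

```python
import hashlib
import hmac

GENESIS_MAC = b"\x00" * 32  # assumed "previous MAC" of the first block


def append_block(journal, key, records):
    """Append a journal block that embeds the MAC of the previous block."""
    prev_mac = journal[-1]["mac"] if journal else GENESIS_MAC
    payload = prev_mac + b"".join(records)
    mac = hmac.new(key, payload, hashlib.sha256).digest()
    journal.append({"prev_mac": prev_mac, "records": list(records), "mac": mac})


def verify_chain(journal, key):
    """Return how many leading blocks form a valid, unbroken chain."""
    prev_mac = GENESIS_MAC
    for position, block in enumerate(journal):
        payload = block["prev_mac"] + b"".join(block["records"])
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        if block["prev_mac"] != prev_mac:
            return position  # a block was dropped, reordered, or replayed
        if not hmac.compare_digest(block["mac"], expected):
            return position  # the block's content was modified
        prev_mac = block["mac"]
    return len(journal)
```

Chaining means an adversary cannot silently drop or reorder a journal block in the middle: the next block's embedded MAC would no longer match.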
SwornDisk realizes consistency based on three internal journal operations: journaling, checkpointing, and recovery.
Journaling
Each on-disk update of the secure data log and the secure index is followed by writing a corresponding journal record for the durability and security of the update.
| Record type | Description |
| --- | --- |
| Data log | Summarizes the update to a data segment (data region) |
| BIT node | Summarizes a new BIT node (index region) |
| BIT compaction | Saves the progress of a BIT compaction |
| Checkpoint pack | Summarizes a new checkpoint pack (checkpoint region) |
| Commit | Marks prior data/index as committed |
Checkpointing
To reclaim the disk space consumed by outdated journal records and speed up the recovery process, SwornDisk periodically transforms journal records into a more compact format called checkpoint packs.
The checkpoint region preserves backups of the BITC, SVT, DST, and RIT.
A checkpoint pack consists of the creation timestamp, the head and tail positions of the secure journal, and the bitmaps used to choose valid backups for recovery.
Recovery
During recovery, SwornDisk selects the most recent checkpoint pack and initializes its in-memory data structures from it. It then reads the rest of the journal one record at a time, deciding whether each record should be accepted, to restore SwornDisk to a consistent state.
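The recovery procedure can be sketched roughly as below. The acceptance rule used here (records take effect only once a later commit record is seen) is an assumption inferred from the description of the commit record type, and the field names `timestamp`, `journal_head`, and `state` as well as the `(kind, payload)` record shape are hypothetical.

```python
def recover(checkpoints, journal):
    """Rebuild the LBA -> HBA mapping after a crash.

    checkpoints: list of dicts with 'timestamp', 'journal_head', 'state'.
    journal: list of (kind, payload) tuples; a data payload is (lba, hba).
    """
    # Start from the most recent checkpoint pack.
    latest = max(checkpoints, key=lambda pack: pack["timestamp"])
    state = dict(latest["state"])

    # Replay the journal from the position recorded in the checkpoint.
    pending = []
    for kind, payload in journal[latest["journal_head"]:]:
        if kind == "commit":
            for lba, hba in pending:  # everything before a commit is accepted
                state[lba] = hba
            pending = []
        else:
            pending.append(payload)

    # Records after the last commit are discarded (assumed policy).
    return state


checkpoints = [
    {"timestamp": 1, "journal_head": 0, "state": {}},
    {"timestamp": 5, "journal_head": 2, "state": {1: 10}},
]
journal = [
    ("data", (1, 10)), ("commit", None),
    ("data", (2, 20)), ("commit", None),
    ("data", (3, 30)),  # never committed before the crash
]
recovered = recover(checkpoints, journal)
```

In this example the update to LBA 3 is dropped because no commit record followed it, so the recovered state reflects only committed flushes.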
Checkpoint region
Consists of auxiliary data structures for index queries and segment management:
Block Index Table Catalog (BITC): Records the metadata of each BIT [ BIT ID, level, key range, root node ]
Used to manage the LSM-tree's BITs
Segment Validity Table (SVT): A bitmap where each bit indicates whether a segment is valid
Used for allocation/deallocation of data/index segments
Data Segment Table (DST): Contains per-segment metadata of the data segments (valid block bitmap)
Used to manage the invalidation of blocks in each segment, and for GC
Reverse Index Table (RIT): Mapping from HBAs to LBAs
Used for GC
Further discussion
Other important points that are worth discussing but are omitted here for lack of space:
Compaction-based, delayed block reclamation; flush atomicity based on commitment; key acquisition and protection flow; space clipping; performance tuning.
Implementation Review
[WIP]