occlum / ngo

Next-Gen Occlum, a work-in-progress fork of Occlum that is optimized for the next-generation of Intel SGX (on Xeon SP processors)

[RFC] Introduce SwornDisk in NGO #328

Open lucassong-mh opened 1 year ago

In a nutshell

This RFC issue consists of two parts. The first is a high-level design overview of SwornDisk, which explains its "why, how, what". The second is a review of the SwornDisk-Occlum implementation, which explains the code structure and details of this SGX version.

Design Overview

SwornDisk: A Log-Structured Secure Block Device for TEEs

Objectives: confidentiality, integrity, freshness, anonymity, consistency, and (flush) atomicity

Motivation

Background knowledge

  1. In-place updates (MHT-based approach) vs. out-of-place updates (log-structured approach)

  2. Log-Structured Merge tree (LSM-tree)

Architecture

SwornDisk performs out-of-place data updates. Inside the TEE, it keeps the mapping between the block address that users query (LBA) and the block address where the data is eventually persisted (HBA).

It introduces a tailor-made LSM-tree to index confidential data and uses an MHT only to protect the index itself, which is much smaller than the data. Cascading MHT updates are avoided because all on-disk index content is immutable.

There is also a journal subsystem to summarize on-disk updates to ensure crash consistency and atomicity.

This design minimizes write amplification: each write generates one data block, one or more index records (due to compaction), and one journal record.
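The out-of-place update idea can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the NGO code: `OutOfPlaceDisk` is a made-up name, a `HashMap` stands in for the secure index, and a `Vec` stands in for the disk. It only shows that overwriting an LBA appends a new block and redirects the mapping instead of modifying the old block.

```rust
use std::collections::HashMap;

type Lba = u64; // block address that the user queries
type Hba = u64; // block address where the data actually lives

/// Toy append-only disk (hypothetical): every write goes to a fresh HBA.
struct OutOfPlaceDisk {
    log: Vec<Vec<u8>>,        // log[hba] holds one block
    index: HashMap<Lba, Hba>, // LBA -> HBA mapping, kept inside the TEE
}

impl OutOfPlaceDisk {
    fn new() -> Self {
        Self { log: Vec::new(), index: HashMap::new() }
    }

    /// Appends the block and returns the HBA it was placed at.
    /// The block previously mapped to `lba`, if any, is left untouched.
    fn write(&mut self, lba: Lba, block: &[u8]) -> Hba {
        let hba = self.log.len() as Hba;
        self.log.push(block.to_vec());
        self.index.insert(lba, hba);
        hba
    }

    /// Looks up the current HBA for `lba` and returns the block stored there.
    fn read(&self, lba: Lba) -> Option<&[u8]> {
        let hba = *self.index.get(&lba)?;
        self.log.get(hba as usize).map(|b| b.as_slice())
    }
}
```

Writing the same LBA twice leaves two blocks in the log, and only the newer one is reachable through the index. The stale block is what garbage collection later reclaims.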

Block I/O operations

read()

params: start address LBA, a number of block buffers

  1. Retrieve the HBAs, encryption keys, and MACs of these blocks from the secure index (LSM-tree)
  2. Read the encrypted data blocks from the HBAs and decrypt them
  3. Return the plaintext data to the user after verification
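The read steps above can be sketched as follows. This is only an illustration under loud assumptions: `IndexRecord`, `read_block`, `toy_xor`, and `toy_mac` are hypothetical names, the XOR "cipher" and the hash-based "MAC" are insecure stand-ins chosen so the example needs nothing beyond the standard library, and a `HashMap` replaces the LSM-tree. A real implementation would use an authenticated cipher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// What the secure index returns for one LBA (hypothetical layout).
#[derive(Clone, Copy)]
struct IndexRecord {
    hba: usize, // where the ciphertext block lives
    key: u8,    // toy one-byte key, NOT a real encryption key
    mac: u64,   // toy MAC over the ciphertext
}

/// Insecure stand-in for a MAC: a keyed hash of the ciphertext.
fn toy_mac(key: u8, ciphertext: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    ciphertext.hash(&mut h);
    h.finish()
}

/// Insecure stand-in for a cipher: XOR with the key byte.
/// Applying it twice with the same key restores the input.
fn toy_xor(key: u8, data: &[u8]) -> Vec<u8> {
    data.iter().map(|b| b ^ key).collect()
}

#[derive(Debug, PartialEq)]
enum ReadError {
    NotMapped,
    MacMismatch,
}

/// Step 1: look up HBA, key, and MAC. Step 2: fetch the block.
/// Step 3: verify the MAC and return the plaintext only if it matches.
fn read_block(
    index: &HashMap<u64, IndexRecord>,
    disk: &[Vec<u8>],
    lba: u64,
) -> Result<Vec<u8>, ReadError> {
    let rec = index.get(&lba).ok_or(ReadError::NotMapped)?;
    let ciphertext = disk.get(rec.hba).ok_or(ReadError::NotMapped)?;
    if toy_mac(rec.key, ciphertext) != rec.mac {
        return Err(ReadError::MacMismatch);
    }
    Ok(toy_xor(rec.key, ciphertext))
}
```

The point of the sketch is the data flow: the untrusted disk supplies only ciphertext, while the key and MAC come from the index, so a block tampered with on disk is rejected before any plaintext is returned.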

write()

params: start address LBA, a number of block buffers

  1. Save the data in the segment buffer and notify the user of completion immediately
  2. When the segment buffer becomes full or a flush request is received, encrypt each block with a random key, calculate its MAC, and persist the segment to the allocated disk location
  3. Insert the newly generated index records into the LSM-tree (persisted to the index region) and persist the new journal records to the journal region
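A rough sketch of the buffering and bookkeeping part of the write path is shown below. It is not the NGO implementation: `SegmentWriter` and `BLOCKS_PER_SEGMENT` are hypothetical, a `BTreeMap` stands in for the LSM-tree, a vector of strings stands in for the journal region, and encryption and MAC calculation are left out and only marked with a comment.

```rust
use std::collections::BTreeMap;

const BLOCKS_PER_SEGMENT: usize = 4; // hypothetical segment size

struct SegmentWriter {
    buffer: Vec<(u64, Vec<u8>)>, // (LBA, data) waiting in the segment buffer
    disk: Vec<Vec<u8>>,          // stand-in for the data region; position = HBA
    index: BTreeMap<u64, usize>, // stand-in for the LSM-tree: LBA -> HBA
    journal: Vec<String>,        // stand-in for the journal region
}

impl SegmentWriter {
    fn new() -> Self {
        Self {
            buffer: Vec::new(),
            disk: Vec::new(),
            index: BTreeMap::new(),
            journal: Vec::new(),
        }
    }

    /// Buffers the block and returns at once. The segment is persisted
    /// only when the buffer is full.
    fn write(&mut self, lba: u64, data: &[u8]) {
        self.buffer.push((lba, data.to_vec()));
        if self.buffer.len() == BLOCKS_PER_SEGMENT {
            self.flush();
        }
    }

    /// Persists the buffered blocks, then records index and journal updates.
    fn flush(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        let first_hba = self.disk.len();
        for (lba, data) in self.buffer.drain(..) {
            // A real implementation would encrypt the block with a fresh
            // random key and compute its MAC here; this sketch skips both.
            let hba = self.disk.len();
            self.disk.push(data);
            self.index.insert(lba, hba);
        }
        self.journal
            .push(format!("data log: HBAs {}..{}", first_hba, self.disk.len()));
    }
}
```

Note that this sketch does not serve reads from the segment buffer, so data written but not yet flushed is invisible to the index. A real block device would have to handle that case.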

flush()

params: none

  1. Flush the new data held in the temporary segment buffers to the physical disk
  2. Write journal records to ensure consistency and atomicity

trim()

params: start address LBA, end address LBA

  1. Similar to write, except that no new data is written. Only the index is updated to discard the specified data blocks
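As a small illustration of an index-only trim, the sketch below removes a range of LBAs from a `BTreeMap` that stands in for the secure index and returns the HBAs that became invalid. The function name `trim`, the half-open range `[start_lba, end_lba)`, and direct removal of entries are all assumptions made for this example. The RFC does not say whether the end address is inclusive, and a real LSM-tree would not delete records in place.

```rust
use std::collections::BTreeMap;

/// Drops every mapping with start_lba <= LBA < end_lba (half-open range,
/// an assumption) and returns the HBAs that are no longer referenced.
fn trim(index: &mut BTreeMap<u64, u64>, start_lba: u64, end_lba: u64) -> Vec<u64> {
    if start_lba >= end_lba {
        return Vec::new(); // empty or inverted range: nothing to discard
    }
    // Collect the affected LBAs first so the map is not mutated while iterating.
    let doomed: Vec<u64> = index
        .range(start_lba..end_lba)
        .map(|(lba, _)| *lba)
        .collect();
    doomed
        .into_iter()
        .filter_map(|lba| index.remove(&lba))
        .collect()
}
```

No data block is touched. The returned HBAs are simply candidates for invalidation, which is what later lets segment cleaning reclaim them.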

Garbage Collection (segment cleaning)

SwornDisk's log-structured design lets newer and older versions of the data coexist on disk. When new data is written, the older data must therefore be invalidated so that a later GC pass can reclaim it.

Before every write, SwornDisk retrieves the older index records and invalidates the corresponding HBAs (in the DST).

A periodic GC worker chooses a victim segment, migrates the blocks that are still valid, and frees the data segment.
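One possible shape of the segment cleaning step is sketched below. The RFC does not state how the victim is chosen, so the greedy policy here (pick the segment with the fewest valid blocks) is an assumption, and `Segment`, `choose_victim`, and `clean` are hypothetical names. The per-segment validity bitmap is a simplified stand-in for the validity information kept in the DST.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    id: usize,
    valid: Vec<bool>, // valid[i] turns false once block i is overwritten or trimmed
}

impl Segment {
    fn valid_count(&self) -> usize {
        self.valid.iter().filter(|v| **v).count()
    }
}

/// Greedy choice (an assumption): the segment with the fewest valid blocks
/// needs the least migration work. Ties go to the first such segment.
fn choose_victim(segments: &[Segment]) -> Option<usize> {
    segments
        .iter()
        .min_by_key(|s| s.valid_count())
        .map(|s| s.id)
}

/// Returns the in-segment offsets of the blocks that must be migrated,
/// then marks the whole segment as free.
fn clean(segment: &mut Segment) -> Vec<usize> {
    let to_migrate: Vec<usize> = segment
        .valid
        .iter()
        .enumerate()
        .filter(|(_, v)| **v)
        .map(|(i, _)| i)
        .collect();
    segment.valid.iter_mut().for_each(|v| *v = false);
    to_migrate
}
```

In a fuller version the migrated blocks would go back through the normal write path, which updates the index to their new HBAs, and already free segments would be excluded from victim selection.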

Index region

Journal region

The journal contains a series of records that summarize each on-disk update of the secure data log and the secure index.

SwornDisk realizes consistency based on three internal journal operations: journaling, checkpointing, and recovery.

Journaling

Each on-disk update of the secure data log and the secure index is followed by writing a corresponding journal record for the durability and security of the update.

| Record type | Description |
| --- | --- |
| Data log | Summarizes the update to a data segment (data region) |
| BIT node | Summarizes a new BIT node (index region) |
| BIT compaction | Saves the progress of a BIT compaction |
| Checkpoint pack | Summarizes a new checkpoint pack (checkpoint region) |
| Commit | Marks prior data/index as committed |
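The record types listed above map naturally onto a Rust enum. The sketch below is only a guess at the shape: the variant names follow the list, but every field (`segment_id`, `node_id`, `progress`, `pack_id`) is a hypothetical placeholder, since the RFC does not describe the record layout.

```rust
#[derive(Debug, Clone, PartialEq)]
enum JournalRecord {
    DataLog { segment_id: u64 },     // update to a data segment
    BitNode { node_id: u64 },        // a new BIT node
    BitCompaction { progress: u32 }, // progress of a BIT compaction
    CheckpointPack { pack_id: u64 }, // a new checkpoint pack
    Commit,                          // marks prior data/index as committed
}

impl JournalRecord {
    /// The disk region whose update this record summarizes, where the
    /// list above names one.
    fn region(&self) -> Option<&'static str> {
        match self {
            JournalRecord::DataLog { .. } => Some("data region"),
            JournalRecord::BitNode { .. } => Some("index region"),
            JournalRecord::CheckpointPack { .. } => Some("checkpoint region"),
            JournalRecord::BitCompaction { .. } | JournalRecord::Commit => None,
        }
    }
}
```

Using an enum makes the recovery code exhaustive by construction: a `match` over `JournalRecord` fails to compile if a new record type is added and not handled.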

Checkpointing

To reclaim the disk space consumed by outdated journal records and speed up the recovery process, SwornDisk periodically transforms journal records into a more compact format called checkpoint packs.

Recovery

During recovery, SwornDisk selects the most recent checkpoint pack, from which it initializes its in-memory data structures. Then, it continues reading the rest of the journal, one record at a time, deciding whether it should be accepted to restore SwornDisk to a consistent state.
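How a record gets accepted or rejected during replay is not spelled out here. One plausible rule, assumed for this sketch, is that an update is accepted only if a later commit record covers it, so an uncommitted tail left by a crash is dropped. `Record` and `replay` are hypothetical names, and the `u64` payload is just an identifier for an update that follows the most recent checkpoint pack.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Record {
    Update(u64), // hypothetical id of a data or index update
    Commit,      // marks all prior updates as committed
}

/// Replays the journal one record at a time and returns the ids of the
/// updates that are covered by a later Commit. Updates after the last
/// Commit are discarded (assumed acceptance rule).
fn replay(journal: &[Record]) -> Vec<u64> {
    let mut accepted = Vec::new();
    let mut pending = Vec::new();
    for rec in journal {
        match rec {
            Record::Update(id) => pending.push(*id),
            // `append` moves everything out of `pending`, leaving it empty.
            Record::Commit => accepted.append(&mut pending),
        }
    }
    accepted
}
```

With this rule, a journal of two updates, a commit, and one more update restores only the first two updates, which matches the idea that a flush is atomic with respect to crashes.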

Checkpoint region

Consists of auxiliary data structures for index queries and segment management.


Further discussion

Other important points are worth discussing but are omitted here for lack of space:

Compaction-based, delayed block reclamation; flush atomicity based on commitment; key acquisition and protection flow; space clipping; performance tuning.

Implementation Review

[WIP] ngoiostack