Summary - CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels

Introduction

The verification of concurrent OS kernels is difficult due to the complex interdependencies among kernel components and the need to support multiple forms of concurrency. A novel approach for building a certified concurrent OS kernel, where interleaved execution of kernel and user modules across different layers of abstraction are allowed. The authors emphasize on formally specifying these layers and their observable events, and then verifying each module at its proper abstraction level. A strong contextual refinement property is proved for each function to support certified linking with other CPUs or threads. The authors have successfully developed a practical concurrent OS kernel, verified its functional correctness in Coq, and written it in 6500 lines of C and x86 assembly. It runs on stock x86 multicore machines and is considered the first proof of functional correctness of a complete, general-purpose concurrent OS kernel with fine-grained locking.

CertiKOS, a novel extensible architecture for building certified concurrent OS kernels. The architecture uses contextual refinement and environment contexts to untangle kernel interdependencies and encapsulate interference among different kernel objects. This allows for standard verification techniques to be applied to concurrent programs, making the verification task simpler and easily automatable. The authors have successfully developed a fully certified concurrent OS kernel (mC2) that supports fine-grained locking and thread yield/sleep/wakeup primitives. The entire proof effort for supporting concurrency took less than 2 person-years. The rest of the paper provides an overview of the CertiKOS architecture, describes how environment contexts are used, presents the design and development of the mC2 kernel, and provides an evaluation of CertiKOS.

Overview of the approach The research goal in building certified OS kernels is to find the best design and development methodologies to build reliable, secure, and efficient computer systems in a cost-effective way. Evaluation metrics include support for new kernel design, kernel performance, verification of global properties, quality of kernel specification, cost of development and maintenance, and quality of formal proofs. CertiKOS is a new architecture for building certified concurrent operating system kernels. It leverages new certified programming methodologies and provides compositional layer semantics to support the parallel composition of sequential layers and ensure their soundness. To support concurrency, each layer interface is parameterized with an active thread set, and its set of valid environment contexts is defined. The whole-machine semantics is then obtained by combining per-thread machines. The mC2 kernel is a concurrent operating system that doubles as a hypervisor and runs on a multicore Intel Core i7 machine. It contains various shared objects such as spinlock modules, sleep and pending queues, memory management modules, and IPC modules. The mC2 kernel has been successfully built and certified using CertiKOS, and its contextual refinement property with respect to a high-level deep specification has been proven, ensuring that all system calls and traps will follow the high-level specification and run safely and terminate eventually. However, the mC2 kernel is not as comprehensive as real-world kernels and has some limitations, such as the assumption of strong sequential consistency, the lack of a certified storage system, and only covering a small part of the full x86 instruction set. The mC2 kernel also relies on a bootloader, PreInit module, and ELF loader, which are left for future verification.

Machine with local copy of shared memory A method for verifying concurrent programs running on a multicore system. The authors introduce a new machine model which allows reasoning about a subset of active CPUs by specifying the expected behaviors of the non-active CPUs, called the environment context. The environment context returns a list of events generated by the context programs when given the current log. Certifying the mC2 Kernel A verification method for complex kernels using Contextual Refinement is decomposed into a series of logical abstraction layers, which are designed in a way to untangle complex interdependent kernel components into a well-organized kernel object stack. The method focuses on two main components in the mC2 kernel: the pre-initialization module and the trap handler. The authors prove the mutual exclusion and starvation-freedom of a ticket lock implementation in the mC2 kernel using a fair hardware scheduler. The proof is based on the fairness of the scheduler and the assumption that between any two consecutive events from the same thread, there are at most m events generated by other threads.

Virtual memory management is a technique used by operating systems to provide an illusion of large, continuous memory space to the processes running on a computer. This is achieved by mapping virtual memory addresses used by processes to the physical memory addresses. The primitives used for manipulating the mapping between virtual and physical memory (page maps) are proven to be correct and the initialization procedure sets up the mapping properly, taking into account the hardware address translation mechanism.

Shared memory management is a technique used to allow multiple processes to access the same physical memory pages. The physical pages are mapped into the virtual address space of each process that needs to access them, and a protocol is used to manage the shared access to the pages. The owner set is a data structure that maintains information about the processes that have access to each physical page. The purpose of the owner set is to keep track of which processes have access to the shared memory and to ensure that the memory is used in a consistent manner. The shared queue library is a software component that provides a higher-level abstraction for queues implemented as doubly-linked lists. Instead of working with the underlying implementation, processes interact with abstract queue states represented as Coq lists. The library defines local enqueue and dequeue operations that operate on the abstract lists, and each shared queue is associated with a lock to control access and ensure consistency. The lock is used to protect the shared queue data structures from concurrent access and modification by multiple processes.

Thread management is the process of controlling and organizing the resources used by dynamically created threads in an operating system. It involves the introduction of a thread control block to manage the resources and meta-data of each thread, such as quotas, children, and thread state. A kernel stack is allocated for each thread, with a size of one page (4KB). An external tool was used to verify that the stack usage of the compiled kernel is less than 4KB, ensuring that stack overflows cannot occur inside the kernel. This helps to maintain the stability and reliability of the system by avoiding the risk of memory corruption due to stack overflow.

A starvation-free condition variable is a synchronization object used in concurrent programming to allow threads to wait for a change in a shared state that is protected by a lock. Unlike standard Mesa-style condition variables, a starvation-free condition variable guarantees that a waiting thread will be signaled within a bounded number of steps, avoiding the situation where a thread is stuck waiting indefinitely. Implementation of a starvation-free condition variable uses condition queues, as proposed by Anderson and Dahlin. This ensures that all waiting threads have a fair chance to be notified, avoiding the risk of starvation and improving the overall performance and reliability of the system.

Thread-local machine models are models of concurrent systems that are built based on the thread management layers. The construction of these models starts by extending the environment context with a software scheduler, which abstracts the concrete scheduling procedure. The scheduling primitives generate yield and sleep events, and the extended environment context responds with the next thread ID to be executed. This allows the thread-local machine model to capture the behavior of the system in terms of the interactions between the threads, the scheduler, and the environment. By using a thread-local machine model, it is possible to analyze and verify properties of the concurrent system, such as safety, liveness, and performance, before it is actually implemented.

Evaluation: The mC2 kernel provides two types of spinlocks: the ticket lock and the MCS lock. These two locks have the same interface and can be used interchangeably, providing users with flexibility in their implementation. To evaluate the performance of these spinlocks, an empty critical section (payload) is protected by a single lock, and the latency is measured by taking 10,000 consecutive lock acquisitions and releases (transactions) on each round. This allows for a direct comparison between the two locks, and for the evaluation of their performance in different conditions and configurations.

To measure the performance of the mC2 kernel, the system calls Call and ReplyWait of seL4 are replaced with mC2's synchronous send and receive calls. The results show that when the buffer size is zero, mC2 takes approximately 3800 cycles to perform a round trip Inter-Process Communication (IPC), while seL4's fastpath IPC takes roughly 1200 cycles and its slowpath IPC takes 1800 cycles. When the message size is larger than 2 words, seL4's fastpath IPC falls back to the slowpath. In the case of a 10-word IPC, mC2's round trip IPC takes 3820 cycles, while seL4 takes 1830 cycles. These results suggest that mC2 has a slower performance than seL4 for small buffer sizes, but performs similarly for larger buffer sizes.

In order to evaluate mC2's performance as a hypervisor, its performance with KVM (another popular hypervisor) by running several macro benchmarks on an Ubuntu 12.04.2 LTS guest operating system was compared. The guest operating system was installed on an internal SSD drive, while both KVM and mC2 were installed on a USB stick. The results showed that mC2 performed similarly to KVM when the same conditions were applied. The tests were performed with standard 4KB pages and no huge pages were used.

Related Work: Dijkstra's idea of decomposing a complex program into a hierarchy of abstract machines was applied in the Hierarchical Development Methodology (HDM) by the PSOS team at SRI. This method was used to design and specify an operating system using 20 hierarchically organized modules, and was also used for the KSOS system. Later, new languages and tools were developed for building certified abstraction layers with deep specifications and for constructing fully certified (sequential) OS kernels in Coq. The layer methodology was extended to build certified kernels and device drivers running on multiple logical CPUs, treating the driver stack for each device as if it were running on a dedicated logical CPU. However, none of these systems can support shared-memory concurrency with fine-grained locking.

Conclusion: In summary, the authors have introduced a new approach for building certified concurrent operating system kernels that are both efficient in terms of assembly implementation and have contextual correctness proofs. This approach is based on the Hierarchical Development Methodology and uses a layered methodology for building OS kernels. The new approach offers a clean and rigorous specification of all kernel components, making it practical for building certified concurrent kernels. The authors believe that this approach will pave the way for building more reliable, secure, and extensible system software in the future.

tapaswenipathak / linux-kernel-stats

Summary - CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels #106