This paper focuses on FPGAs in embedded scenarios. Mainstream vendors' SoCs already include an IOMMU, but there is no infrastructure that lets applications use SVM through that IOMMU; the authors argue a new abstraction is needed.
In many cases accelerator traffic is more bandwidth-sensitive than latency-sensitive
In our previous works, we have explored lightweight SVM support for PMCAs based on a software-managed IOTLB considering applications based on regular [20] and irregular (pointer-rich) [21] memory access patterns, and exploring PMCA-local IOTLB management [22].
At the heart of any IOMMU design sits an input/output translation lookaside buffer (IOTLB).
Instead, a separate and empty I/O page table is generated at setup time. The first TLB miss to every page then generates a costly page fault that must be handled in software by the host by mapping the corresponding page to the I/O page table. The hardware management only helps for subsequent TLB misses on pages already mapped. Alternatively, all pages holding the shared data must be mapped at offload time, which is impracticable when operating on pointer-rich data structures. Finally, due to the decoupling of the I/O and the process’ page table, the only way to ensure that the IOMMU does not use stale page-table data at any time is to prevent the mapped pages from being moved by page pinning, which further aggravates the cost for mapping and page fault handling.
Why the authors implement a soft IOMMU on the FPGA instead of using the host IOMMU:
My feeling is that pinning memory for I/O devices may work today for devices such as RDMA NICs and VFIO passthrough, but for the many heterogeneous devices to come, which must work on a task together with the CPU, it is impossible to determine before execution which memory they will access, so pinning memory is not a sustainable approach. Implementing I/O page faults (IOPF) is necessary.
In the case of a TLB miss or page fault, the interrupt handler inside the driver module simply triggers the execution of the worker thread in normal process context. Once this worker thread gets scheduled, it first reads the address and transaction attributes from the IOMMU hardware and pins the requested userspace page in memory using get_user_pages(). Then, it maps the pinned page to the I/O page table in case the hard-macro IOMMU is used, or performs virtual-to-physical address translation and sets up a new entry in the TLB if the soft IOMMU is used.
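A minimal kernel-side sketch of that worker-thread path, assuming a layout of my own (this is not the paper's driver code; read_miss_fifo(), soft_iommu_tlb_insert(), use_hard_macro and the domain pointer are placeholders, and the exact get_user_pages()/iommu_map() signatures vary across kernel versions):

```c
/* Hypothetical sketch of the miss/fault worker described above; locking and
 * error handling are elided, placeholder declarations are marked as such. */
#include <linux/mm.h>
#include <linux/iommu.h>
#include <linux/workqueue.h>

struct miss_entry { unsigned long va; u32 id; u32 axi_user; };

extern struct iommu_domain *domain;     /* hypothetical: hard-macro IOMMU domain */
extern bool use_hard_macro;             /* hypothetical: which IOMMU variant is active */
extern struct miss_entry read_miss_fifo(void);  /* hypothetical MMIO read of the miss FIFO */
extern void soft_iommu_tlb_insert(unsigned long va, phys_addr_t pa, u32 id); /* hypothetical */

static void svm_fault_worker(struct work_struct *work)
{
	struct miss_entry m = read_miss_fifo();   /* VA + transaction attributes */
	struct page *page;

	/* Pin the requested userspace page (GUP signature is kernel-dependent). */
	if (get_user_pages(m.va & PAGE_MASK, 1, FOLL_WRITE, &page) != 1)
		return;   /* a real driver would report an unrecoverable fault here */

	if (use_hard_macro)
		/* Hard-macro IOMMU: mirror the mapping into the separate I/O page table. */
		iommu_map(domain, m.va & PAGE_MASK, page_to_phys(page),
			  PAGE_SIZE, IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
	else
		/* Soft IOMMU: write the VA->PA translation directly into the FPGA IOTLB. */
		soft_iommu_tlb_insert(m.va & PAGE_MASK, page_to_phys(page), m.id);
}
```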
Question: for a hard-macro IOMMU, isn't an I/O page fault non-recoverable without PRI? How can it be handled here?
If a transaction misses in the TLB, its VA, ID and the AXI User Signals are stored inside the miss first-in, first-out buffers (FIFOs), and an interrupt is sent to the host CPU.
If the translation misses in the IOTLB, the host is asked to establish the mapping; this resembles the PRI and ATS concepts defined for PCIe devices.
In parallel, the IOMMU drops the transaction and signals a slave error in the AXI Read/Write Response back to the wrapper core inside the FPGA. The IOMMU does not block and can continue to handle address translations from other transactions to shared memory issued by the accelerator.
After a page fault, the IOMMU on the FPGA aborts the current transaction and keeps serving other transactions.
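A behavioral sketch of that non-blocking miss path (plain C model under my own naming, not the actual RTL; miss_fifo_t, AXI_RESP_SLVERR, tlb_lookup() and raise_host_irq() are illustrative):

```c
/* Behavioral model of the soft IOMMU's miss path; names and types are illustrative. */
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t va; uint32_t id; uint32_t axi_user; } miss_entry_t;
typedef struct { miss_entry_t slots[16]; unsigned head, tail; } miss_fifo_t;

enum axi_resp { AXI_RESP_OKAY = 0, AXI_RESP_SLVERR = 2 };

extern bool tlb_lookup(uint64_t va, uint64_t *pa);  /* hit -> translated physical address */
extern void raise_host_irq(void);                   /* interrupt line to the host CPU */

/* Translate one AXI transaction. On a miss the transaction is dropped with a
 * slave error and queued for the host; the IOMMU does not block. */
enum axi_resp translate(miss_fifo_t *f, uint64_t va, uint32_t id,
                        uint32_t axi_user, uint64_t *pa_out)
{
	if (tlb_lookup(va, pa_out))
		return AXI_RESP_OKAY;

	/* Record VA + transaction attributes for the host's fault worker. */
	f->slots[f->tail++ % 16] = (miss_entry_t){ .va = va, .id = id, .axi_user = axi_user };
	raise_host_irq();
	return AXI_RESP_SLVERR;   /* transaction aborted; others keep translating */
}
```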
Our results show that, due to limitations in the low-level drivers and kernel APIs, the performance of hard-macro IOMMUs can be dominated by handling page faults.
Main idea
With SVM, the IOMMU page table used by the FPGA accelerator and the MMU page table used by the CPU describe the same address space, so the FPGA can access memory directly through virtual addresses instead of relying on memory copies. The authors argue that for today's FPGA accelerators to share virtual addresses with the CPU, an SVM framework must be designed.
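A small user-space sketch of what this buys, under an assumed offload API (svm_offload() is hypothetical, not the framework's real interface): the host hands the accelerator the virtual address of a pointer-rich structure directly, with no staging copy.

```c
/* Hypothetical user-space view of an SVM offload. */
#include <stdlib.h>

struct node { int key; struct node *next; };   /* pointer-rich shared data */

extern int svm_offload(const char *kernel, void *arg);  /* hypothetical runtime call */

int main(void)
{
	struct node *head = malloc(sizeof *head);
	head->key = 42;
	head->next = NULL;

	/* With SVM the accelerator walks the list through the same virtual
	 * addresses the CPU uses; no marshalling into a contiguous DMA buffer. */
	return svm_offload("traverse_list", head);
}
```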
The framework provided by this work consists of two parts:
The authors argue that the IOMMU in existing SoCs can neither handle page faults on demand nor let the kernel handle them (it takes too long), so memory must be pinned. They therefore implement a soft IOMMU (essentially an IOTLB) on the FPGA and handle page faults through it, bypassing the host IOMMU: on an IOTLB miss, a page-fault interrupt is sent to the host, the host resolves it, and the transaction being translated is aborted. The authors explore several IOTLB designs for different scenarios (latency-sensitive, bandwidth-sensitive, or hybrid) and evaluate them.
The SVM framework they implement can use either their own soft IOMMU on the FPGA or the host's hard-macro IOMMU.
Key insight
An IOTLB implemented on the FPGA handles I/O page faults.
My comments
Implementing an IOTLB on the device side to handle I/O page faults achieves much the same effect as PRI.