schirner commented 1 year ago

Background Story

1. How to get larger transactions so that the simulation speed improves?

Use DMA for larger transfers. The GEMM ACC then needs to support larger transactions (instantiate memory inside the GEMM ACC).
- The kernel has a specific API (Async TX). Example use. Explanation of memory spaces for DMA and their translation (link). In the end, the crypto subsystem has a kernel option, ASYNC_MEMCPY, which provides a (virtual to virtual?) memcpy. However, I did not find a user-level interface for it or how it is exposed. Need to search Kconfig for something that uses ASYNC_CORE or ASYNC_MEMCPY.
  - Only used internally by the RAID driver, not exposed.
  - Probably better to look at the Xilinx DMA Proxy, which is for mem2dev and dev2mem DMAs in the PL. To make it usable, we would need to borrow some ideas from the Xilinx DMA Proxy but add a target physical address to it. To be more precise, we need a virtMem2PhysMem DMA. Xilinx CDMA is mem to mem; this is implemented in the driver, but the example usage does not show it.
- Find old code for memcpy that had DMA support. I measured it on RedHat around 2007 on an x86 architecture: memcpy did not use DMA for copying (at least not directly). Found memcpy source that ultimately uses vm_copy, which is a copy-on-write of the virtual pages.
Use the High Performance Port, where the ACC is master, and let the ACC pull data in larger transactions.
Instantiate / implement a write-back cache in QEMU. Maybe doable hard-coded in the remote port memory link.

2. Avoid QEMU and directly go to SWEmu

Better than everything: move darknet to host-compiled execution with timing annotations. Then we have actual speeds and timing and don't need to worry about anything else. This way students also get timing feedback.

schirner commented 1 year ago

Steps:

clone QEMU
add a write-back cache as outlined
validate with a SystemC memory model that supports bursts; either extend our own or use one of:
https://github.com/Xilinx/libsystemctlm-soc/blob/cd0c84759f95bc98294ca1217ed81809088aee23/tests/test-modules/memory.cc
https://github.com/neu-ece-7368/systemc-examples/tree/main/temporal_decoupling

schirner commented 1 year ago

For write back cache:

make it specific to our memory region (not for MMRs such as the CSR)
any write outside that area flushes the write cache
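
The two rules above could be sketched as a small write-combining buffer. This is only a sketch under assumptions, not QEMU or remote-port code: `WriteBuffer`, `wb_write`, `wb_flush`, the `wb_burst_fn` callback, the 64-byte `WB_CAPACITY`, and the `BurstLog` demo backend are all hypothetical names chosen for illustration. Contiguous writes inside the cached region are merged into one burst; any write outside the region (for example to a CSR) first drains the buffer and then passes through unchanged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WB_CAPACITY 64 /* bytes merged into one burst (assumed value) */

/* Hypothetical backend hook: sends one transaction to the memory model. */
typedef void (*wb_burst_fn)(uint64_t addr, const uint8_t *data, size_t len,
                            void *opaque);

typedef struct {
    uint64_t region_base; /* start of the cached memory region */
    uint64_t region_size; /* size of the cached memory region in bytes */
    uint64_t start;       /* address of the first buffered byte */
    size_t len;           /* number of buffered bytes, 0 means empty */
    uint8_t data[WB_CAPACITY];
    wb_burst_fn burst;
    void *opaque;
} WriteBuffer;

/* Send all buffered bytes as a single burst. */
static void wb_flush(WriteBuffer *wb)
{
    if (wb->len > 0) {
        wb->burst(wb->start, wb->data, wb->len, wb->opaque);
        wb->len = 0;
    }
}

static bool wb_in_region(const WriteBuffer *wb, uint64_t addr, size_t len)
{
    return addr >= wb->region_base && len <= wb->region_size &&
           addr - wb->region_base <= wb->region_size - len;
}

/* Handle one guest write. */
static void wb_write(WriteBuffer *wb, uint64_t addr, const uint8_t *src,
                     size_t len)
{
    if (len > WB_CAPACITY || !wb_in_region(wb, addr, len)) {
        /* Outside the cached region (e.g. a CSR): drain first so the
         * device sees the data before the register write, then pass through. */
        wb_flush(wb);
        wb->burst(addr, src, len, wb->opaque);
        return;
    }
    /* Merge only writes that extend the buffer contiguously and still fit. */
    if (wb->len > 0 &&
        (addr != wb->start + wb->len || wb->len + len > WB_CAPACITY)) {
        wb_flush(wb);
    }
    if (wb->len == 0) {
        wb->start = addr;
    }
    memcpy(wb->data + wb->len, src, len);
    wb->len += len;
    if (wb->len == WB_CAPACITY) {
        wb_flush(wb);
    }
}

/* Demo backend that only records what it was asked to send. */
typedef struct {
    unsigned count;
    uint64_t last_addr;
    size_t last_len;
} BurstLog;

static void log_burst(uint64_t addr, const uint8_t *data, size_t len,
                      void *opaque)
{
    BurstLog *log = opaque;
    (void)data;
    log->count++;
    log->last_addr = addr;
    log->last_len = len;
}
```

With this sketch, sixteen consecutive 4-byte writes into the region reach the backend as one 64-byte transaction, which is the effect wanted for simulation speed. A real implementation would also have to flush before any read of the region and on timeouts or barriers; that is left out here.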

read cache (specific to our memory area)
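
A read cache for the same area could look like the following single-line sketch. All names here (`ReadCache`, `rc_read`, `rc_invalidate`, the `rc_fill_fn` callback, `RC_LINE`, and the `DemoMemory` backend) are hypothetical, and the sketch assumes the region size is a multiple of the line size. On a miss, a whole line is fetched in one burst; later reads within that line are served locally. Invalidation is left to the caller, since when the accelerator may change the memory is not specified here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RC_LINE 64 /* bytes fetched per burst (assumed value) */

/* Hypothetical backend hook: reads len bytes at addr from the memory model. */
typedef void (*rc_fill_fn)(uint64_t addr, uint8_t *dst, size_t len,
                           void *opaque);

typedef struct {
    uint64_t region_base; /* start of the cached memory region */
    uint64_t region_size; /* assumed to be a multiple of RC_LINE */
    uint64_t line_addr;   /* address of the cached line */
    bool valid;
    uint8_t line[RC_LINE];
    rc_fill_fn fill;
    void *opaque;
} ReadCache;

/* Drop the cached line, e.g. after the device may have changed memory. */
static void rc_invalidate(ReadCache *rc)
{
    rc->valid = false;
}

/* Handle one guest read. */
static void rc_read(ReadCache *rc, uint64_t addr, uint8_t *dst, size_t len)
{
    uint64_t off = addr - rc->region_base; /* only used if addr >= base */
    bool cacheable = addr >= rc->region_base && off < rc->region_size &&
                     len <= RC_LINE && off % RC_LINE + len <= RC_LINE;

    if (!cacheable) {
        /* Outside the region or crossing a line: pass through uncached. */
        rc->fill(addr, dst, len, rc->opaque);
        return;
    }
    uint64_t line_addr = addr - off % RC_LINE;
    if (!rc->valid || rc->line_addr != line_addr) {
        /* Miss: fetch the whole line as one burst. */
        rc->fill(line_addr, rc->line, RC_LINE, rc->opaque);
        rc->line_addr = line_addr;
        rc->valid = true;
    }
    memcpy(dst, rc->line + off % RC_LINE, len);
}

/* Demo backend: a small array that counts how often it is accessed. */
typedef struct {
    uint8_t mem[256];
    unsigned fills;
} DemoMemory;

static void demo_fill(uint64_t addr, uint8_t *dst, size_t len, void *opaque)
{
    DemoMemory *m = opaque;
    memcpy(dst, m->mem + addr, len);
    m->fills++;
}
```

In this sketch, sixteen 4-byte reads from one line cost a single backend access. If it is combined with a write cache, writes to the region would also have to update or invalidate the cached line.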

neu-ece-7368 / qemu-xlnx

Improve Speed #1
