harvard-acc / smaug

SMAUG: Simulating Machine Learning Applications Using Gem5-Aladdin
https://harvard-acc.github.io/smaug_docs
BSD 3-Clause "New" or "Revised" License

Memory policy of AllCache causes SMAUG to crash in simulation #102

Closed. xyzsam closed this issue 2 years ago

xyzsam commented 2 years ago

Reported by user daecheol.you@samsung.com:

While examining the SMAUG simulator, I noticed that three memory interfaces are available: DMA, ACP, and cache.

It seems that ACP supports I/O coherency, and cache supports full coherency.

I ran the minerva sample model with the ACP interface successfully, but there is a problem when the memory interface is set to cache.

This is the procedure I followed:

  • Generate the minerva pbtxt and pb files with the 'AllCache' memory policy by modifying the Python model script as shown below:

    with sg.Graph(name="minerva_smv_cache", backend="SMV", mem_policy=sg.AllCache) as graph:

  • Modify model_files so that 'topo_file' and 'params_file' point to the generated pbtxt and pb files.

  • Generate the dynamic_trace_acc0.gz file with the trace.sh script.

  • Set 'memory_type' in gem5.cfg to cache.
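The last step above can also be scripted. This is only a minimal sketch, assuming gem5.cfg is an INI-style file that Python's configparser can read. The section name `acc0`, the starting value `spad`, and the `cache_size` value are hypothetical placeholders, not copied from the SMAUG repository; only the `memory_type` key and the `cache` value come from the steps above.

```python
import configparser
import io

# Hypothetical stand-in for gem5.cfg. The section name and the initial
# values are assumptions for illustration only.
SAMPLE_CFG = """\
[acc0]
memory_type = spad
cache_size = 32kB
"""


def set_memory_type(cfg_text: str, section: str, memory_type: str) -> str:
    """Return a copy of cfg_text with memory_type replaced in one section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    parser.set(section, "memory_type", memory_type)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = set_memory_type(SAMPLE_CFG, "acc0", "cache")
print(updated)
```

In a real checkout, the text would be read from and written back to the actual gem5.cfg, using whatever section name that file defines.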

When the simulation starts, page mapping occurs as shown below:

40086105600: system.acc0_datapath: Setting host_a to memory type cache.
40086199200: system.acc0_datapath: Setting host_b to memory type cache.
40086235200: system.acc0_datapath: Setting host_results to memory type cache.
40090416000: system.acc0_datapath: Inserting array label mapping host_results -> vpn 0x3739a0, size 512.
40090416000: system.acc0_datapath: Mapping vaddr 0x3739a0 -> paddr 0x1a839a0.
40090416000: system.acc0_datapath: Inserting TLB entry vpn 0x373000 -> ppn 0x1a83000.
40092144000: system.acc0_datapath: Inserting array label mapping host_a -> vpn 0x3aafa0, size 1568.
40092144000: system.acc0_datapath: Mapping vaddr 0x3aafa0 -> paddr 0x1abafa0.
40092144000: system.acc0_datapath: Inserting TLB entry vpn 0x3aa000 -> ppn 0x1aba000.
40092144000: system.acc0_datapath: Mapping vaddr 0x3abfa0 -> paddr 0x1abbfa0.
40092144000: system.acc0_datapath: Inserting TLB entry vpn 0x3ab000 -> ppn 0x1abb000.
40093368000: system.acc0_datapath: Inserting array label mapping host_b -> vpn 0x3e4e80, size 25088.
40093368000: system.acc0_datapath: Mapping vaddr 0x3e4e80 -> paddr 0x1bd7e80.
40093368000: system.acc0_datapath: Inserting TLB entry vpn 0x3e4000 -> ppn 0x1bd7000.
40093368000: system.acc0_datapath: Mapping vaddr 0x3e5e80 -> paddr 0x1bd8e80.
40093368000: system.acc0_datapath: Inserting TLB entry vpn 0x3e5000 -> ppn 0x1bd8000.
...
40093368000: system.acc0_datapath: Mapping vaddr 0x3ebe80 -> paddr 0x1bdee80.
40093368000: system.acc0_datapath: Inserting TLB entry vpn 0x3eb000 -> ppn 0x1bde000.
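To make the log easier to read, here is a small sketch of how the page-aligned TLB entries follow from each array's base address and size. It assumes 4 KiB pages, which is consistent with the vpn/ppn values in the log but is not stated in it. The function names are illustrative and are not part of gem5-aladdin.

```python
from typing import List

PAGE_SIZE = 4096  # assumption: 4 KiB pages, inferred from the log values


def page_base(addr: int) -> int:
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


def pages_spanned(vaddr: int, size: int) -> List[int]:
    """List the base address of every page touched by [vaddr, vaddr + size)."""
    first = page_base(vaddr)
    last = page_base(vaddr + size - 1)
    return list(range(first, last + PAGE_SIZE, PAGE_SIZE))


# Base addresses and sizes copied from the "Inserting array label mapping" lines.
arrays = {
    "host_results": (0x3739A0, 512),
    "host_a": (0x3AAFA0, 1568),
    "host_b": (0x3E4E80, 25088),
}

for name, (vaddr, size) in arrays.items():
    print(name, [hex(p) for p in pages_spanned(vaddr, size)])
```

Under that assumption, host_results fits in one page, host_a straddles two (0x3aa000 and 0x3ab000), and host_b covers 0x3e4000 through 0x3eb000, matching the TLB entries in the log.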

However, at the start of execution, a memory access to an unexpected address occurs:

40094148000: system.acc0_datapath: issueTLBRequestTiming for trace addr: 0xd3a8c0

Thus, simulation fails with the error message below:

fatal: An error occurred during cache access to trace virtual address 0xd3a8c0 at node 70: Could not find a virtual address mapping for array "". Please ensure that you have called mapArrayToAccelerator() with the correct array name parameter.
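The failure can be illustrated with a simple range check: the trace address in the error does not fall inside any of the three arrays registered in the log. This is a sketch of the idea only, not gem5-aladdin's actual lookup code. The `find_array` helper is hypothetical, and the mapping table is copied from the "Inserting array label mapping" log lines.

```python
from typing import Optional

# (array name, base virtual address, size in bytes), taken from the log.
MAPPINGS = [
    ("host_results", 0x3739A0, 512),
    ("host_a", 0x3AAFA0, 1568),
    ("host_b", 0x3E4E80, 25088),
]


def find_array(vaddr: int) -> Optional[str]:
    """Return the name of the mapped array containing vaddr, or None."""
    for name, base, size in MAPPINGS:
        if base <= vaddr < base + size:
            return name
    return None


print(find_array(0x3AAFA0 + 16))  # host_a
print(find_array(0xD3A8C0))       # None: the address from the fatal error
```

Because 0xd3a8c0 lies well above every mapped range, no array name can be resolved for it, which is consistent with the empty array name in the error message.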

Did I configure something incorrectly, or have I misunderstood how SMAUG operates?

I would really appreciate any advice on this.

I am able to reproduce this issue. It looks like it is mostly due to the array name not being looked up correctly for a host memory access when the memory is accessed directly via virtual memory (aka caching). I suspect that since this memory policy has not been used very heavily in the past, the code regressed relative to DMA or ACP, which have seen heavier use. Still investigating.

xyzsam commented 2 years ago

Apart from https://github.com/harvard-acc/ALADDIN/issues/43, the other question is why this only happens with MemoryPolicy = AllCache, rather than DMA or ACP. The reason here is that in gem5-aladdin, when we say "map this array to a cache", we really mean "replace the scratchpad for this array with an L1 cache entirely". It applies only to the memory on the accelerator side. But in the SMAUG context, the memory policy of "AllCache" just means "when you copy data from the host to the accelerator, get the data from the host through normal virtual memory, not by sending DMA or ACP requests". In other words, "caching" in a MemoryPolicy refers to the mechanism of how you get the data, not the physical place where data is stored.

This mode of fetching data is not supported in gem5-aladdin because it's really no different from using ACP as the transport mechanism. I will send a patch to remove AllCache as a valid MemoryPolicy, because it's not.

If you were interested in replacing the scratchpads on the accelerators with a cache, that's done differently. Go into the smv-accel.cfg file (which configures the accelerator) and replace all the "partition,cyclic" lines with "cache". Then configure the cache itself in gem5.cfg by changing memory_type=cache and updating cache_size=xxkB. This will require https://github.com/harvard-acc/smaug/pull/104 to fix a small bug with missing files. Once that's submitted, update SMAUG's submodules (git submodule update), and it should work for you (I just tested it).
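Before editing smv-accel.cfg by hand, it can help to list the lines that need changing. The sketch below only locates lines that begin with "partition,cyclic"; it does not rewrite them, because the exact fields expected on a "cache" line are not given here and should be checked against the Aladdin documentation. The sample text is a hypothetical stand-in: its array names and numeric fields are invented for illustration.

```python
from typing import List, Tuple

# Hypothetical stand-in for smv-accel.cfg. Array names and numeric fields
# are invented; only the "partition,cyclic" prefix comes from the comment above.
SAMPLE_ACCEL_CFG = """\
partition,cyclic,inputs,32768,2,8
partition,cyclic,weights,32768,2,8
some_other_setting,1
"""


def find_scratchpad_lines(cfg_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) for every 'partition,cyclic' entry."""
    hits = []
    for lineno, line in enumerate(cfg_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("partition,cyclic"):
            hits.append((lineno, stripped))
    return hits


for lineno, line in find_scratchpad_lines(SAMPLE_ACCEL_CFG):
    print(f"line {lineno}: {line}")
```

Each reported line is a candidate for conversion to a "cache" entry as described above.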