harvard-acc / smaug

SMAUG: Simulating Machine Learning Applications Using Gem5-Aladdin
https://harvard-acc.github.io/smaug_docs
BSD 3-Clause "New" or "Revised" License

Fail to replace accelerator spad with cache: `RuntimeError: Unknown array "inputs".` #110

Open suchandler96 opened 1 year ago

suchandler96 commented 1 year ago

Hi, I'm trying to replace the accelerators' private spads with private caches, following Sam's suggestion in this thread. I also gathered from this thread that the Allcache policy has been removed, so I tried the AllAcp policy instead.

I modified smv-accel.cfg, replacing every line that starts with partition,cyclic with cache,xxx,4 (where xxx is the name of the array; I assume the 4 means 4 bytes, since Aladdin uses fp32). However, the last line of stderr reads something like: Unknown array "inputs". Please ensure that you have declared this array in your Aladdin configuration file with the correct partition type, size, and partition factor, and that you have not renamed pointers in your code (except through function calls).

I also tried running without a cache (i.e., leaving smv-accel.cfg unmodified so the accelerators keep their private spads), in which case data is exchanged with the rest of the memory system over an ACP port. This case does not throw an error and outputs all the stats, so I suspect I have done something wrong in the cache + ACP setting. I'm attaching the zipped directories of both cases here for diagnostics.

I would appreciate any suggestions!

xyzsam commented 1 year ago

Sorry for the very late response.

As I explained in that thread, the AllAcp policy simply indicates how the data is copied, not where the data is stored. That means you can use AllAcp with scratchpads: SMAUG will copy data into the scratchpads over the ACP interface. If you map all your arrays to caches, then Aladdin thinks you actually have a hardware cache attached to your accelerator, in which case there's no need for all the ACP/DMA data copy stuff, because you could just directly access the data with a pointer dereference.

TL;DR: don't combine cache,xxx,4 with AllAcp. SMAUG has been tested with using DMA and ACP to copy data into private scratchpads. It's certainly conceivable to attach a hardware cache and access the data directly, but I don't think the current code supports that - by which I mean that SMAUG would need to bypass all the memcpy work.
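
The distinction drawn above, between how data is copied (DMA or ACP) and where it is stored (scratchpad or cache), can be pictured as two independent settings. The following is only an illustrative sketch: `CopyPolicy`, `Storage`, `AcceleratorSetup`, and `check_setup` are hypothetical names that do not exist in SMAUG or Aladdin, and the only rule encoded is the one stated in this thread (copying into private scratchpads is the tested path).

```python
from dataclasses import dataclass
from enum import Enum


class CopyPolicy(Enum):
    """How data is copied to the accelerator (hypothetical names)."""
    DMA = "dma"
    ACP = "acp"


class Storage(Enum):
    """Where the accelerator keeps its arrays (hypothetical names)."""
    SCRATCHPAD = "spad"
    CACHE = "cache"


@dataclass(frozen=True)
class AcceleratorSetup:
    copy_policy: CopyPolicy
    storage: Storage


def check_setup(setup: AcceleratorSetup) -> str:
    """Classify a setup using only the rule stated in the thread."""
    if setup.storage is Storage.SCRATCHPAD:
        # DMA or ACP copying into private scratchpads is the tested combination.
        return "ok"
    # With a cache, data would be accessed directly, so an explicit copy
    # policy has nothing to copy into; the thread advises against mixing them.
    return "untested: cache-mapped arrays combined with an explicit copy policy"


print(check_setup(AcceleratorSetup(CopyPolicy.ACP, Storage.SCRATCHPAD)))
print(check_setup(AcceleratorSetup(CopyPolicy.ACP, Storage.CACHE)))
```

The point of keeping the two enums separate is that changing the copy policy never changes the storage, which is why switching to AllAcp alone does not give the accelerator a cache.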

suchandler96 commented 1 year ago

Thanks for the suggestion! I have another question, also related to the code structure: how are the proportions of "Accel compute" and "Data Transfer" counted in this repo? As I understand the code, the ScopedStats class acts as a timestamp marker that records a timestamp whenever its constructor or destructor is called. But from this function, I found that the "Tensor preparation end" phase basically corresponds to the tiling time, "Tensor finalization end" to the untiling time, and "Tensor finalization start" to the whole accelerator running time (which I think includes both data transfer time and pure computation time). So how are the data transfer time and the computation time distinguished from each other? Any help is appreciated, thanks!
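
For readers unfamiliar with the pattern, the timestamp-on-construct, timestamp-on-destruct behavior described in the question can be mimicked with a Python context manager. This is a rough analogy under stated assumptions, not SMAUG's ScopedStats: `scoped_marker`, `durations`, and the label format are all hypothetical, and the clock is injectable only so the example is deterministic.

```python
import time
from contextlib import contextmanager


@contextmanager
def scoped_marker(name, log, clock=time.perf_counter_ns):
    """Append a '<name> start' entry on entry and a '<name> end' entry on exit."""
    log.append((f"{name} start", clock()))
    try:
        yield
    finally:
        # Runs even if the body raises, like a destructor at scope exit.
        log.append((f"{name} end", clock()))


def durations(log):
    """Pair each 'X start' with its 'X end' and return elapsed time per name."""
    open_starts = {}
    totals = {}
    for label, timestamp in log:
        name, _, kind = label.rpartition(" ")
        if kind == "start":
            open_starts[name] = timestamp
        else:
            totals[name] = totals.get(name, 0) + timestamp - open_starts.pop(name)
    return totals


# Usage with a fake clock so the numbers are reproducible.
ticks = iter([0, 10, 25, 40])
log = []
with scoped_marker("outer", log, lambda: next(ticks)):
    with scoped_marker("inner", log, lambda: next(ticks)):
        pass
print(durations(log))
```

In this sketch one scope yields exactly one interval, so telling two activities apart inside the same scope would need a separate marker around each of them.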