stonne-simulator / stonne

STONNE: A Simulation Tool for Neural Networks Engines
MIT License
119 stars 29 forks source link
dnn hardware-acceleration hardware-designs simulator

STONNE: A Simulation Tool for Neural Networks Engines

Bibtex

Please, if you use STONNE, please cite us:

@INPROCEEDINGS{STONNE21,
  author =       {Francisco Mu{\~n}oz-Matr{\'i}nez and Jos{\'e} L. Abell{\'a}n and Manuel E. Acacio and Tushar Krishna},
  title =        {STONNE: Enabling Cycle-Level Microarchitectural Simulation for DNN Inference Accelerators},
  booktitle =    {2021 IEEE International Symposium on Workload Characterization (IISWC)}, 
  year =         {2021},
  volume =       {},
  number =       {},
  pages =        {},
}

Docker image

We have created a docker image for STONNE! Everything is installed in the image so using the simulator is much easier. Also, this image comes with OMEGA and SST-STONNE simulators. More information can be found in our Dockerfile repository.

To pull and run the container, just type the following command:

docker run -it stonnesimulator/stonne-simulators

What is STONNE

The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays. While first-generation accelerator proposals used simple fixed dataflows tailored for dense DNNs, more recent architectures have argued for flexibility to efficiently support a wide variety of layer types, dimensions, and sparsity. As the complexity of these accelerators grows, it becomes more and more appealing for researchers to have cycle-level simulation tools at their disposal to allow for fast and accurate design-space exploration, and rapid quantification of the efficacy of architectural enhancements during the early stages of a design. To this end, we present STONNE (Simulation TOol of Neural Network Engines), a cycle-level, highly-modular and highly-extensible simulation framework that can plug into any high-level DNN framework as an accelerator device and perform end-to-end evaluation of flexible accelerator microarchitectures with sparsity support, running complete DNN models.

Design of STONNE

alt text

STONNE is a cycle-level microarchitectural simulator for flexible DNN inference accelerators. To allow for end-to-end evaluations, the simulator is connected with a Deep Learning (DL) framework (Caffe and Pytorch DL frameworks in the current version). Therefore, STONNE can fully execute any dense and sparse DNN models supported by the DL framework that uses as its front-end. The simulator has been written entirely in C++, following the well-known GRASP and SOLID programming principles of object-oriented design. This has simplified its development and makes it easier the implementation of any kind of DNN inference accelerator microarchitecture, tile configuration mappings and dataflows.

The figure presented above shows a high-level view of STONNE with the three major modules involved in the end-to-end simulation flow:

Flexible DNN Architecture

This constitutes the principal block of STONNE (see the central block in the figure), and it is mainly composed of the modeled microarchitecture of the flexible DNN accelerator (Simulation Engine) to simulate. The source code of this main block is the foundation of the simulator and is located in the 'stonne/src' folder. This contains different classes for the different components of the architecture as well as the principal class 'STONNEModel' which begins the execution of the simulator. This file contains the main functions to call the simulator and therefore can be view as the 'STONNE API' which is the manner in which the input module can interact with the simulated accelerator, configuring its simulation engine according to the user configuration file, enabling different execution modes such as LFF, loading a certain layer and tile, and transferring the weights and the inputs to the simulator memory.

Input Module

Due to the flexibility that the \texttt{STONNE API} provides, the simulator can be fed easily using any of the well-known DL frameworks already available. Currently the simulator supports both Caffe and Pytorch DL frameworks and both front-ends with its connections are located in the folders 'pytorch-frontend' and 'caffe-frontend' respectively. Later in this file, we will explain how to install and run every framework on STONNE.

Furthermore, since these DL frameworks require a more complicated installation and use, apart from this mode of execution, we have also enabled the "STONNE User Interface" that facilitates the execution of STONNE. Through this mode, the user is presented with a prompt to load any layer and tile parameters onto a selected instance of the simulator, and run it with random weights and input values. This allows for faster executions, facilitating rapid prototyping and debugging. This interface can be launched directly in the 'stonne' folder once compiled, and the code is located in 'src/main.cpp' file. Basically, it is a command line that according to hardware and dimensino parameters, allow to run the simulator with random tensors.

Output module

Once a simulation for a certain layer has been completed, this module is used for reporting simulation statistics such as performance, compute unit utilization, number of accesses to SRAM, wires and FIFOs, etc. Besides, this output module also reports the amount of energy consumed and the on-chip area required by the simulated architecture. These statistics obviously depend on the particular data format (e.g., fp16 or int8) utilized to represent the DNN model's parameters. So, STONNE supports different data formats in the whole end-to-end evaluation process and statistics report. For estimating both area and energy consumption, STONNE utilizes a table-based area and energy model, computing total energy using the cycle-level activity stats for each module. For the current table-based numbers existend in STONNE (see 'stonne/energy_tables/' path), we ran synthesis using Synopsys Design-Compiler and place-and-route using Cadence Innovus on each module inside the MAERI and SIGMA RTL to populate the table. Users can plug in the numbers for their own implementations as well.

STONNE Mapper: Module for automatic tile generation

Flexible architectures need to be configured previously to execute any layer, commonly using a tile specification. The specified tile will drastically vary the results obtained in each execution, so choosing a good configuration is an important step when doing simulations. However, finding an optimal configuration is a task that requieres so much time or a deep knowledge of how each mapping would work.

For this reasson, STONNE includes a module called STONNE Mapper. It is a mapper which can automatically generate, given the hardware configuration and the specification of the layer to be simualated, a tile configuration that will be close to the optimum one in most cases. When used, the module will generate an output file with some extra information of the process used to select the tile. STONNE Mapper is an excellent alternative for fast prototyping and a good helper in finding the optimal tile configuration when the search space is very large.

Currently, STONNE Mapper gives support to generate mappings of CONV, FC/DenseGEMM and SparseDense layers (how to use it in every case is explained later). For CONV mappings it integrates and use mRNA, another mapper designed for MAERI presented on ISPASS-2019. For FC/DenseGEMM and SparseDense it implements its own algorithms. In the most of the cases, the generated mappings gets more than the 90% of the performance from optimum configuration (based on the results obtained in our benchmarks), so it has a high trustly degree.

This module and all of the implementations were made in a Final Degree Project in June 2022. Any question about its use or implementation can be made by contacting the author directly (@Adrian-2105).

Supported Architectures

STONNE models all the major components required to build both first-generation rigid accelerators and next-generationflexible DNN accelerators. All the on-chip components are interconnected by using a three-tier network fabric composed of a Distribution Network(DN), a Multiplier Network (MN), and a Reduce Network(RN), inspired by the taxonomy of on-chip communication flows within DNN accelerators. These networks canbe configured to support any topology. Next, we describe the different topologies of the three networks (DN, MN and RN) currently supported in STONNE that are basic building blocks of state-of-the-art accelerators such as the Google’s TPU, Eyeriss-v2, ShDianNao, SCNN, MAERI and SIGMA, among others. These building blocks can also be seen in the figure presented below:

alt text

STONNE User Interface. How to run STONNE quickly.

The STONNE User Interface facilitates the execution of STONNE. Through this mode, the user is presented with a prompt to load any layer and tile parameters onto a selected instance of the simulator, and runs it with random tensors.

Installation

The installation of STONNE, along with its user interface, can be carried out by typing the next commands:

cd stonne
make all

These commands will generate a binary file stonne/stonne. This binary file can be executed to run layers and gemms with any dimensions and any hardware configuration. All the tensors are filled using random numbers.

How to run STONNE

Currently, STONNE runs 5 types of operations: Convolution Layers, FC Layers, Dense GEMMs, Sparse GEMMs and SparseDense GEMMs. Please, note that almost any kernel can be, in the end, mapped using these operations. Others operations such as pooling layers will be supported in the future. However, these are the operations that usually dominate the execution time in machine learning applications. Therefore, we believe that they are enough to perform a comprehensive and realistic exploration. Besides, note that a sparse convolution might be also supported as all the convolution layers can be converted into a GEMM operation using the im2col algorithm.

The syntax of a STONNE user interface command to run any of the available operations is as follows:

./stonne [-h | -CONV | -FC | -DenseGEMM | -SparseGEMM | SparseDense] [Hardware parameters] [Dimension and tile Parameters]

Help Menu

A help menu will be shown when running the next command:

./stonne -h: Obtain further information to run STONNE

Hardware Parameters

The hardware parameters are common for all the kernels. Other parameters can be easily implemented in the simulator. Some parameters are tailored to some specific architectures.

Dimension and tile Parameters

Obviously, the dimensions of the kernel depends on the type of the operation that is going to be run.

If you intend to use STONNE Mapper to generate the tile configuration, note that the tile parameters (T_x) will be ignored and STONNE will only use the configuration it generates. In the same way, if you use STONNE Mapper, there is not need for the user to manually specify the tile parameters.

Next, it is described the different parameters according to each supported operation:

Examples

Example running a CONV layer (manual mapping):

./stonne -CONV -R=3 -S=3 -C=6 -G=1 -K=6 -N=1 -X=20 -Y=20 -T_R=3 -T_S=3 -T_C=1 -T_G=1 -T_K=1 -T_N=1 -T_X_=3 -T_Y_=1 -num_ms=64 -dn_bw=8 -rn_bw=8

Example running a CONV layer generating the tile with STONNE Mapper (energy target):

./stonne -CONV -R=3 -S=3 -C=6 -G=1 -K=6 -N=1 -X=20 -Y=20 -generate_tile=energy -num_ms=64 -dn_bw=8 -rn_bw=8 -accumulation_buffer=1

Example running a FC layer (manual mapping):

./stonne -FC -M=20 -N=20 -K=256 -num_ms=256 -dn_bw=64 -rn_bw=64 -T_K=64 -T_M=2 -T_N=1

Example running a FC layer generating the tile with STONNE Mapper (with mRNA algorithm and performance target):

./stonne -FC -M=20 -N=20 -K=256 -generate_tile=performance -generator=mRNA -num_ms=256 -dn_bw=64 -rn_bw=64 -accumulation_buffer=1

Example of running a DenseGEMM (manual mapping):

/stonne -DenseGEMM -M=20 -N=20 -K=256 -num_ms=256 -dn_bw=64 -rn_bw=64 -T_K=64 -T_M=2 -T_N=1

Example of running a DenseGEMM over TPU:

./stonne -DenseGEMM -M=4 -N=4 -K=16 -ms_rows=4 -ms_cols=4 -dn_bw=8 -rn_bw=16  -T_N=4 -T_M=1 -T_K=1 -accumulation_buffer=1 -rn_type="TEMPORALRN" -mn_type="OS_MESH" -mem_ctrl="TPU_OS_DENSE"

Example of running a SparseGEMM:

./stonne -SparseGEMM -M=20 -N=20 -K=256 -num_ms=128 -dn_bw=64 -rn_bw=64  -MK_sparsity=80 -KN_sparsity=10 -dataflow=MK_STA_KN_STR

Example of running a SparseDense (manual mapping):

./stonne -SparseDense -M=20 -N=20 -K=256 -MK_sparsity=80 -T_N=4 -T_K=32 -num_ms=128 -dn_bw=64 -rn_bw=64 -accumulation_buffer=1

Note that accumulation buffer needs to be set to 1 for the SparseDense case to work

Example of running a SparseDense generating the tile with STONNE Mapper (performance [1] target):

./stonne -SparseDense -M=20 -N=20 -K=256 -MK_sparsity=80 -generate_tile=1 -num_ms=128 -dn_bw=64 -rn_bw=64 -accumulation_buffer=1

Output

Every layer execution generates three files in the path in which the simulator has been executed (the env variable OUTPUT_DIR can be set to indicate another output path):

Note that after the execution, the results obtained in the output tensor by the simulator are compared with a CPU algorithm to ensure the correctness of the simulator. Note that if the simulator does not output the correct results, an assertion will raise at the end of the execution.

Generating Energy Numbers

In order to generate the energy consumption of the execution we have developed a Python script that takes in the counters file generated during the execution and a table-based energy model. The script is located in energy_tables folder and can be run by means of the next command:

./calculate_energy.py [-v] -table_file=<Energy numbers file> -counter_file=<Runtime counters file> -[out_file=<output file>]

The current energy numbers are located in the file energy_tables/energy_model.txt. We obtained the energy numbers through synthesis using Synopsys Design-Compiler and place-and-route using Cadence Innovus on each module inside the MAERI and SIGMA RTL to populate the table. Users can plug in the numbers for their own implementations as well.

PyTorch Frontend

At this point, the user must be familiar with the usage of STONNE and the set of statistics that the tool is able to output. However, with the STONNE user interface presented previously, the user must have realized that the inputs and outputs coming in and out in the simulator are random. Here, it is explained how to run real DNN models using pytorch and STONNE as a computing device.

The pytorch-frontend is located in the folder 'pytorch-frontend' and this basically contains the Pytorch official code Version 1.7 with some extra files to create the simulation operations and link them with the 'stonne/src' code. The current version of the frontend is so well-organized that running a pytorch DNN model on STONNE is straightforward.

Installation

First, you will need Python 3.6 or higher and a C++14 compiler. Also, we highly recommend to create an Anaconda environment before continue. Learn how to install it in their official documentation. Once installed, you can create and activate a new virtual environment with the next commands:

conda create --name stonne python=3.8 
conda activate stonne

Once the environment is ready, you have to install some installation dependencies before continue. You can install them with the next commands:

pip install pyyaml==6.0 setuptools==45.2.0 numpy==1.23.5 

Now you can start to install the PyTorch frontend. First, if you don't have CUDA on your computer, export the next variables:

export MAX_JOBS=1
export NO_CUDA=YES
export NO_CUDA=1

Second, you can build and install PyTorch (torch) from sources using the next commands (it takes around 20 minutes):

cd pytorch-frontend/
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install

Third, you can build and install our PyTorch frontend (torch_stonne) package using the next commands:

cd stonne_connection/
python setup.py install

Finally, to be able to run all the benchmarks, you will need to install some extra dependencies. We recommend you to install the specific versions listed below in order to avoid package dependency problems and overwriting the previous torch installation. You can install them with the next commands:

pip install transformers==4.25.1
pip install torchvision==0.8.2 --no-deps

You can check that the installation was successful by running the next test simulation:

python $STONNE_ROOT/pytorch-frontend/stonne_connection/test_simulation.py

Running PyTorch in STONNE

Running pytorch using STONNE as a device is almost straightforward. Let's assume we define a DNN model using PyTorch. This model is composed of a single and simple convolutional layer. Next, we present this code:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(5,5,5, groups=1) # in_channels=5, out_channels=5, filter_size=5
    def forward(self, x):
        x = self.conv1(x)
        return x

This code can be easily run in CPU just by means of creating an object of type Net and running the forward method with the correct tensor shape as input.

net = Net()
print(net)
input_test = torch.randn(5,50,50).view(-1,5,50,50)
result  = net(input_test)

Migrating this model to STONNE is as simple as turning the Conv2d operation into a SimulatedConv2d operation. Next, we can observe an example:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.SimulatedConv2d(5,5,5,'$PATH_TO_STONNE/simulation_files/maeri_128mses_128_bw.cfg', 'dogsandcats_tile.txt', sparsity_ratio=0.0, stats_path='.', groups=1) 
    def forward(self, x):
        x = self.conv1(x)
        return x

As we can see, we have inserted 4 new parameters:

The addition of these 4 parameters and the modification of the function will let PyTorch run the layer in STONNE obtaining the real tensors.

In the current version of the pytorch-frontend we also support nn.SimulatedLinear and torch_stonne.SimulatedMatmul operations that correspond with both nn.Linear and nn.Matmul operations in the original PyTorch framework. The only need is to change the name of the functions and indicate the 3 extra parameters. We still do not support SparseDense operations on the pytorch-frontend.

Simulation with real benchmarks

In order to reduce the effort of the user, we have already migrated some models to STONNE. By the moment, we have 4 DNN benchmarks in this framework: Alexnet, SSD-mobilenets, SSD-Resnets1.5 and BERT. All of them are in the folder 'benchmarks'. Note that to migrate these models, we have had to understand the code of all of them, locate the main kernels (i.e., convolutions, linear and matrix multiplication operations) and turn the functions into the simulated version. That is the effort you require to migrate a new model. We will update this list over time.

Running these models is straightforward as we have prepared a script (benchmarks/run_benchmarks.py file) that performs all the task automatically. Next, we present one example for each network:

cd benchmarks