neurosim / 3D_NeuroSim_V1.0

Benchmark framework of 3D integrated CIM accelerators for popular DNN inference, support both monolithic and heterogeneous 3D integration

different result #2

Open yzh20020301 opened 11 months ago

yzh20020301 commented 11 months ago

I am trying to reproduce the results for the 2D 7nm SRAM design. I use an 8-bit VGG-8 network on the CIFAR-10 dataset. The VGG-8 network model is from DNN_NeuroSim_V1.4.

I set memcelltype = 1, novelMapping = true, SARADC = true, validated = false, synchronous = false, pipeline = false, M3D = false, technode = 7, featuresize = 18e-9, wireWidth = 1, levelOutput = 16, cellBit = 1, heightInFeatureSizeSRAM = 16, widthInFeatureSizeSRAM = 34.43, widthSRAMCellNMOS = 1, numColMuxed = 8

However, I get a readDynamicEnergy of 9.62642e+07 pJ. This differs from the result reported in 'Benchmarking Monolithic 3D Integration for Compute-in-Memory Accelerators: Overcoming ADC Bottlenecks and Maintaining Scalability to 7nm or Beyond', which is: Area: 8.36 mm^2, TOPS/W: 30.30, TOPS: 1.95, Power Density: 7.72e-03 W/mm^2, latency: 600 us, dynamic energy: 35 uJ
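The simulator reports energy in pJ while the paper quotes uJ, so it helps to put both in the same unit before comparing. The sketch below is illustrative only: the helper names `picojoulesToMicrojoules` and `ratioToReference` are hypothetical and are not part of NeuroSim.

```cpp
#include <cassert>

// Hypothetical helper (not a NeuroSim function):
// convert picojoules to microjoules (1 uJ = 1e6 pJ).
constexpr double picojoulesToMicrojoules(double pj) {
    return pj * 1e-6;
}

// Hypothetical helper: how many times larger the simulated value is
// than the reference (paper) value. Both must be in the same unit.
inline double ratioToReference(double simulated, double reference) {
    assert(reference > 0.0);  // a ratio against zero is meaningless
    return simulated / reference;
}
```

With the numbers in this post, `picojoulesToMicrojoules(9.62642e+07)` gives about 96.26 uJ, and `ratioToReference(96.2642, 35.0)` gives about 2.75, so the simulated dynamic energy is roughly 2.75 times the paper's figure.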

Do you have any suggestions to help me get the results similar to those in the paper?

My result is here.

------------------------------ Summary --------------------------------

ChipArea : 9.46458e+06um^2
Chip total CIM array : 3.52389e+06um^2
Total IC Area on chip (Global and Tile/PE local): 931046um^2
Total ADC (or S/As and precharger for SRAM) Area on chip : 2.04312e+06um^2
Total Accumulation Circuits (subarray level: adders, shiftAdds; PE/Tile/Global level: accumulation units) on chip : 1.80574e+06um^2
Other Peripheries (e.g. decoders, mux, switchmatrix, buffers, pooling and activation units) : 1.16078e+06um^2

Chip layer-by-layer readLatency (per image) is: 603729ns
Chip total readDynamicEnergy is: 9.62642e+07pJ
Chip total leakage Energy is: 6.02362e+06pJ
Chip total leakage Power is: 7531.8uW
Chip buffer readLatency is: 314434ns
Chip buffer readDynamicEnergy is: 236904pJ
Chip ic readLatency is: 65154.7ns
Chip ic readDynamicEnergy is: 3.45468e+06pJ

************************ Breakdown of Latency and Dynamic Energy *************************

----------- ADC (or S/As and precharger for SRAM) readLatency is : 173409ns
----------- Accumulation Circuits (subarray level: adders, shiftAdds; PE/Tile/Global level: accumulation units) readLatency is : 10241.2ns
----------- Other Peripheries (e.g. decoders, mux, switchmatrix, buffers, IC, pooling and activation units) readLatency is : 420079ns
----------- ADC (or S/As and precharger for SRAM) readDynamicEnergy is : 8.11379e+07pJ
----------- Accumulation Circuits (subarray level: adders, shiftAdds; PE/Tile/Global level: accumulation units) readDynamicEnergy is : 8.23443e+06pJ
----------- Other Peripheries (e.g. decoders, mux, switchmatrix, buffers, IC, pooling and activation units) readDynamicEnergy is : 6.8919e+06pJ

************************ Breakdown of Latency and Dynamic Energy *************************

----------------------------- Performance -------------------------------
Chip Operation Temperature (K): 313
Energy Efficiency TOPS/W (Layer-by-Layer Process): 12.0428
Throughput TOPS (Layer-by-Layer Process): 2.04038
Throughput FPS (Layer-by-Layer Process): 1656.37
Compute efficiency TOPS/mm^2 (Layer-by-Layer Process): 0.21558
Power Density W/mm^2 (Layer-by-Layer Process): 0.0179011
-------------------------------------- Hardware Performance Done --------------------------------------
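The performance figures in this log can be cross-checked against the summary values. The sketch below assumes that FPS is the reciprocal of the per-image latency and that power density is (dynamic energy + leakage energy) / latency / chip area. These formulas reproduce the printed values, but they are inferred from the numbers and are not confirmed against the NeuroSim source; the function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: frames per second from per-image latency in ns.
inline double framesPerSecond(double latencyNs) {
    assert(latencyNs > 0.0);
    return 1e9 / latencyNs;  // 1 s = 1e9 ns
}

// Hypothetical helper: power density in W/mm^2.
// Energies in pJ, latency in ns, area in um^2 (the units used in the log).
inline double powerDensityWPerMm2(double dynamicPj, double leakagePj,
                                  double latencyNs, double areaUm2) {
    assert(latencyNs > 0.0 && areaUm2 > 0.0);
    double powerW = (dynamicPj + leakagePj) / latencyNs * 1e-3;  // pJ/ns = mW
    double areaMm2 = areaUm2 * 1e-6;                             // 1 mm^2 = 1e6 um^2
    return powerW / areaMm2;
}
```

Under these assumptions, `framesPerSecond(603729)` is about 1656.37 and `powerDensityWPerMm2(9.62642e+07, 6.02362e+06, 603729, 9.46458e+06)` is about 0.0179, matching the FPS and power density lines in the log.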

My 'Param.cpp' is here.

Param::Param() {
    /***************************************** user defined design options and parameters *****************************************/
    operationmode = 2;          // 1: conventionalSequential (Use several multi-bit RRAM as one synapse)
                                // 2: conventionalParallel (Use several multi-bit RRAM as one synapse)

    memcelltype = 1;            // 1: cell.memCellType = Type::SRAM
                                // 2: cell.memCellType = Type::RRAM
                                // 3: cell.memCellType = Type::FeFET

    accesstype = 1;             // 1: cell.accessType = CMOS_access
                                // 2: cell.accessType = BJT_access
                                // 3: cell.accessType = diode_access
                                // 4: cell.accessType = none_access (Crossbar Array)

    transistortype = 1;         // 1: inputParameter.transistorType = conventional

    deviceroadmap = 2;          // 1: inputParameter.deviceRoadmap = HP
                                // 2: inputParameter.deviceRoadmap = LSTP

    globalBufferType = false;    // false: register file
                                // true: SRAM
    globalBufferCoreSizeRow = 128;
    globalBufferCoreSizeCol = 128;

    tileBufferType = false;      // false: register file
                                // true: SRAM
    tileBufferCoreSizeRow = 32;
    tileBufferCoreSizeCol = 32;

    peBufferType = false;        // false: register file
                                // true: SRAM

    chipActivation = true;      // false: activation (reLu/sigmoid) inside Tile
                                // true: activation outside Tile

    reLu = true;                // false: sigmoid
                                // true: reLu

    novelMapping = true;        // false: conventional mapping
                                // true: novel mapping

    SARADC = true;              // false: MLSA
                                // true: sar ADC
    currentMode = true;         // false: MLSA use VSA
                                // true: MLSA use CSA

    pipeline = false;            // false: layer-by-layer process --> huge leakage energy in HP
                                // true: pipeline process
    speedUpDegree = 8;          // 1 = no speed up --> original speed
                                // 2 and more : speed up ratio, the higher, the faster
                                // A speed-up degree upper bound: when there is no idle period during each layer --> no need to further fold the system clock
                                // This idle period is defined by IFM sizes and data flow, the actual process latency of each layer may be different due to extra peripheries

    validated = false;          // false: no calibration factors
                                // true: validated by silicon data (wiring area in layout, gate switching activity, post-layout performance drop...)

    synchronous = false;            // false: asynchronous
                                // true: synchronous, clkFreq will be decided by sensing delay

    M3D = false;                 // false: run 2D simulation
                                // true: run M3D simulation

    /*** algorithm weight range, the default wrapper (based on WAGE) has fixed weight range of (-1, 1) ***/
    algoWeightMax = 1;
    algoWeightMin = -1;

    /*** conventional hardware design options ***/
    clkFreq = 1e9;                      // Clock frequency
    temp = 300;                         // Temperature (K)
    // technode: 130     --> wireWidth: 175
    // technode: 90      --> wireWidth: 110
    // technode: 65      --> wireWidth: 105
    // technode: 45      --> wireWidth: 80
    // technode: 32      --> wireWidth: 56
    // technode: 22      --> wireWidth: 40
    // technode: 14      --> wireWidth: 25
    // technode: 10, 7   --> wireWidth: 18
    technode = 7;                      // Technology
    featuresize = 18e-9;                // Wire width for subArray simulation
    wireWidth = 18;                     // wireWidth of the cell for Accuracy calculation
    globalBusDelayTolerance = 0.1;      // to relax bus delay for global H-Tree (chip level: communication among tiles), if tolerance is 0.1, the latency will be relax to (1+0.1)*optimalLatency (trade-off with energy)
    localBusDelayTolerance = 0.1;       // to relax bus delay for local H-Tree (tile level: communication among PEs), if tolerance is 0.1, the latency will be relax to (1+0.1)*optimalLatency (trade-off with energy)
    treeFoldedRatio = 4;                // the H-Tree is assumed to be able to fold in layout (save area)
    maxGlobalBusWidth = 2048;           // the max buswidth allowed on chip level (just an upper bound, the actual bus width is defined according to the auto floorplan)
                                        // NOTE: Carefully choose this number!!!
                                        // e.g. when use pipeline with high speedUpDegree, i.e. high throughput, need to increase the global bus width (interface of global buffer) --> guarantee global buffer speed

    numRowSubArray = 128;               // # of rows in single subArray
    numColSubArray = 128;               // # of columns in single subArray

    /*** option to relax subArray layout ***/
    relaxArrayCellHeight = 0;           // relax ArrayCellHeight or not
    relaxArrayCellWidth = 0;            // relax ArrayCellWidth or not

    numColMuxed = 8;                    // How many columns share 1 ADC (for eNVM and FeFET) or parallel SRAM
    levelOutput = 16;                   // # of levels of the multilevelSenseAmp output, should be in 2^N forms; e.g. 32 levels --> 5-bit ADC
    cellBit = 1;                        // precision of memory device 

    /*** parameters for SRAM ***/
    // due to scaling, suggested SRAM cell size above 22nm: 160F^2
    // SRAM cell size at 14nm: 300F^2
    // SRAM cell size at 10nm: 400F^2
    // SRAM cell size at 7nm: 600F^2
    heightInFeatureSizeSRAM = 16;        // SRAM Cell height in feature size  
    widthInFeatureSizeSRAM = 34.43;        // SRAM Cell width in feature size  
    widthSRAMCellNMOS = 1;                            
    widthSRAMCellPMOS = 1;
    widthAccessCMOS = 1;
    minSenseVoltage = 0.1;

    /*** parameters for analog synaptic devices ***/
    heightInFeatureSize1T1R = 4;        // 1T1R Cell height in feature size
    widthInFeatureSize1T1R = 12;         // 1T1R Cell width in feature size
    heightInFeatureSizeCrossbar = 2;    // Crossbar Cell height in feature size
    widthInFeatureSizeCrossbar = 2;     // Crossbar Cell width in feature size

    resistanceOn = 6e3;               // Ron resistance at Vr in the reported measurement data (need to recalculate below if considering the nonlinearity)
    resistanceOff = 6e3*150;           // Roff resistance at Vr in the reported measurement data (need to recalculate below if considering the nonlinearity)
    maxConductance = (double) 1/resistanceOn;
    minConductance = (double) 1/resistanceOff;

    readVoltage = 0.5;                  // On-chip read voltage for memory cell
    readPulseWidth = 10e-9;             // read pulse width in sec
    accessVoltage = 1.1;                // Gate voltage for the transistor in 1T1R
    resistanceAccess = resistanceOn*IR_DROP_TOLERANCE;            // resistance of access CMOS in 1T1R
    writeVoltage = 2;                   // Enable level shifter if writeVoltage > 1.5V

    /*** Calibration parameters ***/
        alpha = 1.44;   // wiring area of level shifter
        beta = 1.4;     // latency factor of sensing cycle
        gamma = 0.5;    // switching activity of DFF in shifter-add and accumulator
        delta = 0.15;   // switching activity of adder 
        epsilon = 0.05; // switching activity of control circuits
        zeta = 1.22;    // post-layout energy increase

    /***************************************** user defined design options and parameters *****************************************/

    /***************************************** Initialization of parameters NO need to modify *****************************************/

    if (memcelltype == 1) {
        cellBit = 1;             // force cellBit = 1 for all SRAM cases
    }

    /*** initialize operationMode as default ***/
    conventionalParallel = 0;
    conventionalSequential = 0;
    BNNparallelMode = 0;                
    BNNsequentialMode = 0;              
    XNORsequentialMode = 0;          
    XNORparallelMode = 0;         
    switch(operationmode) {
        case 6:     XNORparallelMode = 1;               break;     
        case 5:     XNORsequentialMode = 1;             break;     
        case 4:     BNNparallelMode = 1;                break;     
        case 3:     BNNsequentialMode = 1;              break;     
        case 2:     conventionalParallel = 1;           break;     
        case 1:     conventionalSequential = 1;         break;     
        default:    printf("operationmode ERROR\n");    exit(-1);
    }

    /*** parallel read ***/
    parallelRead = 0;
    if(conventionalParallel || BNNparallelMode || XNORparallelMode) {
        parallelRead = 1;
    } else {
        parallelRead = 0;
    }

    /*** Initialize interconnect wires ***/
    switch(wireWidth) {
        case 175:   AR = 1.60; Rho = 2.20e-8; break;  // for technode: 130
        case 110:   AR = 1.60; Rho = 2.52e-8; break;  // for technode: 90
        case 105:   AR = 1.70; Rho = 2.68e-8; break;  // for technode: 65
        case 80:    AR = 1.70; Rho = 3.31e-8; break;  // for technode: 45
        case 56:    AR = 1.80; Rho = 3.70e-8; break;  // for technode: 32
        case 40:    AR = 1.90; Rho = 4.03e-8; break;  // for technode: 22
        case 25:    AR = 2.00; Rho = 5.08e-8; break;  // for technode: 14
        case 18:    AR = 2.00; Rho = 6.35e-8; break;  // for technode: 7, 10
        case -1:    break;  // Ignore wire resistance or user define
        default:    puts("Wire width out of range"); exit(-1);  // print the message before exiting
    }

    if (memcelltype == 1) {
        wireLengthRow = wireWidth * 1e-9 * heightInFeatureSizeSRAM;
        wireLengthCol = wireWidth * 1e-9 * widthInFeatureSizeSRAM;
    } else {
        if (accesstype == 1) {
            wireLengthRow = wireWidth * 1e-9 * heightInFeatureSize1T1R;
            wireLengthCol = wireWidth * 1e-9 * widthInFeatureSize1T1R;
        } else {
            wireLengthRow = wireWidth * 1e-9 * heightInFeatureSizeCrossbar;
            wireLengthCol = wireWidth * 1e-9 * widthInFeatureSizeCrossbar;
        }
    }
    Rho *= (1+0.00451*abs(temp-300));
    if (wireWidth == -1) {
        unitLengthWireResistance = 1.0; // Use a small number to prevent numerical error for NeuroSim
        wireResistanceRow = 0;
        wireResistanceCol = 0;
    } else {
        unitLengthWireResistance =  Rho / ( wireWidth*1e-9 * wireWidth*1e-9 * AR );
        wireResistanceRow = unitLengthWireResistance * wireLengthRow;
        wireResistanceCol = unitLengthWireResistance * wireLengthCol;
    }
    /***************************************** Initialization of parameters NO need to modify *****************************************/
}

neurosim commented 10 months ago

I'm not positive, as I didn't contribute to that paper, but it looks like the ADC accounts for the additional dynamic energy. Have you tried using a Flash ADC? You could also try both the current-mode ADC and the voltage-mode ADC, and you could enable the chip validated mode (validated=true). Finally, I'd recommend running the 2D simulations in NeuroSim V1.4 to see if there are any discrepancies between 3D NeuroSim and NeuroSim V1.4. Hope this helps!
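To see how much of the dynamic energy comes from the ADC, each breakdown entry in the posted log can be divided by the chip total. This is only a rough sketch; `energyShare` is a hypothetical helper, not a NeuroSim function.

```cpp
#include <cassert>

// Hypothetical helper: fraction of the total energy taken by one component.
// Both arguments must use the same unit (pJ in the log above).
inline double energyShare(double componentPj, double totalPj) {
    assert(totalPj > 0.0);  // the total must be positive for a meaningful share
    return componentPj / totalPj;
}
```

Using the posted breakdown, `energyShare(8.11379e+07, 9.62642e+07)` is about 0.84, so the ADC (or S/As and precharger) contributes roughly 84% of the chip's readDynamicEnergy. The accumulation circuits (8.23443e+06 pJ) and other peripheries (6.8919e+06 pJ) come to about 9% and 7%.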