tensorflow / tflite-micro

Infrastructure to enable deployment of ML models to low-power resource-constrained embedded targets (including microcontrollers and digital signal processors).
Apache License 2.0

TCN models containing expand dims and conv1d layers induce "Arena size is too small issue" #438

Closed swapnilsayansaha closed 3 years ago

swapnilsayansaha commented 3 years ago

I am trying to deploy some lightweight TCN models to an STM32L476RG Nucleo board running Mbed OS.

Example model 1 (the TCN layer is from the keras-tcn library, https://github.com/philipperemy/keras-tcn):

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 400, 6)]     0                                            
__________________________________________________________________________________________________
tcn (TCN)                       (None, 32)           31840       input_1[0][0]                    
__________________________________________________________________________________________________
tf.reshape (TFOpLambda)         (None, 32, 1)        0           tcn[0][0]                        
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, 16, 1)        0           tf.reshape[0][0]                 
__________________________________________________________________________________________________
flatten (Flatten)               (None, 16)           0           max_pooling1d[0][0]              
__________________________________________________________________________________________________
pre (Dense)                     (None, 32)           544         flatten[0][0]                    
__________________________________________________________________________________________________
velx (Dense)                    (None, 1)            33          pre[0][0]                        
__________________________________________________________________________________________________
vely (Dense)                    (None, 1)            33          pre[0][0]                        
==================================================================================================
Total params: 32,450
Trainable params: 32,450
Non-trainable params: 0

The above model consumes 162 kB of flash (out of 1 MB) and a total of ~50 kB of RAM (out of 128 kB) on the Mbed board (including other dependencies) with a tensor arena of 32*1000 bytes.

Example model 2 (which is even lighter!):

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            [(None, 140, 1)]     0                                            
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 140, 2)       22          input_2[0][0]                    
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 140, 2)       8           conv1d[0][0]                     
__________________________________________________________________________________________________
zero_padding1d (ZeroPadding1D)  (None, 140, 2)       0           batch_normalization[0][0]        
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 140, 11)      242         zero_padding1d[0][0]             
__________________________________________________________________________________________________
dropout (Dropout)               (None, 140, 11)      0           conv1d_1[0][0]                   
__________________________________________________________________________________________________
zero_padding1d_1 (ZeroPadding1D (None, 140, 11)      0           dropout[0][0]                    
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 140, 11)      1331        zero_padding1d_1[0][0]           
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 140, 11)      44          conv1d_2[0][0]                   
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 140, 11)      22          batch_normalization[0][0]        
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 140, 11)      0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 140, 11)      44          conv1d_3[0][0]                   
__________________________________________________________________________________________________
add (Add)                       (None, 140, 11)      0           dropout_1[0][0]                  
                                                                 batch_normalization_2[0][0]      
__________________________________________________________________________________________________
activation (Activation)         (None, 140, 11)      0           add[0][0]                        
__________________________________________________________________________________________________
zero_padding1d_2 (ZeroPadding1D (None, 140, 11)      0           activation[0][0]                 
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, 140, 11)      1331        zero_padding1d_2[0][0]           
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 140, 11)      44          conv1d_4[0][0]                   
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 140, 11)      0           batch_normalization_3[0][0]      
__________________________________________________________________________________________________
zero_padding1d_3 (ZeroPadding1D (None, 140, 11)      0           dropout_2[0][0]                  
__________________________________________________________________________________________________
conv1d_5 (Conv1D)               (None, 140, 11)      1331        zero_padding1d_3[0][0]           
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 140, 11)      44          conv1d_5[0][0]                   
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 140, 11)      0           batch_normalization_4[0][0]      
__________________________________________________________________________________________________
add_1 (Add)                     (None, 140, 11)      0           dropout_3[0][0]                  
                                                                 activation[0][0]                 
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 140, 11)      0           add_1[0][0]                      
__________________________________________________________________________________________________
zero_padding1d_4 (ZeroPadding1D (None, 140, 11)      0           activation_1[0][0]               
__________________________________________________________________________________________________
conv1d_6 (Conv1D)               (None, 140, 11)      1331        zero_padding1d_4[0][0]           
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 140, 11)      44          conv1d_6[0][0]                   
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, 140, 11)      0           batch_normalization_5[0][0]      
__________________________________________________________________________________________________
zero_padding1d_5 (ZeroPadding1D (None, 140, 11)      0           dropout_4[0][0]                  
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (None, 140, 11)      1331        zero_padding1d_5[0][0]           
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 140, 11)      44          conv1d_7[0][0]                   
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, 140, 11)      0           batch_normalization_6[0][0]      
__________________________________________________________________________________________________
add_2 (Add)                     (None, 140, 11)      0           dropout_5[0][0]                  
                                                                 activation_1[0][0]               
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 140, 11)      0           add_2[0][0]                      
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 1540)         0           activation_2[0][0]               
__________________________________________________________________________________________________
dense (Dense)                   (None, 5)            7700        flatten_1[0][0]                  
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 5)            0           dense[0][0]                      
==================================================================================================
Total params: 14,913
Trainable params: 14,777
Non-trainable params: 136

The above model, in quantized form, consumes 33 kB of flash and a total of ~50 kB of RAM (out of 128 kB) on the Mbed board (including other dependencies) with a tensor arena of 32*1000 bytes.

The Mbed code is as follows:

#include "tensorflow/lite/micro/micro_profiler.h"
#include "tensorflow/lite/micro/all_ops_resolver.h"
//#include "tensorflow/lite/micro/examples/hello_world/constants.h"
#include "model.h"
//#include "tensorflow/lite/micro/examples/hello_world/output_handler.h"
#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/system_setup.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "mbed.h"
#include "BMX160/bmx160.h"

Timer t;
//+ = 5V, - = GND, C = SCL, D = SDA
const int numSamples = 400; // should be 400 for model 1 and 140 for model 2
int samplesRead = 0;

// Globals, used for compatibility with Arduino-style sketches.
namespace {
tflite::ErrorReporter* error_reporter = nullptr;
const tflite::Model* model = nullptr;
tflite::MicroInterpreter* interpreter = nullptr;
TfLiteTensor* input = nullptr;
TfLiteTensor* output = nullptr;
int inference_count = 0;
tflite::MicroProfiler* profiler = nullptr;

constexpr int kTensorArenaSize = 32*1000;
uint8_t tensor_arena[kTensorArenaSize];
}  // namespace

int main(int argc, char* argv[]) {
  uint32_t failures = 0;
  tflite::InitializeTarget();

  // Set up logging. Google style is to avoid globals or statics because of
  // lifetime uncertainty, but since this has a trivial destructor it's okay.
  // NOLINTNEXTLINE(runtime-global-variables)
  static tflite::MicroErrorReporter micro_error_reporter;
  error_reporter = &micro_error_reporter;

  static tflite::MicroProfiler micro_profiler_reporter;
  profiler = &micro_profiler_reporter;

  // Map the model into a usable data structure. This doesn't involve any
  // copying or parsing, it's a very lightweight operation.
  model = tflite::GetModel(g_model);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    printf("Model provided is schema version %d not equal "
                         "to supported version %d.",
                         model->version(), TFLITE_SCHEMA_VERSION);
    failures++;
  }

  // This pulls in all the operation implementations we need.
  // NOLINTNEXTLINE(runtime-global-variables)
  static tflite::AllOpsResolver resolver;

  // Build an interpreter to run the model with.
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize, error_reporter,profiler);
  interpreter = &static_interpreter;

  // Allocate memory from the tensor_arena for the model's tensors.
  TfLiteStatus allocate_status = interpreter->AllocateTensors();
  if (allocate_status != kTfLiteOk) {
    printf("AllocateTensors() failed");
    failures++;
  }

  // Obtain pointers to the model's input and output tensors.
  input = interpreter->input(0);
  output = interpreter->output(0);

  // Keep track of how many inferences we have performed.
  inference_count = 0;

// Initialise the I2C bus (PB_9 = SDA, PB_8 = SCL in this demo on the Nucleo L476RG)
    I2C i2cBus(PB_9, PB_8);
    i2cBus.frequency(400000);
    BMI160_I2C imu(i2cBus, BMI160_I2C::I2C_ADRS_SDO_LO);

    // Check that the accelerometer and gyroscope power modes are set correctly
    if(imu.setSensorPowerMode(BMI160::GYRO, BMI160::NORMAL) != BMI160::RTN_NO_ERROR) {
        printf("Failed to set gyroscope power mode\n");
        failures++;
    }
    //thread_sleep_for(100);
    if(imu.setSensorPowerMode(BMI160::ACC, BMI160::NORMAL) != BMI160::RTN_NO_ERROR) {
        printf("Failed to set accelerometer power mode\n");
        failures++;
    }

    BMI160::AccConfig accConfig;
    //example of using getSensorConfig
    if(imu.getSensorConfig(accConfig) == BMI160::RTN_NO_ERROR) {
        printf("ACC Range = %d\n", accConfig.range);
        printf("ACC UnderSampling = %d\n", accConfig.us);
        printf("ACC BandWidthParam = %d\n", accConfig.bwp);
        printf("ACC OutputDataRate = %d\n\n", accConfig.odr);
    } else {
        printf("Failed to get accelerometer configuration\n");
        failures++;
    }

    //example of setting user defined configuration
    accConfig.range = BMI160::SENS_4G;
    accConfig.us = BMI160::ACC_US_OFF;
    accConfig.bwp = BMI160::ACC_BWP_2;
    accConfig.odr = BMI160::ACC_ODR_8;
    if(imu.setSensorConfig(accConfig) == BMI160::RTN_NO_ERROR) {
        printf("ACC Range = %d\n", accConfig.range);
        printf("ACC UnderSampling = %d\n", accConfig.us);
        printf("ACC BandWidthParam = %d\n", accConfig.bwp);
        printf("ACC OutputDataRate = %d\n\n", accConfig.odr);
    } else {
        printf("Failed to set accelerometer configuration\n");
        failures++;
    }

    BMI160::GyroConfig gyroConfig;
    if(imu.getSensorConfig(gyroConfig) == BMI160::RTN_NO_ERROR) {
        printf("GYRO Range = %d\n", gyroConfig.range);
        printf("GYRO BandWidthParam = %d\n", gyroConfig.bwp);
        printf("GYRO OutputDataRate = %d\n\n", gyroConfig.odr);
    } else {
        printf("Failed to get gyroscope configuration\n");
        failures++;
    }

    if(failures == 0) {
        BMI160::SensorData accData;
        BMI160::SensorData gyroData;
        BMI160::SensorTime sensorTime;

        while(1) {

            while (samplesRead < numSamples) {
                // Adjust the channel count to match the model: 6 channels for model 1, 1 for model 2.
                imu.getGyroAccXYZandSensorTime(accData, gyroData, sensorTime, accConfig.range, gyroConfig.range);
                input->data.f[samplesRead * 6 + 0] = accData.xAxis.scaled;
                input->data.f[samplesRead * 6 + 1] = accData.yAxis.scaled;
                input->data.f[samplesRead * 6 + 2] = accData.zAxis.scaled;
                input->data.f[samplesRead * 6 + 3] = gyroData.xAxis.scaled * 0.0174533; // deg/s -> rad/s
                input->data.f[samplesRead * 6 + 4] = gyroData.yAxis.scaled * 0.0174533;
                input->data.f[samplesRead * 6 + 5] = gyroData.zAxis.scaled * 0.0174533;
                samplesRead++;
                printf("%d\n", samplesRead);
            }
            samplesRead = 0;

            t.start();
            TfLiteStatus invoke_status = interpreter->Invoke();
            if (invoke_status != kTfLiteOk) {
                TF_LITE_REPORT_ERROR(error_reporter, "Invoke failed");
                while (1);
            }
            t.stop();
            printf("Latency: %f\n", t.read());
            t.reset();
            float out1 = output->data.f[0];
            float out2 = output->data.f[1];
            printf("Output: %f, %f",out1, out2);
            //thread_sleep_for(50);
        }
    } else {
        while(1){
            printf("Fatal Error\n");
        }

    }
}

The code compiles without errors. However, very surprisingly, at runtime I get the following error for both models:

Arena size is too small for all buffers. Needed 4294960256 but only X was available. AllocateTensors() failed. (X varies depending on the arena size.)

This is surely a bug in TensorFlow, because the second model has been shown to run on the exact same hardware via TF Lite Micro: https://github.com/pulp-platform/ecg-tcn.

TF 2.4 does not find EXPAND_DIMS, while TF 2.5 and 2.6 (this repo) complain about an unrealistically large arena size.
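
Worth noting: 4294960256 is not a plausible real requirement. It is exactly what a small negative byte count (-7040) looks like after wrapping around to an unsigned 32-bit size, which suggests the memory planner is being handed an invalid tensor shape rather than a model that genuinely needs ~4 GB. A quick illustrative check (the -7040 is only the value implied by the log, not something read out of the model):

#include <cstdint>
#include <cstdio>

int main() {
  // A small negative byte count, cast to an unsigned 32-bit size, reproduces
  // the number printed by the memory planner.
  int32_t bogus_byte_count = -7040;
  uint32_t wrapped = static_cast<uint32_t>(bogus_byte_count);
  printf("%lu\n", static_cast<unsigned long>(wrapped));  // prints 4294960256
  return 0;
}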

advaitjain commented 3 years ago

Here are some debugging tips that may be useful:
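
One generally useful check is to build with a deliberately oversized arena and, once AllocateTensors() succeeds, ask the interpreter how many bytes it actually needed, then set kTensorArenaSize to that figure plus a small margin. A minimal sketch, reusing the names from the code above (it only helps once allocation can succeed at all; if the oversized arena does not fit on the board, the same check can be run in an x86 TFLM build):

// Debug-only arena sizing: make the arena generous, then report real usage.
constexpr int kTensorArenaSize = 64 * 1024;  // deliberately oversized
uint8_t tensor_arena[kTensorArenaSize];

static tflite::MicroInterpreter static_interpreter(
    model, resolver, tensor_arena, kTensorArenaSize, error_reporter);
interpreter = &static_interpreter;

if (interpreter->AllocateTensors() == kTfLiteOk) {
  // arena_used_bytes() reports what the memory planner actually consumed.
  printf("Arena used: %lu bytes\n",
         static_cast<unsigned long>(interpreter->arena_used_bytes()));
}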

You may also want to raise an issue on https://github.com/pulp-platform/ecg-tcn to find out more details about their use of TFLM.

swapnilsayansaha commented 3 years ago

Some more information (some of which is relevant to running the model on x86):

When I try to load an interpreter object on x86 using tf.lite.Interpreter(args.tflite_model), I get the following error:

tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 90 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 97 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 106 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 113 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 122 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 129 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 138 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 145 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 154 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 161 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 170 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 177 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 186 is invalidly specified in schema.
tensorflow/lite/core/subgraph.cc BytesRequired number of elements overflowed.
Tensor 193 is invalidly specified in schema.

Possible solutions are given here, pointing to problems with input shape on CNNs: https://stackoverflow.com/questions/64882214/runtimeerror-when-calling-allocate-tensors-on-converted-tflite-model and https://stackoverflow.com/questions/63500096/tensorflow-lite-core-subgraph-cc-bytesrequired-number-of-elements-overflowed-no

However, I can't use those solutions: the interpreter won't even load the tflite model, and I am not using the built-in TF CNN layers anyway.

swapnilsayansaha commented 3 years ago

Attached is the tflite model: https://drive.google.com/file/d/1swiQmoH_Rxy-f422eYnrVsDzIwpn-W7N/view?usp=sharing

swapnilsayansaha commented 3 years ago

I think I solved the problem. It has to do with older versions of TF (TF 2.4) not supporting dilation rates greater than 1. I switched the training version of TF to 2.5 and everything seems to be working fine.

martingcavallo commented 3 years ago

Hi @swapnilsayansaha !

Something I want to ask: were you able to convert the second model (https://github.com/pulp-platform/ecg-tcn) using Conv1D, or did you convert the equivalent model using Conv2D as explained in (https://github.com/pulp-platform/ecg-tcn)?

My understanding is that Conv1D is still not supported in tflite-micro, is that right?

Regards, Martin

swapnilsayansaha commented 3 years ago

@martingcavallo https://github.com/tensorflow/tensorflow/issues/39823 https://github.com/tensorflow/tensorflow/pull/28410

It was a bug with dilated convolution support. It got fixed in recent versions of TF. I was able to convert my model to TFLM format without issues in TF 2.5 and later.

martingcavallo commented 3 years ago

OK. So I understand that the Conv1D layer is supported by tflite-micro, is that right?

On the other hand, I want to deploy a model on an Arm Mbed-enabled board too, but I do not know how to proceed after executing the following command:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=cortex_m_generic TARGET_ARCH=cortex-m4 OPTIMIZED_KERNEL_DIR=cmsis_nn microlite

Since June, when tflite-micro was moved to this repository, it seems the deployment process changed a bit, and some tutorials, such as the ones in the O'Reilly TinyML book, seem to be obsolete or deprecated: I cannot find the directories mentioned in those tutorials after executing the make command.

Did you use the same make command? Did you use Mbed Studio? How do you integrate tflite-micro into Mbed Studio?

Sorry for all the questions.

Regards, Martin

swapnilsayansaha commented 3 years ago

https://www.youtube.com/watch?v=gDFWCxrJruQ

Check this out (it's not for Mbed, but the concepts are similar). Basically, you write your own main.cpp file based on the examples given in TFLM (e.g. the hello_world main.cpp). You then copy all the necessary dependencies from your make output folder, as shown in the video, into your Mbed project directory. And yes, you can use Mbed Studio to do it.
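
One related note (a sketch on my side, not something covered in the video): if flash is tight, the AllOpsResolver in the code above can be replaced inside main() with a MicroMutableOpResolver that registers only the operators your converted .tflite actually contains, which keeps unused kernels out of the binary. The Add* calls below are only examples and have to match the ops in your model:

#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// Register only the operators present in the converted model; the template
// argument is the number of registrations. Adjust the list to your .tflite.
static tflite::MicroMutableOpResolver<6> resolver;
resolver.AddConv2D();
resolver.AddFullyConnected();
resolver.AddAdd();
resolver.AddReshape();
resolver.AddMaxPool2D();
resolver.AddExpandDims();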

Also, I used TF 2.5 instead of 2.6 to compile directly for Mbed targets (target arch = mbed); they removed the Mbed target in TF 2.6.