
Universal Inference API #183

Open neil-tan opened 4 years ago

neil-tan commented 4 years ago

Abstract

Individual frameworks such as uTensor and TFLM have their own sets of on-device APIs. In some cases, significant boilerplate code and framework-specific knowledge are required to implement even the simplest inference task. A developer-friendly, universal, high-level inference API would be valuable for on-device ML.

On-device inferencing is generalized into these steps:

- prepare the input data and wrap (or bind) it as a tensor
- pass the tensor to the model/graph
- trigger the inference
- read back the output

The code snippets below illustrate the current API designs for uTensor and TensorFlow. The newly proposed API will likely rely on code generation to create an adaptor layer between the universal interface and the underlying framework-specific APIs; a rough sketch of such an adaptor follows the examples.

Examples:

uTensor:

```c++
Context ctx;  // create the context, the stage where inference takes place
// wrap the input data in a tensor class
Tensor* input_x = new WrappedRamTensor<float>({1, 784}, (float*) input_data);
get_deep_mlp_ctx(ctx, input_x);              // pass the tensor to the context
S_TENSOR pred_tensor = ctx.get("y_pred:0");  // get a reference to the output tensor
ctx.eval();                                  // trigger the inference
```

TFLM: Please refer to this hello-world example
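For illustration only, a generated adaptor over the uTensor example above might look roughly like this; the function name and the copy-out loop are assumptions rather than existing uTensor code:

```c++
// Hypothetical generated adaptor: hides the uTensor-specific calls from the
// example above behind a single framework-agnostic entry point.
void universal_run_deep_mlp(const float* input_data, float* output, size_t output_len) {
  Context ctx;
  Tensor* input_x = new WrappedRamTensor<float>({1, 784}, (float*) input_data);
  get_deep_mlp_ctx(ctx, input_x);            // bind the input tensor to the graph
  S_TENSOR pred = ctx.get("y_pred:0");       // reference to the output tensor
  ctx.eval();                                // run the inference
  for (size_t i = 0; i < output_len; i++) {  // copy results into caller-owned memory
    output[i] = *(pred->read<float>(0, i));
  }
}
```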

Requirements

The newly proposed API should provide a high-level abstraction that accelerates and simplifies application development and helps streamline the edge-ML deployment flow, especially for resource-constrained devices.

The new API should:

Proposals

1. Single-function-call inferencing, by @janjongboom:

```c++
uint8_t utensor_mem_pool[4096]; // <-- CLI should tell me how much I need

utensor_something_autogenerated_init(utensor_mem_pool);

float input[33] = { 1, 2, 3, 4 /* ... */ };
float output[5];

utensor_run_something_autogenerated(input, 33, output, 5);
```


2. Model object, discussion with @sandeepmistry, @mbartling and @neil-tan:

```c++
char input_buffer[512];
int result[1];

MyModel model; // generated
model.setArenaSize(1024);
model.bind_input0(input_buffer, sizeof(input_buffer));
model.bind_prediction0(result, 1);
model.run();

printf("The inference result is: %d", result[0]);
```

This is the most minimal form. The generated bind methods are named after the corresponding tensors in the graph, and their signatures reflect the respective tensor data types. Additional methods can be implemented to support advanced configurations.
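As a rough illustration (the names and types below are assumptions, not something the current tooling emits), the generated class declaration for such a model might look like this:

```c++
// Hypothetical generated header for "MyModel": one bind method per graph
// tensor, with the buffer type reflecting that tensor's data type.
class MyModel {
public:
  void setArenaSize(size_t bytes);                // scratch memory for inference
  void bind_input0(char* buffer, size_t len);     // e.g. a quantized input tensor
  void bind_prediction0(int* buffer, size_t len); // e.g. an integer prediction tensor
  void run();                                     // execute the graph
};
```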

What’s Next

This issue serves as a starting point for the discussion. It will be reviewed by uTensor core devs, Arduino, ISG data scientists, IPG engineers, and Google. We are particularly interested in use cases that the currently proposed API cannot cover. We are looking to iterate and converge on a design in the coming weeks.

neil-tan commented 4 years ago

```c++
char input_buffer[512];
ExampleTensorObject* input_tensor_obj;
int result[1];

MyModel model; // generated
model.setArenaSize(1024);
model.bind_input0(input_buffer, shape, type);
model.bind_input1(input_tensor_obj);
model.bind_prediction0(result, 1);
model.run();

printf("The inference result is: %d", result[0]);
```

mbartling commented 4 years ago

I think the metadata memory allocator should be fixed in size at model construction, but I am OK with the data scratchpad being on the heap.

Might look something like this:

```c++
MyModel<MetaDataSize> model;
model.setTensorDataMemSize(ScratchPadSize);
```

mbartling commented 4 years ago

```c++
template<size_t MetaDataSize=2048>
class MyModel {
private:
  FixedTensorArenaAllocator<MetaDataSize> defaultMetaDataAllocator;
  DynamicTensorArenaAllocator defaultTensorDataAllocator;
  // ...
};
```
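Putting the two comments together, usage might look roughly like this (the sizes are arbitrary and the method name follows the sketch above):

```c++
// The metadata arena is fixed at compile time via the template parameter,
// while the tensor-data scratchpad is sized at run time and may live on the heap.
MyModel<4096> model;                    // 4 KB metadata arena, fixed at construction
model.setTensorDataMemSize(16 * 1024);  // 16 KB tensor-data scratchpad
```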
neil-tan commented 4 years ago

We should be able to adapt the following draft to the re-arch without problems.

```c++
template<size_t MetaDataSize=2048>
class MyModel {
private:
  //FixedTensorArenaAllocator<MetaDataSize> defaultMetaDataAllocator;
  //DynamicTensorArenaAllocator defaultTensorDataAllocator;

  Context& ctx;

public:

  //auto generated
  struct {
    Tensor* tensor0 = nullptr;
    Tensor* tensor1 = nullptr;
    Tensor* tensor2 = nullptr;
  } tensors;

  void run(void);

};

template<typename T>
void copy_tensor(S_TENSOR& tensor_src, Tensor* tensor_dst) {
  for(size_t i = 0; i < tensor_src->getSize(); i++) {
    *(tensor_dst->write<T>(0, i)) = *(tensor_src->read<T>(0, i));
  }
}

//auto generated
template<size_t MetaDataSize>
void MyModel<MetaDataSize>::run(void) {
    //the allocator re-uses the space in the input and output tensors -> allow modify
    get_deep_mlp_ctx(ctx, tensors.tensor0, tensors.tensor1);

    ctx.eval();

    S_TENSOR result = ctx.get("tensor2");
    //copy the tensor out, as the application should own the output memory
    copy_tensor<int>(result, tensors.tensor2);

    ctx.gc();
}

// Example

char input_buffer[512];
ExampleTensorObject* input_tensor_obj; //a class with the Tensor interface
int result[1];

MyModel<> model; //generated
model.tensors.tensor0 = new RamTensor({10, 10}, i8);
model.tensors.tensor1 = new WrappedRamTensor({10, 10}, input_buffer, i8);
model.tensors.tensor2 = new RamTensor({10, 10}, result, i32);  //output
model.run();

printf("%d\n", result[0]);

//do something with input_buffer
model.run();
printf("%d\n", result[0]);
```

@mbartling Thoughts? One issue I have is that tensors cannot be created before the model, unless we want to explicitly instantiate the context and allocators. And what would be a good way to keep the input/output tensors alive? Maybe create a utility tensor-factory class for initializing the context and alloc classes (a rough sketch follows below)? The purpose of the tensor factory is mainly syntactic sugar, making things more approachable for the hobbyist community.

@dboyliao visibility for code-gen @Knight-X
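One possible shape for that factory, purely as a sketch (TensorFactory, DType, and the method names are hypothetical):

```c++
// Hypothetical tensor factory: owns the context/allocators internally so that
// input/output tensors can be created up front and kept alive across runs.
class TensorFactory {
public:
  Tensor* ram(std::initializer_list<int> shape, DType dtype);                // plain RAM tensor
  Tensor* wrap(std::initializer_list<int> shape, void* buffer, DType dtype); // wraps user-owned memory
};

TensorFactory factory;
model.tensors.tensor1 = factory.wrap({10, 10}, input_buffer, i8);
```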

mbartling commented 4 years ago

Just as an FYI, my brain is totally dedicated to the re-arch right now, so I might be misreading your concerns.

The primary issue here is where the meta-data allocator and the RAM data allocator live, and whether they are separate entities at all.

> Maybe create a utility tensor-factory class for initializing the context and alloc classes?

This is the job of the model class, either at construction or at model run.

> And, what would be a good way to keep the input/output tensors alive?

Honestly, I am in favor of the user requesting references to input/output lists contained by the model itself. This way we are less exposed to the user providing invalid input tensors. I imagine input tensors would be a fixed kind of tensor specialization (or tensor handle) that can provide some compile-time guarantees.
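One way that could look, purely as a sketch (the handle type and the accessor below are hypothetical):

```c++
// Hypothetical typed tensor handle: the model owns the tensor, and the handle's
// template parameters encode element type and size, so a wrongly typed or
// wrongly sized write is rejected at compile time.
template <typename T, size_t N>
class TensorHandle {
public:
  void write(const T (&data)[N]);  // only accepts a T array of exactly N elements
};

// The model hands out references to its own input/output handles.
TensorHandle<float, 784>& x = model.input0();
float image[784] = { /* pixel data */ };
x.write(image);
```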