
Universal Inference API #183

Open neil-tan opened 4 years ago

neil-tan commented 4 years ago

Abstract

Individual frameworks such as uTensor and TFLM have their own sets of on-device APIs. In some cases, significant boilerplate code and framework-specific knowledge are required to implement even the simplest inference task. A developer-friendly, universal, high-level inference API would be valuable for on-device ML.

On-device inferencing is generalized into these steps:

- prepare the input data and wrap (or bind) it as a tensor
- pass the tensor to the model/graph
- trigger the inference
- read back the output

The code snippets below illustrate the current API designs for uTensor and TensorFlow. The newly proposed API will likely rely on code generation to create an adaptor layer between the universal interface and the underlying framework-specific APIs; a rough sketch of such an adaptor follows the examples.

Examples:

uTensor:

```c++
Context ctx;  // create the context, the stage where inference takes place
// wrap the input data in a tensor class
Tensor* input_x = new WrappedRamTensor<float>({1, 784}, (float*) input_data);
get_deep_mlp_ctx(ctx, input_x);              // pass the tensor to the context
S_TENSOR pred_tensor = ctx.get("y_pred:0");  // get a reference to the output tensor
ctx.eval();                                  // trigger the inference
```

TFLM: Please refer to this hello-world example
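For illustration only, a generated adaptor over the uTensor example above might look roughly like this; the function name and the copy-out loop are assumptions rather than existing uTensor code:

```c++
// Hypothetical generated adaptor: hides the uTensor-specific calls from the
// example above behind a single framework-agnostic entry point.
void universal_run_deep_mlp(const float* input_data, float* output, size_t output_len) {
  Context ctx;
  Tensor* input_x = new WrappedRamTensor<float>({1, 784}, (float*) input_data);
  get_deep_mlp_ctx(ctx, input_x);            // bind the input tensor to the graph
  S_TENSOR pred = ctx.get("y_pred:0");       // reference to the output tensor
  ctx.eval();                                // run the inference
  for (size_t i = 0; i < output_len; i++) {  // copy results into caller-owned memory
    output[i] = *(pred->read<float>(0, i));
  }
}
```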

Requirements

The newly proposed API should provide a high-level abstraction that accelerates and simplifies application development and helps streamline the edge-ML deployment flow, especially for resource-constrained devices.

The new API should:

Proposals

1. Single-function-call inferencing, by @janjongboom:

```c++
uint8_t utensor_mem_pool[4096]; // <-- CLI should tell me how much I need

utensor_something_autogenerated_init(utensor_mem_pool);

float input[33] = { 1, 2, 3, 4 /* ... */ };
float output[5];

utensor_run_something_autogenerated(input, 33, output, 5);
```


2. Model object, discussion with @sandeepmistry, @mbartling and @neil-tan:

```c++
char input_buffer[512];
int result[1];

MyModel model; // generated
model.setArenaSize(1024);
model.bind_input0(input_buffer, sizeof(input_buffer));
model.bind_prediction0(result, 1);
model.run();

printf("The inference result is: %d", result[0]);
```

This is the most minimal form. The generated bind methods are named after the corresponding tensors in the graph, and their signatures reflect the respective tensor data types. Additional methods can be implemented to support advanced configurations.
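As a rough illustration (the names and types below are assumptions, not something the current tooling emits), the generated class declaration for such a model might look like this:

```c++
// Hypothetical generated header for "MyModel": one bind method per graph
// tensor, with the buffer type reflecting that tensor's data type.
class MyModel {
public:
  void setArenaSize(size_t bytes);                // scratch memory for inference
  void bind_input0(char* buffer, size_t len);     // e.g. a quantized input tensor
  void bind_prediction0(int* buffer, size_t len); // e.g. an integer prediction tensor
  void run();                                     // execute the graph
};
```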

What’s Next

This issue serves as a starting point for the discussion. It will be reviewed by uTensor core devs, Arduino, ISG data scientists, IPG engineers, and Google. We are particularly interested in use cases that the currently proposed API cannot cover. We are looking to iterate and converge on a design in the coming weeks.

neil-tan commented 4 years ago

```c++
char input_buffer[512];
ExampleTensorObject* input_tensor_obj;
int result[1];

MyModel model; // generated
model.setArenaSize(1024);
model.bind_input0(input_buffer, shape, type);
model.bind_input1(input_tensor_obj);
model.bind_prediction0(result, 1);
model.run();

printf("The inference result is: %d", result[0]);
```

mbartling commented 4 years ago

I think the metadata memory allocator should be fixed in size at model construction, but I am OK with the data scratchpad being on the heap.

Might look something like this:

```c++
MyModel<MetaDataSize> model;
model.setTensorDataMemSize(ScratchPadSize);
```

mbartling commented 4 years ago

```c++
template<size_t MetaDataSize=2048>
class MyModel {
private:
  FixedTensorArenaAllocator<MetaDataSize> defaultMetaDataAllocator;
  DynamicTensorArenaAllocator defaultTensorDataAllocator;
  // ...
};
```
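Putting the two comments together, usage might look roughly like this (the sizes are arbitrary and the method name follows the sketch above):

```c++
// The metadata arena is fixed at compile time via the template parameter,
// while the tensor-data scratchpad is sized at run time and may live on the heap.
MyModel<4096> model;                    // 4 KB metadata arena, fixed at construction
model.setTensorDataMemSize(16 * 1024);  // 16 KB tensor-data scratchpad
```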
neil-tan commented 4 years ago

We should be able to adapt the following draft to the re-arch without problems.

```c++
template<size_t MetaDataSize=2048>
class MyModel {
private:
  //FixedTensorArenaAllocator<MetaDataSize> defaultMetaDataAllocator;
  //DynamicTensorArenaAllocator defaultTensorDataAllocator;

  Context& ctx;

public:

  //auto generated
  struct {
    Tensor* tensor0 = nullptr;
    Tensor* tensor1 = nullptr;
    Tensor* tensor2 = nullptr;
  } tensors;

  void run(void);

};

template<typename T>
void copy_tensor(S_TENSOR& tensor_src, Tensor* tensor_dst) {
  for(size_t i = 0; i < tensor_src->getSize(); i++) {
    *(tensor_dst->write<T>(0, i)) = *(tensor_src->read<T>(0, i));
  }
}

//auto generated
template<size_t MetaDataSize>
void MyModel<MetaDataSize>::run(void) {
    //the allocator re-uses the space in the input and output tensors -> allow modify
    get_deep_mlp_ctx(ctx, tensors.tensor0, tensors.tensor1);

    ctx.eval();

    S_TENSOR result = ctx.get("tensor2");
    //copy the tensor out, as the application should own the output memory
    copy_tensor<int>(result, tensors.tensor2);

    ctx.gc();
}

// Example

char input_buffer[512];
ExampleTensorObject* input_tensor_obj; //a class with the Tensor interface
int result[1];

MyModel<> model; //generated
model.tensors.tensor0 = new RamTensor({10, 10}, i8);
model.tensors.tensor1 = new WrappedRamTensor({10, 10}, input_buffer, i8);
model.tensors.tensor2 = new RamTensor({10, 10}, result, i32);  //output
model.run();

printf("%d\n", result[0]);

//do something with input_buffer
model.run();
printf("%d\n", result[0]);
```

@mbartling Thoughts? One issue I have is that tensors cannot be created before the model, unless we want to explicitly instantiate the context and allocators. And what would be a good way to keep the input/output tensors alive? Maybe create a utility tensor-factory class for initializing the context and alloc classes (a rough sketch follows below)? The purpose of the tensor factory is mainly syntactic sugar, making things more approachable for the hobbyist community.

@dboyliao visibility for code-gen @Knight-X
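One possible shape for that factory, purely as a sketch (TensorFactory, DType, and the method names are hypothetical):

```c++
// Hypothetical tensor factory: owns the context/allocators internally so that
// input/output tensors can be created up front and kept alive across runs.
class TensorFactory {
public:
  Tensor* ram(std::initializer_list<int> shape, DType dtype);                // plain RAM tensor
  Tensor* wrap(std::initializer_list<int> shape, void* buffer, DType dtype); // wraps user-owned memory
};

TensorFactory factory;
model.tensors.tensor1 = factory.wrap({10, 10}, input_buffer, i8);
```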

mbartling commented 4 years ago

Just as an FYI, my brain is totally dedicated to the re-arch right now, so I might be misreading your concerns.

The primary issue here is where the meta-data allocator and the RAM data allocator live, and whether they are separate entities at all.

> Maybe create a utility tensor-factory class for initializing the context and alloc classes?

This is the job of the model class, either at construction or at model run.

> And, what would be a good way to keep the input/output tensors alive?

Honestly, I am in favor of the user requesting references to input/output lists contained by the model itself. This way we are less exposed to the user providing invalid input tensors. I imagine input tensors would be a fixed kind of tensor specialization (or tensor handle) that can provide some compile-time guarantees.
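One way that could look, purely as a sketch (the handle type and the accessor below are hypothetical):

```c++
// Hypothetical typed tensor handle: the model owns the tensor, and the handle's
// template parameters encode element type and size, so a wrongly typed or
// wrongly sized write is rejected at compile time.
template <typename T, size_t N>
class TensorHandle {
public:
  void write(const T (&data)[N]);  // only accepts a T array of exactly N elements
};

// The model hands out references to its own input/output handles.
TensorHandle<float, 784>& x = model.input0();
float image[784] = { /* pixel data */ };
x.write(image);
```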