openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

Running custom models on up xtreme + MYRIAD VPU #10085

Closed · mszuyx closed this issue 2 years ago

mszuyx commented 2 years ago

Hi,

First of all, great project! I'm a big fan!

I am currently working on deploying a custom model on a single-board computer called UP Xtreme: https://up-board.org/up-xtreme/ (the i7 version). The model itself can be converted from ONNX to Intel IR and executed via the OpenVINO runtime on our desktop.

Our questions are:

Thanks! : )

Iffa-Intel commented 2 years ago

Hi @mszuyx ,

Generally, VPU/MYRIAD/NCS2 could work as long as you fulfill the OpenVINO system requirements (software and hardware) and the model and its topology are officially supported. The VPU plugin should then work as expected.

However, bear in mind that the model IR must be in FP16 format in order to run inference on the VPU. For testing purposes, you can convert the model into FP16 IR using the Model Optimizer by passing the parameter `--data_type {FP16,FP32,half,float}`. For example: `python3 mo_tf.py --input_model alexnet.pb --output_dir <OUTPUT_DIR> --data_type FP16`

Next, infer it with the Benchmark Tool using the `-d MYRIAD` parameter: `python3 benchmark_app.py -m <MODEL_XML> -i <INPUT> -d MYRIAD`
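As a minimal sketch, loading the converted FP16 IR on MYRIAD from the Python API looks like this (the file names are placeholders, and this assumes the 2021.x IECore API):

```python
import numpy as np
from openvino.inference_engine import IECore  # OpenVINO 2021.x Python API

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # FP16 IR
exec_net = ie.load_network(network=net, device_name="MYRIAD")

# Feed a random tensor of the model's input shape, just to verify inference runs.
input_name = next(iter(net.input_info))
shape = net.input_info[input_name].input_data.shape
result = exec_net.infer({input_name: np.random.rand(*shape).astype(np.float32)})
```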

Another option is to test your model on Intel DevCloud for the Edge.

The computing power of a VPU is not on par with a CPU, but in your use case this device would definitely help offload some of the burden from the CPU. You can also consider the OpenVINO Multi-Device plugin, which automatically assigns inference requests to the available computational devices and executes the requests in parallel. This causes the devices to share the inference burden and results in more consistent performance.
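From the Python API this is just a different device name. A minimal sketch (the device priority list and request count are examples):

```python
from openvino.inference_engine import IECore  # OpenVINO 2021.x Python API

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
# MULTI distributes the inference requests across MYRIAD and CPU in parallel.
exec_net = ie.load_network(network=net, device_name="MULTI:MYRIAD,CPU", num_requests=4)
```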

Meanwhile, for deployment purposes, you can use the OpenVINO Deployment Manager. It creates a runtime package (deployment package) for your target device by assembling the model IR files, your application, and the associated dependencies.
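As a rough sketch, the tool can be invoked like this (the install path, target name, and folders are illustrative and vary between OpenVINO versions; check `deployment_manager.py --help`):

```
python3 <INSTALL_DIR>/deployment_tools/tools/deployment_manager/deployment_manager.py \
    --targets vpu \
    --user_data /path/to/your/app/and/model \
    --output_dir ./deployment_package
```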

mszuyx commented 2 years ago

@Iffa-Intel Thank you! These are all very helpful!

So the takeaway messages are (correct me if I got them wrong):

The other thing I noticed is that if I select GPU as the inference device (I am on Ubuntu 20.04 with the Python 3.8 API), it seems to take a much longer time for the model to initialize. I am guessing the CPU is busy loading the model into GPU memory, because CPU usage during this period is almost 100%. Is this expected?

Also, I noticed that if I build a naive benchmark (a for loop measuring the average inference FPS), the result is inconsistent on the single-board computer: the larger the iteration count I choose, the lower the FPS. When I test with 30 iterations the FPS is 44, but setting it to 120 will sometimes give me 20 Hz. This is not an issue on the desktop, which has better single-core processing power and far better multi-core power. I suspect that the longer the test runs, the more likely the CPU is to get distracted by other tasks. Would changing my code from the Python API to C++ give the model higher priority, or is this theory simply not true? My loop looks roughly like the sketch below.
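(Simplified sketch of the benchmark loop; the model files, device name, and iteration count are placeholders, and the warm-up runs are just there to keep one-time costs out of the average.)

```python
import time
import numpy as np
from openvino.inference_engine import IECore  # OpenVINO 2021.x Python API

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
shape = net.input_info[input_name].input_data.shape
dummy = np.random.rand(*shape).astype(np.float32)

# A few warm-up runs so one-time costs don't skew the average.
for _ in range(5):
    exec_net.infer({input_name: dummy})

iters = 120
start = time.perf_counter()
for _ in range(iters):
    exec_net.infer({input_name: dummy})
fps = iters / (time.perf_counter() - start)
print(f"average FPS over {iters} iterations: {fps:.1f}")
```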

Thanks!

brmarkus commented 2 years ago

> [...] The other thing I noticed is that if I select GPU as the inference device (I am on Ubuntu 20.04 with the Python 3.8 API), it seems to take a much longer time for the model to initialize (I am guessing the CPU is busy loading the model into GPU memory, because CPU usage during this period is almost 100%). Is this expected? [...]

The delay occurs because, while loading/importing the model into the plugin, OpenVINO compiles OpenCL kernels for the contained tensors and operations on the GPU (and some VPUs).

By setting an environment variable such as export cl_cache_dir=/my/path/to/a/folder/, those compiled OpenCL kernels can be stored (temporarily or permanently) and loaded the next time instead of being compiled again and again.
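A minimal sketch of what this could look like from Python (the cache path is an example, and the variable has to be set before the GPU plugin creates its OpenCL context):

```python
import os

# Point the OpenCL compute runtime at a kernel cache directory.
# This must happen before the GPU plugin is initialized.
os.environ["cl_cache_dir"] = "/my/path/to/a/folder"

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
# The first load compiles the OpenCL kernels and fills the cache;
# later loads should pick up the cached binaries and start much faster.
exec_net = ie.load_network(network=net, device_name="GPU")
```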

mszuyx commented 2 years ago

> The delay occurs because, while loading/importing the model into the plugin, OpenVINO compiles OpenCL kernels for the contained tensors and operations on the GPU (and some VPUs).
>
> By setting an environment variable such as export cl_cache_dir=/my/path/to/a/folder/, those compiled OpenCL kernels can be stored (temporarily or permanently) and loaded the next time instead of being compiled again and again.

Sounds great! Is there somewhere I can read more about it?

Yu

mszuyx commented 2 years ago

I found some info here:

https://github.com/intel/compute-runtime/blob/master/opencl/doc/FAQ.md

I think the loading is much faster now!

Thx