openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

Running custom models on up xtreme + MYRIAD VPU #10085

Closed · mszuyx closed this issue 2 years ago

mszuyx commented 2 years ago

Hi,

First of all, great project! I'm a big fan!

I am currently working on deploying a custom model on a single-board computer called UP Xtreme: https://up-board.org/up-xtreme/ (the i7 version). The model itself can be converted from ONNX to Intel IR and executed via the OpenVINO runtime on our desktop.

Our questions are:

Thanks! : )

Iffa-Intel commented 2 years ago

Hi @mszuyx ,

Generally, VPU/MYRIAD/NCS2 could work as long as you fulfill the OpenVINO system requirements (software and hardware) and the model and its topology are officially supported. The VPU plugin should then work as expected.

However, bear in mind that the model IR must be in FP16 format in order to run inference on the VPU. For testing purposes, you can convert the model into FP16 IR using the Model Optimizer by passing the parameter `--data_type {FP16,FP32,half,float}`. For example: `python3 mo_tf.py --input_model alexnet.pb --output_dir <OUTPUT_DIR> --data_type FP16`

Next, infer it with the Benchmark Tool using the `-d MYRIAD` parameter: `python3 benchmark_app.py -m <MODEL_XML> -i <INPUT> -d MYRIAD`
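As a minimal sketch, loading the converted FP16 IR on MYRIAD from the Python API looks like this (the file names are placeholders, and this assumes the 2021.x IECore API):

```python
import numpy as np
from openvino.inference_engine import IECore  # OpenVINO 2021.x Python API

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # FP16 IR
exec_net = ie.load_network(network=net, device_name="MYRIAD")

# Feed a random tensor of the model's input shape, just to verify inference runs.
input_name = next(iter(net.input_info))
shape = net.input_info[input_name].input_data.shape
result = exec_net.infer({input_name: np.random.rand(*shape).astype(np.float32)})
```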

Another option is to test your model on Intel DevCloud for the Edge.

The computing power of a VPU is not on par with a CPU, but in your use case this device would definitely help offload some of the burden from the CPU. You can also consider the OpenVINO Multi-Device plugin, which automatically assigns inference requests to the available computational devices and executes the requests in parallel. This causes the devices to share the inference burden and results in more consistent performance.
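From the Python API this is just a different device name. A minimal sketch (the device priority list and request count are examples):

```python
from openvino.inference_engine import IECore  # OpenVINO 2021.x Python API

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
# MULTI distributes the inference requests across MYRIAD and CPU in parallel.
exec_net = ie.load_network(network=net, device_name="MULTI:MYRIAD,CPU", num_requests=4)
```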

Meanwhile, for deployment purposes, you can use the OpenVINO Deployment Manager. It creates a runtime package (deployment package) for your target device by assembling the model IR files, your application, and the associated dependencies.
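As a rough sketch, the tool can be invoked like this (the install path, target name, and folders are illustrative and vary between OpenVINO versions; check `deployment_manager.py --help`):

```
python3 <INSTALL_DIR>/deployment_tools/tools/deployment_manager/deployment_manager.py \
    --targets vpu \
    --user_data /path/to/your/app/and/model \
    --output_dir ./deployment_package
```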

mszuyx commented 2 years ago

@Iffa-Intel Thank you! These are all very helpful!

So the takeaway messages are (correct me if I got them wrong):

The other thing I noticed is that if I select GPU as the inference device (I am on Ubuntu 20.04 with the Python 3.8 API), it seems to take a much longer time for the model to initialize. I am guessing the CPU is busy loading the model into GPU memory, because CPU usage during this period is almost 100%. Is this expected?

Also, I noticed that if I build a naive benchmark (a for loop measuring the average inference FPS), the result is inconsistent on the single-board computer: the larger the iteration count I choose, the lower the FPS. When I test with 30 iterations the FPS is 44, but setting it to 120 will sometimes give me 20 Hz. This is not an issue on the desktop, which has better single-core processing power and far better multi-core power. I suspect that the longer the test runs, the more likely the CPU is to get distracted by other tasks. Would changing my code from the Python API to C++ give the model higher priority, or is this theory simply not true? My loop looks roughly like the sketch below.
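(Simplified sketch of the benchmark loop; the model files, device name, and iteration count are placeholders, and the warm-up runs are just there to keep one-time costs out of the average.)

```python
import time
import numpy as np
from openvino.inference_engine import IECore  # OpenVINO 2021.x Python API

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
shape = net.input_info[input_name].input_data.shape
dummy = np.random.rand(*shape).astype(np.float32)

# A few warm-up runs so one-time costs don't skew the average.
for _ in range(5):
    exec_net.infer({input_name: dummy})

iters = 120
start = time.perf_counter()
for _ in range(iters):
    exec_net.infer({input_name: dummy})
fps = iters / (time.perf_counter() - start)
print(f"average FPS over {iters} iterations: {fps:.1f}")
```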

Thanks!

brmarkus commented 2 years ago

> [...] The other thing I noticed is that if I select GPU as the inference device (I am on Ubuntu 20.04 with the Python 3.8 API), it seems to take a much longer time for the model to initialize (I am guessing the CPU is busy loading the model into GPU memory, because CPU usage during this period is almost 100%). Is this expected? [...]

The delay occurs because, while loading/importing the model into the plugin, OpenVINO compiles OpenCL kernels for the contained tensors and operations on the GPU (and some VPUs).

By setting an environment variable such as export cl_cache_dir=/my/path/to/a/folder/, those compiled OpenCL kernels can be stored (temporarily or permanently) and loaded the next time instead of being compiled again and again.
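A minimal sketch of what this could look like from Python (the cache path is an example, and the variable has to be set before the GPU plugin creates its OpenCL context):

```python
import os

# Point the OpenCL compute runtime at a kernel cache directory.
# This must happen before the GPU plugin is initialized.
os.environ["cl_cache_dir"] = "/my/path/to/a/folder"

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
# The first load compiles the OpenCL kernels and fills the cache;
# later loads should pick up the cached binaries and start much faster.
exec_net = ie.load_network(network=net, device_name="GPU")
```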

mszuyx commented 2 years ago

> The delay occurs because, while loading/importing the model into the plugin, OpenVINO compiles OpenCL kernels for the contained tensors and operations on the GPU (and some VPUs).
>
> By setting an environment variable such as export cl_cache_dir=/my/path/to/a/folder/, those compiled OpenCL kernels can be stored (temporarily or permanently) and loaded the next time instead of being compiled again and again.

Sounds great! Is there somewhere I can read more about it?

Yu

mszuyx commented 2 years ago

I found some info here:

https://github.com/intel/compute-runtime/blob/master/opencl/doc/FAQ.md

I think the loading is much faster now!

Thx