ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

How to modify Detect layer to allow for converting yolov5 to Qualcomm's SNPE format? #4790

Closed evdoks closed 2 years ago

evdoks commented 3 years ago

❔Question

I am trying to convert a trained yolov5s model to SNPE format in order to run it on a Snapdragon chip. Unfortunately, Qualcomm's ONNX-to-SNPE converter fails on the Detect layer with the following error message:

ValueError: Unable to permute shape [1, 3, 64, 64, 2] to NSC ordering
2021-09-14 15:15:37,327 - 183 - ERROR - Node Mul_268: Unable to permute shape [1, 3, 64, 64, 2] to NSC ordering

I imagine it may have something to do with the fact that SNPE currently supports only 4D input data, where the first dimension is the batch (see the SNPE docs), while the yolov5 Detect layer performs a 5D reshape.

Would it be possible to modify the Detect layer so that no 5D reshape is performed?

JISHNUSHAJI commented 2 years ago

I have converted the yolov5 model to dlc; now I have to do the 5d reshape and other post-processing outside the model. Could someone share the code for post-processing from the 5d reshape onwards?

eeyzl5 commented 2 years ago

@JISHNUSHAJI @glenn-jocher @fwzdev1

Hi all, just to share my recent exploration of running yolov5 with SNPE.

I am using SNPE v1.62 and yolov5 release v6.1. My task is to detect a custom class of very small objects, typically 10x10 pixels. The model I chose was yolov5s with the default 640x640 input size, but I think other models are also compatible.

Least Modification

Since the main issue with running yolov5 on SNPE is the unsupported 5d reshape operation, simply changing the 5d reshape to 4d solves the problem. For example, the detection head using a 1x3x85x20x20 reshape is rejected by SNPE, but is accepted after changing it to a 3x85x20x20 reshape. In short, just eliminate the batch dimension.

The modification in the Detect() module in models/yolo.py:

In forward(), simply delete bs and change the permute indices:

# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()  # original
x[i] = x[i].view(self.na, self.no, ny, nx).permute(0, 2, 3, 1).contiguous()  # modified

In _make_grid(), also delete the batch dimension from all 5d tensors:

# grid = torch.stack((xv, yv), 2).expand((1, self.na, ny, nx, 2)).float()  #original
grid = torch.stack((xv, yv), 2).expand((self.na, ny, nx, 2)).float()  #modified
# anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((1, self.na, 1, 1, 2)).expand((1, self.na, ny, nx, 2)).float()    #original 
anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((self.na, 1, 1, 2)).expand((self.na, ny, nx, 2)).float()  #modified

No other modification is needed: directly convert the original pt model to onnx and then to dlc, without needing to specify the out_nodes. SNPE is able to compute the entire network, including the operations inside the Detect() layer, so there is no need to reimplement the detection part outside the model. Just apply confidence thresholding and nms and you get the bounding boxes.
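
For reference, the conversion might look roughly like this (a sketch; the export flags follow the v6.1 export.py script and the snpe-onnx-to-dlc options follow the SNPE 1.x docs, so check both against your installed versions; file names are placeholders):

# export the modified model to ONNX (yolov5 v6.1 export script)
python export.py --weights yolov5s.pt --include onnx --opset 11

# convert ONNX to DLC with Qualcomm's SNPE tools
snpe-onnx-to-dlc --input_network yolov5s.onnx --output_path yolov5s.dlc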

Both CPU and DSP runtimes can execute this network without raising any error. However, it will only give you correct output on the CPU. Precision is affected significantly by 8-bit quantization when using the DSP, mainly because of the operations in the Detect() layer as far as I can tell.

Running with DSP

If you just need to run on the default CPU then the above solution may be the simplest one. But I believe most of us choose SNPE for the acceleration by the DSP/AIP, so reimplementing the detection part is unavoidable.

The modification in the Detect() module is mainly to comment out this code:

# if self.inplace:
#     y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
#     xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
#     y = torch.cat((xy, wh, y[..., 4:5]), -1)

Also remember to include the change of 5d reshape to 4d discussed above.

The final output shape is still consistent with the original model, a 3d tensor (effectively 2d): 1x25200x85 for the default yolov5s model (25200 = 3 anchors x (80x80 + 40x40 + 20x20) grid cells). This output tensor can be obtained by the DSP/AIP runtime with acceleration but no large precision drop. Then we use the CPU to parse this output by performing exactly the same operations that we commented out.

Since the output from SNPE is always flattened to 1d, a single for loop is enough to do the parsing. Example code is shown below, written in Java but easy to port to C++ etc.

float conf = 0.3;  // objectness threshold
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
// flattened 1x25200x85 output: every 85 values are [cx, cy, w, h, obj, 80 class scores]
for (int c=4;c<values.length;c+=85) {
    if (values[c]>=conf) {  // values[c] is the objectness score
        float cx = values[c-4];
        float cy = values[c-3];
        float w = values[c-2];
        float h = values[c-1];
        int gridX, gridY;
        int anchor_gridX, anchor_gridY;
        int[] anchorX = {10,16,33,30,62,59,116,156,373};  // anchor widths, 3 per head
        int[] anchorY = {13,30,23,61,45,119,90,198,326};  // anchor heights, 3 per head
        int[] num_filters = {19200,4800,1200};  // predictions per head: 3*80*80, 3*40*40, 3*20*20
        int[] filter_size = {80,40,20};  // grid size of each head
        int stride;
        int ci = (int)(c/85);  // prediction index
        if (ci<num_filters[0]) {  // stride-8 head
            gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
            gridY = (int)((ci%(filter_size[0]*filter_size[0]))/filter_size[0]);
            anchor_gridX = anchorX[((int)(ci/(filter_size[0]*filter_size[0])))];
            anchor_gridY = anchorY[((int)(ci/(filter_size[0]*filter_size[0])))];
            stride = 8;
        } else if (ci>=num_filters[0]&&ci<num_filters[0]+num_filters[1]) {  // stride-16 head
            gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1];
            gridY = (int)(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1]);
            anchor_gridX = anchorX[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
            anchor_gridY = anchorY[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
            stride = 16;
        } else {  // stride-32 head
            gridX = ((ci-num_filters[1])%(filter_size[2]*filter_size[2]))%filter_size[2];
            gridY = (int)(((ci-num_filters[1])%(filter_size[2]*filter_size[2]))/filter_size[2]);
            anchor_gridX = anchorX[(int)((ci-num_filters[1])/(filter_size[2]*filter_size[2]))+6];
            anchor_gridY = anchorY[(int)((ci-num_filters[1])/(filter_size[2]*filter_size[2]))+6];
            stride = 32;
        }
        // same decoding as the commented-out Detect() code
        cx = (float)(cx*2-0.5+gridX)*stride;
        cy = (float)(cy*2-0.5+gridY)*stride;
        w = w*2*w*2*anchor_gridX;  // (w*2)^2 * anchor width
        h = h*2*h*2*anchor_gridY;  // (h*2)^2 * anchor height
        float left = cx-w/2;  // xywh -> xyxy
        float top = cy-h/2;
        float right = cx+w/2;
        float bottom = cy+h/2;
        float obj_conf = values[c];
    }
}

The locations of the bounding boxes are represented by left, top, right, bottom and the confidence by obj_conf; these can be passed to an nms function to get clean boxes. The parsing of class confidence and class index is not shown here because it is not relevant to my task, but it could easily be extracted with max and argmax over the class scores.
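
If you need the classes too, a minimal sketch in the same style (assuming the standard 80-class layout, so the class scores sit at indices c+1 .. c+80 of each 85-value block; adjust both numbers for a custom class count):

// inside the if-block above, after reading obj_conf
int bestClass = -1;
float bestScore = 0f;
for (int k = 1; k <= 80; k++) {      // scan the 80 class scores
    if (values[c + k] > bestScore) {
        bestScore = values[c + k];
        bestClass = k - 1;           // 0-based class index
    }
}
float cls_conf = obj_conf * bestScore;  // combined confidence for NMS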

In my Android app, running the net on the DSP takes about 120ms on a Snapdragon 870 platform compared to 500+ms on the CPU, and the accuracy is nearly the same. However, this is still a bit slow for real-time tasks, probably because I was using the SNPE Java SDK instead of C++. Further optimization can still be made to achieve faster speeds.

Optimization for SNPE

Looking into the recently released yolov5 models, the activation layer used after each convolution is nn.SiLU(). However, neither onnx nor SNPE supports the SiLU activation directly; it is split into separate Sigmoid and Multiplication operations. This means that SNPE currently does not optimize the execution of SiLU layers, which noticeably slows down the execution of the whole network; there are 50+ activation layers in the yolov5s model!

Simply change the SiLU activations to the commonly used LeakyReLU, which is optimized by SNPE, by modifying the base Conv() module in models/common.py:

# self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  # original
self.act = nn.LeakyReLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  #modified

Re-training is required and the original checkpoint cannot be used. Each epoch trains faster, but the model takes more epochs to converge. Switching to LeakyReLU activations results in slightly lower mAP but faster execution, so it is a performance trade-off.

For my specific task of detecting a single class of small objects, I prune the other two detection heads for medium and large objects. In addition, I select only the first 5 columns to output just the x,y,w,h,conf results, so the output shape becomes 1x19200x5 instead of 1x25200x85. This further speeds up both the network execution and the detection post-processing, as sketched below.
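
As a sketch of that column selection (hypothetical: z here stands for the list of flattened per-head outputs at the end of Detect.forward(), before concatenation; the exact variable names depend on your yolov5 version):

# keep only x, y, w, h, conf out of the 85 columns before concatenating
z = [zi[..., :5] for zi in z]
out = torch.cat(z, 1)  # becomes 1x19200x5 once only the stride-8 head remains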

After all these optimizations, the final execution time on the DSP drops to 25ms (almost real time) on the same Snapdragon 870 device. The precision is also not much affected, although the model is less robust and stable than the original yolov5s. If your main concern is speed, apply these optimizations to your model; otherwise just use the original one.

Even faster execution may be achieved by switching to C++ and the yolov5n model.

Good luck!

glenn-jocher commented 2 years ago

@eeyzl5 awesome, thanks for the detailed feedback!

JISHNUSHAJI commented 2 years ago

@eeyzl5 thanks for the detailed explanation

hansoullee20 commented 2 years ago

@eeyzl5

Thank you for sharing the details with us.

I am also trying to use the DSP on an embedded system. I followed your advice and made the modifications in yolo.py, but I am unable to run the train.py script. When I run train.py after following your instructions up to the Running with DSP section, I get the following error:

 Epoch   gpu_mem       box       obj       cls    labels  img_size

0%| | 0/44 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "./yolov5/train.py", line 643, in <module>
    main(opt)
  File "./yolov5/train.py", line 539, in main
    train(opt.hyp, opt, device, callbacks)
  File "./yolov5/train.py", line 330, in train
    pred = model(imgs)  # forward
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/tmp/yolov5/models/yolo.py", line 128, in forward
    return self._forward_once(x, profile, visualize)  # single-scale inference, train
  File "/root/tmp/yolov5/models/yolo.py", line 151, in _forward_once
    x = m(x)  # run
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/tmp/yolov5/models/yolo.py", line 55, in forward
    x[i] = x[i].view(self.na, self.no, ny, nx).permute(0, 2, 3, 1).contiguous()  # modified
RuntimeError: shape '[3, 6, 80, 80]' is invalid for input of size 921600

Have you made any other modifications you haven't shared with us?

Thank you for reading.

eeyzl5 commented 2 years ago

> I am also trying to use the DSP on an embedded system. I followed your advice and made the modifications in yolo.py, but I am unable to run the train.py script. [...] Have you made any other modifications you haven't shared with us?

@hansoullee20

Hi, just a reminder that the modifications only apply to the deployment steps, after you've already got a trained model and wish to export it to an SNPE-compatible format. So keep the original code while training, then apply the modifications when you export to onnx format (refer to this), and then export to dlc format.

hansoullee20 commented 2 years ago

@eeyzl5

Thank you very much for your comment. We were able to implement the model on the device and execute it on the DSP. However, we are now encountering a serious issue where nothing gets detected. Have you had similar issues in the past?

When we run the model on the CPU, it seems to detect something, but accuracy and speed are still highly compromised.

Any recommendations would be much appreciated. Thank you in advance.

ravineti commented 2 years ago

Hi - is there any reference implementation for integrating YOLOv5 in an Android app using SNPE?

We are able to successfully convert the DLC and run it on a Snapdragon device using the ARM CPU, GPU, and DSP runtimes. However, we are looking for any pre-/post-processing reference code in JNI or Java.

eeyzl5 commented 2 years ago

> Thank you very much for your comment. We were able to implement the model on the device and execute it on the DSP. However, we are now encountering a serious issue where nothing gets detected. [...]

@hansoullee20

Hi, I was able to get correct detections. If you are running on the DSP, please refer to the "Running with DSP" section in my comment above; otherwise you may not get correct results, especially if you don't do the post-processing on the CPU. Post-processing includes all the operations after the 5d reshape. Again, you may refer to my sample code. My suggestion is to start with the default official model and compare the raw output values from the PC and from your snpe device.
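
For that PC-vs-device comparison, a rough sketch (the file names, the "images" input name from the default export, and the Result_0 raw output path from snpe-net-run are assumptions; adjust them to your setup, and note SNPE raw inputs are typically NHWC, so the input may need transposing):

import numpy as np
import onnxruntime as ort

# run the exported ONNX model on the PC
img = np.fromfile("input.raw", dtype=np.float32).reshape(1, 3, 640, 640)
pc_out = ort.InferenceSession("yolov5s.onnx").run(None, {"images": img})[0].ravel()

# load the raw output produced by snpe-net-run on the device
snpe_out = np.fromfile("output/Result_0/output.raw", dtype=np.float32)
print("max abs diff:", np.abs(pc_out - snpe_out).max())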

rszeto-sy commented 2 years ago

I ran into a problem running the code from @eeyzl5's detailed answer (linked here for brevity) but found a likely solution. There's a bug in the "Running with DSP" section where, in the final if statement, the grid location and anchors are not set correctly. The posted version only subtracts num_filters[1] from ci, whereas it should subtract (num_filters[1] + num_filters[0]) so that all grid locations and anchors are sampled correctly. This is what the final if statement should look like:

gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2];
gridY = (int)(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2]);
anchor_gridX = anchorX[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
anchor_gridY = anchorY[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
stride = 32;

This is what it looks like inside the entire code snippet:

float conf = 0.3;
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
for (int c=4;c<values.length;c+=85) {
    if (values[c]>=conf) {
        float cx = values[c-4];
        float cy = values[c-3];
        float w = values[c-2];
        float h = values[c-1];
        int gridX, gridY;
        int anchor_gridX, anchor_gridY;
        int[] anchorX = {10,16,33,30,62,59,116,156,373};
        int[] anchorY = {13,30,23,61,45,119,90,198,326};
        int[] num_filters = {19200,4800,1200};
        int[] filter_size = {80,40,20};
        int stride;
        int ci = (int)(c/85);
        if (ci<num_filters[0]) {
            gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
            gridY = (int)((ci%(filter_size[0]*filter_size[0]))/filter_size[0]);
            anchor_gridX = anchorX[((int)(ci/(filter_size[0]*filter_size[0])))];
            anchor_gridY = anchorY[((int)(ci/(filter_size[0]*filter_size[0])))];
            stride = 8;
        } else if (ci>=num_filters[0]&&ci<num_filters[0]+num_filters[1]) {
            gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1];
            gridY = (int)(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1]);
            anchor_gridX = anchorX[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
            anchor_gridY = anchorY[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
            stride = 16;
        } else {
            gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2];
            gridY = (int)(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2]);
            anchor_gridX = anchorX[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
            anchor_gridY = anchorY[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
            stride = 32;
        }
        cx = (float)(cx*2-0.5+gridX)*stride;
        cy = (float)(cy*2-0.5+gridY)*stride;
        w = w*2*w*2*anchor_gridX;
        h = h*2*h*2*anchor_gridY;
        float left = cx-w/2;
        float top = cy-h/2;
        float right = cx+w/2;
        float bottom = cy+h/2;
        float obj_conf = values[c];
    }
}

And since I happened to need this in Python, here's that too in case it's useful (it returns a copy instead of operating in-place):

def postprocess_raw_output(
        values,
        anchorX=[10,16,33,30,62,59,116,156,373],
        anchorY=[13,30,23,61,45,119,90,198,326],
        num_filters=[19200,4800,1200],
        filter_size=[80,40,20],
        last_dim_size=85
    ):

    ret = values.copy()

    for c in range(4, values.size, last_dim_size):
        cx = values[c-4]
        cy = values[c-3]
        w = values[c-2]
        h = values[c-1]

        ci = int(c / last_dim_size)
        if ci < num_filters[0]:
            gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0]
            gridY = int((ci%(filter_size[0]*filter_size[0]))/filter_size[0])
            anchor_gridX = anchorX[int(ci/(filter_size[0]*filter_size[0]))]
            anchor_gridY = anchorY[int(ci/(filter_size[0]*filter_size[0]))]
            stride = 8
        elif ci>=num_filters[0] and ci<(num_filters[0]+num_filters[1]):
            gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1]
            gridY = int(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1])
            anchor_gridX = anchorX[int((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3]
            anchor_gridY = anchorY[int((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3]
            stride = 16
        else:
            gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2]
            gridY = int(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2])
            anchor_gridX = anchorX[int((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6]
            anchor_gridY = anchorY[int((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6]
            stride = 32

        cx = float(cx*2-0.5+gridX)*stride
        cy = float(cy*2-0.5+gridY)*stride
        w = w*2*w*2*anchor_gridX
        h = h*2*h*2*anchor_gridY
        ret[c-4:c] = [cx, cy, w, h]

    return ret
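
A minimal usage sketch (assuming the flattened output was saved as a float32 raw file, e.g. by snpe-net-run; the file name is illustrative):

import numpy as np

raw = np.fromfile("Result_0/output.raw", dtype=np.float32)  # flattened 1x25200x85
decoded = postprocess_raw_output(raw)
boxes = decoded.reshape(-1, 85)   # per row: cx, cy, w, h in pixels, obj conf, class scores
keep = boxes[boxes[:, 4] >= 0.3]  # confidence threshold before NMS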

Hopefully this is of use particularly to @hansoullee20.

fwzdev1 commented 1 year ago

> Hi all, just to share my recent exploration of running yolov5 with SNPE. [...] Good luck!

Thank you so much for such a detailed reply, practically a report!

I was doing exactly the same thing you did, detecting a single class of objects smaller than 10x10. What I did to the network is almost the same as you explained, except for changing silu to leakyrelu (I made it relu instead). I tested yolov5n with backbone scale = 0.2 or 0.3, and the network took about 8-10 ms on the DSP (Snapdragon 855, 640x640). I finally chose nanodet-plus with the same custom changes for more convenient pre- and post-processing code (the nanodet repository contains official pre- and post-processing code for SNPE).

But there is another problem. Network inference takes about 10 ms, which is acceptable. Pre- and post-processing change the picture: I remember it was about 3 ms for pre- and 4 ms for post-processing, so the total time was over 17 ms for a single image. Compared with doing it on the CPU (1 ms pre, 1 ms post), the DSP path took 2 or 3 ms more on data transfer and computation (in OpenCV).

I wonder whether you have been bothered by this issue or not.

Finally, great appreciation for your sharing!

saadwaraich1 commented 1 year ago

Hi all,

I convert yolov5n.pt --imgsz 320 to yolov5n.onnx --imgsz 192 320 (the concept of letterbox), and then use SNPE 1.58 to convert to yolov5n.dlc --imgsz 192 320 (the concept of letterbox).

I use "gst-launch-1.0 / qtimlesnpe" to parse yolov5n.dlc for a demo; the detection effect is very good, close to lossless conversion.

image: (screenshot omitted)

video: https://drive.google.com/file/d/1-eEi8dkh_3mLxd3CEnRpqPFLJJ4G5FPH/view?usp=sharing

Hey, thanks for the help. I am able to convert the way you mentioned. I am trying to demo using gstreamer and qtimlesnpe. I can see the model running, as the pipeline takes some time, but there are no bounding boxes on the video. I have seen this behavior before, and reverting to a previous version of libqtioverlay.so solved the issue; not sure why, but it worked. @jayer95 @Mohit-Ak Any idea how I can deal with this, or could you find which version of the libqtioverlay.so library was used on your end? Thanks

jayer95 commented 1 year ago

@saadwaraich1 Hi, thank you for your reply,

The code of libqtioverlay.so and the other qtimlesnpe plugins needs to be rewritten and rebuilt. The main thing is to write the code for parsing the 4D output of yolov5.

Are you a Qualcomm chip customer? Please contact Qualcomm's customer support directly and ask Qualcomm's technical staff by raising a case.

wofvh commented 1 year ago

So, good news! Seems like yolov5 is now compatible with SNPE! Pull from the master branch, export to onnx, and convert to dlc without specifying out_node. Would appreciate any inputs on how to proceed from here in SNPE :) @hansoullee20 hi hansoul, I'm also trying to run yolov5 on the snpe-sdk. May I email you?

hansoullee20 commented 1 year ago

Hello,

How far did you get with running it? Is there a specific part of the thread above that you don't understand?


glenn-jocher commented 1 year ago

@wofvh Hello,

Thank you for the information. As I understand it, yolov5 is now compatible with SNPE and the issue appears to be resolved. I would appreciate it if you could share additional information on how to proceed.

Thank you.

wofvh commented 1 year ago

> How far did you get with running it? Is there a specific part of the thread above that you don't understand? [...]

@hansoullee20 Thank you very much for your reply. Following the Qualcomm tutorial on Linux 18.04 x86_64, I have gotten as far as converting the inceptionv3 model to dlc and quantizing it to reduce its size. I now have yolov5 .onnx weights, and I would like to know how to convert them to dlc with the snpe-sdk and run them on a Qualcomm RB5 rather than a Snapdragon phone! I would really appreciate your help.

glenn-jocher commented 1 year ago

@hansoullee20 Hello,

That is good news. It looks like yolov5 is now compatible with SNPE and the issue has been resolved. However, since I have not run it myself, it is difficult for me to give exact guidance on how to proceed.

If you need more information, it would be best to analyze the code or consult the documentation related to your question. If you have a more specific question, feel free to ask anytime.

Thank you.

hansoullee20 commented 1 year ago

qualcomm RB5μ—μ„œ μ‹€ν–‰μ‹œν‚¬μˆ˜μžˆλŠ”μ§€λŠ” 잘 λͺ¨λ₯΄κ² μŠ΅λ‹ˆλ‹€. μ–‘μžν™” κΉŒμ§€ μ™„λ£Œ ν•˜μ…§λ‹€λ©΄ ν˜Ήμ‹œ 5D μ—μ„œ 4Dλ³€ν™˜ν›„ DLC μƒμ„±ν•˜κ³  μ–‘μžν™” ν•˜μ…§λ‚˜μš”? ν˜Ήμ€ onnxμ—μ„œ dlc둜 λ³€ν™˜ν•˜μ‹œλŠ”κ²Œ κΆκΈˆν•˜μ‹ κ±΄κ°€μš”?


wofvh commented 1 year ago

@hansoullee20 Using the basic method on the Qualcomm SDK page, I converted the inception_v3 model from 5D to 4D and got as far as quantization, but I don't know how to convert the YOLOv5 weights to a dlc file and quantize them. I'm quite new to this area, so I would appreciate a simple explanation. The onnx file has already been converted. Thank you.

Currently on Ubuntu 18.04, Python 3.6.9.

glenn-jocher commented 1 year ago

Hello @wofvh,

I have been following the Qualcomm SDK page and was able to convert the Inception_v3 model from 5D to 4D and perform quantization successfully. However, I am having difficulty with converting YOLOv5 weights to a dlc file and then performing quantization. As I am relatively new to this topic, I was hoping you could provide some guidance on how to proceed with these steps in a relatively easy-to-understand manner. Currently, the onnx file has been converted already. Thank you for your help.

Best,

aleshem commented 7 months ago

Hi Glenn, I managed to convert to dlc and run after quantization; I put my changes in this fork: yolov5_snpe_conversion. However, I have been having some trouble quantizing the network using real images: the results look worse than with zeros as the input bin files. Does anyone have experience with this and know what good practices are for this conversion? Thanks

glenn-jocher commented 7 months ago

@aleshem hi there,

It's great to hear that you've managed to convert to dlc and run after quantization. Regarding the issues you're facing with quantizing the network using real images, it's not uncommon to encounter challenges during this process. Quantization can sometimes lead to a degradation in model performance, especially if the quantization process or the selection of calibration images is not optimal.

Here are a few general tips that might help improve the quantization results:

  1. Calibration Dataset: Ensure that the dataset used for calibration is representative of the actual use case and diverse enough to cover various scenarios the model will encounter (see the sketch after this list).
  2. Quantization Strategy: Experiment with different quantization strategies. For instance, symmetric vs. asymmetric quantization, per-channel vs. per-tensor quantization, and so on.
  3. Model Fine-tuning: After quantization, it might be beneficial to fine-tune the model with a small learning rate for a few epochs to regain some of the lost accuracy.
  4. Quantization-aware Training: If possible, consider quantization-aware training, where the model is trained with simulated quantization, making it more robust to the effects of quantization.
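
For reference on the calibration step, a typical SNPE post-training quantization invocation looks roughly like this (a sketch; the flags follow the SNPE 1.x snpe-dlc-quantize docs and the file names are placeholders):

# calibration_inputs.txt lists paths to preprocessed float32 .raw input tensors,
# one per line; these drive the quantization range estimation
snpe-dlc-quantize --input_dlc best.dlc \
                  --input_list calibration_inputs.txt \
                  --output_dlc best_quantized.dlc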

Unfortunately, without more specific details, it's challenging to provide more targeted advice. I recommend reviewing the documentation and resources available for the specific quantization tools you're using, as they might offer insights or best practices specific to their methodology.

Remember, the community and forums dedicated to the specific tools or frameworks you're using can also be valuable resources for advice and troubleshooting.

Best of luck with your quantization efforts, and feel free to reach out if you have more specific questions or issues.

Best regards.

BaoHaoo commented 5 months ago

> Hi all, just to share my recent exploration of running yolov5 with SNPE. [...] Good luck!

Hello, thank you very much for your detailed report. Based on your work, I further tested the quantized YOLOv5 model running on the Qualcomm SNPE DSP. I found that in the final output tensor the object detection boxes were detected properly, but strangely their confidence scores (y[..., 4:6] in the code below) were very low. Looking at SNPE's quantization tool snpe-dlc-quantize, I speculate this is because SNPE adopts a rather basic method of post-training quantization: Q = round((FP32 - min) / scale_factor) + zero_point.

if self.inplace:
    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    y = torch.cat((xy, wh, y[..., 4:5]), -1)

Before this code block, xy, wh, and conf are tensors ranging from 0 to 1, so quantizing them does not cause much loss of accuracy. After the block, however, xy scales to a range roughly equal to the image size and wh scales to the pixel size of the target, while only the confidence remains within the range 0 to 1.

After this code block, all these variables (xy, wh, and conf) are concatenated into one tensor, and in SNPE one tensor shares one set of quantization parameters. During quantization, the scale is determined by the component with the largest value range; among xy, wh, and conf, that is xy. This scale causes a severe loss of resolution for the small-range components, so the confidence tends to zero, which is why there is no output after quantization. For example, for a 640x640 image with Int8 quantization, the scale factor is 640/256 = 2.5, which is larger than the entire range of conf. This is the fundamental cause of the significant quantization accuracy loss.

To address this issue, the strategy I employed is a very naive one: I multiply the confidence scores (conf) by a coefficient to scale them into the same range as xy and wh. This prevents excessive loss of precision during quantization. After the final model output, I divide by the same coefficient to recover normal detection confidences.

if self.inplace:
    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    conf = y[..., 4:6] * 1280  # 1280 is the max length of the input size, which can be the width of the input image
    y = torch.cat((xy, wh, conf), -1)

By using this method, adjustments only need to be made to the confidence scores at the end, without the need to add additional code elsewhere. Through testing, I found that the quantization accuracy loss resulting from this approach is acceptable.
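
On the application side, the only extra step is then to undo that scaling after inference; a minimal sketch (assuming the same 1280 coefficient baked into Detect() and a flattened 85-column output, with an illustrative file name):

import numpy as np

out = np.fromfile("Result_0/output.raw", dtype=np.float32).reshape(-1, 85)
out[:, 4:6] /= 1280.0  # divide by the same coefficient applied inside Detect()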

glenn-jocher commented 5 months ago

Hello,

Thanks for sharing your exploration and solution to the quantization accuracy loss issue when running YOLOv5 with SNPE. It's insightful to see how adjusting the range of the confidence scores can help mitigate precision loss due to quantization. Your approach to scale the confidence scores to be in line with other tensor ranges and then scaling back for final output is a clever workaround. This strategy could be beneficial for others facing similar quantization challenges.

It's always exciting to see community members contributing novel solutions to complex problems. Keep up the good work, and thank you for contributing to the broader knowledge base around YOLOv5 and SNPE!

aleshem commented 5 months ago

BTW, I managed to convert yolov8 using SNPE 2.10 (opset=12 in export) without any major changes. In case your chip supports this version, it may save you a lot of time. A sketch of the export is below.
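
For anyone following that route, the export would look something like this (a sketch using the ultralytics CLI; the snpe-onnx-to-dlc invocation assumes the SNPE 2.x tools and placeholder file names):

yolo export model=yolov8n.pt format=onnx opset=12
snpe-onnx-to-dlc --input_network yolov8n.onnx --output_path yolov8n.dlc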

glenn-jocher commented 5 months ago

Hey there!

That's fantastic news! πŸŽ‰ It's always great to hear about smooth conversions, especially with newer versions like YOLOv8 and SNPE 2.10. Your sharing could indeed save a lot of time for many in the community working on similar projects. If there are specific steps or minor tweaks that helped you along the way, feel free to drop those details. Every bit helps! Thanks for sharing, and happy coding! πŸ‘

aleshem commented 5 months ago

The major problem at the moment is that quantization doesn't work well for yolov8-pose; for some reason it ruins the confidence.

glenn-jocher commented 5 months ago

Hello!

Thanks for reaching out about the quantization issue with YOLOv8-pose. It's not uncommon for quantization to affect model confidence, as precision loss can significantly impact the network's output. πŸ€”

A potential approach is to experiment with different quantization techniques or tools that might offer better control over precision loss. Considering calibration datasets that closely represent your use case might also help mitigate this issue. It's all about finding the right balance for your specific scenario.

If this doesn't resolve the problem, could you share more details about the quantization method you're using? This info might provide further insights for troubleshooting.

Best regards!