Hi,
I have implemented Rocket64b1gem16 on my FPGA with the default configuration and 8GiB of DDR3.
With the option '-O 99', the ONNX Resnet18 model sometimes runs to completion and gives the right result, but sometimes it gets stuck.
With the optimization option '-O 1', the model runs every time, but it takes longer.
In addition, the Chipyard Spike simulator always runs this model correctly with both '-O 1' and '-O 99'.
The program also always runs correctly on Rocket64b1gem8.
Here are the results for comparison.
Below is the rocket64b1gem16 result with '-O 99'. The model runs correctly with '-O 99' only occasionally.
debian@debian:~/imagenet_runner_0.7.1$ ./ort_test_gem16 -1 detection_quanV2.onnx -i images/2.jpg -x 2 -O 99
Loaded runner program
Using systolic in mode 2
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=input, type=1, num_dims=4: [1, 3, 320, 320, ]
Number of outputs = 12
Output 0 : name=299, type=1, num_dims=4: [1, 12, 20, 20, ]
Output 1 : name=301, type=1, num_dims=4: [1, 12, 10, 10, ]
Output 2 : name=303, type=1, num_dims=4: [1, 12, 5, 5, ]
Output 3 : name=305, type=1, num_dims=4: [1, 12, 3, 3, ]
Output 4 : name=307, type=1, num_dims=4: [1, 12, 2, 2, ]
Output 5 : name=309, type=1, num_dims=4: [1, 12, 1, 1, ]
Output 6 : name=300, type=1, num_dims=4: [1, 24, 20, 20, ]
Output 7 : name=302, type=1, num_dims=4: [1, 24, 10, 10, ]
Output 8 : name=304, type=1, num_dims=4: [1, 24, 5, 5, ]
Output 9 : name=306, type=1, num_dims=4: [1, 24, 3, 3, ]
Output 10 : name=308, type=1, num_dims=4: [1, 24, 2, 2, ]
Output 11 : name=310, type=1, num_dims=4: [1, 24, 1, 1, ]
Number of inputs = 1
Input 0 : name=input.1, type=1, num_dims=4: [1, 3, 256, 256, ]
Number of outputs = 1
Output 0 : name=231, type=1, num_dims=4: [1, 21, 64, 64, ]
yolox init
pose init
Loading image
Image dimensions: 256 256 3
Called into systolic conv
Using systolic pooling
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
0
11
2
0
0
0
84 140 207 264 0
resize took 0 cycles 5.487418 s
normalize_transpose took 0 cycles 3.011997 s
Done! Pre Process 1 took 0 cycles 8.499517 s
Done! Inference 1 took 0 cycles 5.220877 s
Done! Pre Process 1 took 0 cycles 1.010774 s
Below is the rocket64b1gem16 output with '-O 99' when the run gets stuck. The model sometimes gets stuck, and it does so at the same place.
debian@debian:~/imagenet_runner_0.7.1$ ./ort_test_gem16 -1 detection_quanV2.onnx -2 pose_quanV2.onnx -i images/2.jpg -x 2 -O 99
Loaded runner program
Using systolic in mode 2
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=input, type=1, num_dims=4: [1, 3, 320, 320, ]
Number of outputs = 12
Output 0 : name=299, type=1, num_dims=4: [1, 12, 20, 20, ]
Output 1 : name=301, type=1, num_dims=4: [1, 12, 10, 10, ]
Output 2 : name=303, type=1, num_dims=4: [1, 12, 5, 5, ]
Output 3 : name=305, type=1, num_dims=4: [1, 12, 3, 3, ]
Output 4 : name=307, type=1, num_dims=4: [1, 12, 2, 2, ]
Output 5 : name=309, type=1, num_dims=4: [1, 12, 1, 1, ]
Output 6 : name=300, type=1, num_dims=4: [1, 24, 20, 20, ]
Output 7 : name=302, type=1, num_dims=4: [1, 24, 10, 10, ]
Output 8 : name=304, type=1, num_dims=4: [1, 24, 5, 5, ]
Output 9 : name=306, type=1, num_dims=4: [1, 24, 3, 3, ]
Output 10 : name=308, type=1, num_dims=4: [1, 24, 2, 2, ]
Output 11 : name=310, type=1, num_dims=4: [1, 24, 1, 1, ]
yolox init
Loading image
Image dimensions: 256 256 3
Called into systolic conv
Using systolic pooling
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic
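Since the hang is intermittent, it may help to measure how often it happens and where the output stops. Below is a minimal Python sketch of one way to do that, not something from my actual setup: the helper name `count_hangs`, the 20 runs and the 120 s timeout are assumptions chosen for illustration (a successful run above finishes in well under a minute).

```python
import subprocess


def count_hangs(cmd, runs=20, timeout_s=120):
    """Run cmd repeatedly and count the runs that exceed timeout_s.

    Returns (hangs, last_lines), where last_lines holds the last stdout
    line captured from each run that timed out.
    """
    hangs = 0
    last_lines = []
    for _ in range(runs):
        try:
            subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired as exc:
            # subprocess.run kills the child before re-raising the timeout.
            hangs += 1
            out = exc.stdout or b""
            if isinstance(out, bytes):
                out = out.decode(errors="replace")
            lines = out.strip().splitlines()
            last_lines.append(lines[-1] if lines else "")
    return hangs, last_lines


# Hypothetical usage with the command from the log above:
# cmd = ["./ort_test_gem16", "-1", "detection_quanV2.onnx", "-2", "pose_quanV2.onnx",
#        "-i", "images/2.jpg", "-x", "2", "-O", "99"]
# hangs, last_lines = count_hangs(cmd)
# print(hangs, last_lines)
```

One caveat: when stdout is a pipe rather than a terminal, the program may buffer its output, so the last captured line can lag behind the point where the run actually stalled.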
Below is the rocket64b1gem16 result with '-O 1'. The model runs correctly with '-O 1'.
debian@debian:~/imagenet_runner_0.7.1$ ./ort_test_gem16 -1 detection_quanV2.onnx -i images/2.jpg -x 2 -O 1
Loaded runner program
Using systolic in mode 2
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=input, type=1, num_dims=4: [1, 3, 320, 320, ]
Number of outputs = 12
Output 0 : name=299, type=1, num_dims=4: [1, 12, 20, 20, ]
Output 1 : name=301, type=1, num_dims=4: [1, 12, 10, 10, ]
Output 2 : name=303, type=1, num_dims=4: [1, 12, 5, 5, ]
Output 3 : name=305, type=1, num_dims=4: [1, 12, 3, 3, ]
Output 4 : name=307, type=1, num_dims=4: [1, 12, 2, 2, ]
Output 5 : name=309, type=1, num_dims=4: [1, 12, 1, 1, ]
Output 6 : name=300, type=1, num_dims=4: [1, 24, 20, 20, ]
Output 7 : name=302, type=1, num_dims=4: [1, 24, 10, 10, ]
Output 8 : name=304, type=1, num_dims=4: [1, 24, 5, 5, ]
Output 9 : name=306, type=1, num_dims=4: [1, 24, 3, 3, ]
Output 10 : name=308, type=1, num_dims=4: [1, 24, 2, 2, ]
Output 11 : name=310, type=1, num_dims=4: [1, 24, 1, 1, ]
yolox init
Loading image
Image dimensions: 256 256 3
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 25600, 147)
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 6400, 144)
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 6400, 144)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 6400, 144)
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 6400, 144)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 16)
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 144)
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 288)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 288)
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 288)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 32)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 288)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 576)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 576)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 64)
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 1152)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 1152)
Called into systolic add
1x1 case!
Called into systolic matmul!
Using accelerated matmul with dimensions (256, 100, 128)
Called into systolic matmul!
Using accelerated matmul with dimensions (512, 25, 2304)
1x1 case!
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 25, 512)
Called into systolic matmul!
Using accelerated matmul with dimensions (256, 9, 1152)
1x1 case!
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 9, 256)
Called into systolic matmul!
Using accelerated matmul with dimensions (256, 4, 1152)
1x1 case!
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 4, 256)
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 1, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 1, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 4, 2304)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 9, 2304)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 25, 4608)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 100, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 400, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 1, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 4, 2304)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 9, 2304)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 25, 4608)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 100, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 400, 576)
0
11
2
0
0
0
84 140 207 264 0
resize took 0 cycles 5.440022 s
normalize_transpose took 0 cycles 2.139706 s
Done! Pre Process 1 took 0 cycles 7.579837 s
Done! Inference 1 took 0 cycles 17.962803 s
Done! Pre Process 1 took 0 cycles 1.224211 s
I also tried running this model on Rocket64b1gem8. It always runs correctly with '-O 99', and its inference time is much shorter than on gem16, which is weird.
Below is the rocket64b1gem8 result with '-O 99'.
debian@debian:~/imagenet_runner_0.7.1$ ./ort_test_gem8 -1 detection_quanV2.onnx -i images/2.jpg -x 2 -O 99
Loaded runner program
Using systolic in mode 2
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=input, type=1, num_dims=4: [1, 3, 320, 320, ]
Number of outputs = 12
Output 0 : name=299, type=1, num_dims=4: [1, 12, 20, 20, ]
Output 1 : name=301, type=1, num_dims=4: [1, 12, 10, 10, ]
Output 2 : name=303, type=1, num_dims=4: [1, 12, 5, 5, ]
Output 3 : name=305, type=1, num_dims=4: [1, 12, 3, 3, ]
Output 4 : name=307, type=1, num_dims=4: [1, 12, 2, 2, ]
Output 5 : name=309, type=1, num_dims=4: [1, 12, 1, 1, ]
Output 6 : name=300, type=1, num_dims=4: [1, 24, 20, 20, ]
Output 7 : name=302, type=1, num_dims=4: [1, 24, 10, 10, ]
Output 8 : name=304, type=1, num_dims=4: [1, 24, 5, 5, ]
Output 9 : name=306, type=1, num_dims=4: [1, 24, 3, 3, ]
Output 10 : name=308, type=1, num_dims=4: [1, 24, 2, 2, ]
Output 11 : name=310, type=1, num_dims=4: [1, 24, 1, 1, ]
yolox init
Loading image
Image dimensions: 256 256 3
Called into systolic conv
Using systolic pooling
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
0
11
2
0
0
0
84 140 207 264 0
resize took 0 cycles 1.830045 s
normalize_transpose took 0 cycles 1.073210 s
Done! Pre Process 1 took 0 cycles 2.903357 s
Done! Inference 1 took 0 cycles 1.933709 s
Done! Pre Process 1 took 0 cycles 0.445910 s
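To put the timing difference in numbers, here is a small Python sketch that computes ratios from the figures in the logs above. The dictionary layout and the `ratio` helper are only for illustration; the values are copied from the "took ... s" lines.

```python
# Timings in seconds, copied from the logs above.
timings = {
    "gem16 -O 99": {"resize": 5.487418, "normalize_transpose": 3.011997, "inference": 5.220877},
    "gem16 -O 1": {"resize": 5.440022, "normalize_transpose": 2.139706, "inference": 17.962803},
    "gem8 -O 99": {"resize": 1.830045, "normalize_transpose": 1.073210, "inference": 1.933709},
}


def ratio(slow, fast, stage):
    """Return how many times longer `stage` took in run `slow` than in run `fast`."""
    return timings[slow][stage] / timings[fast][stage]


for stage in ("resize", "normalize_transpose", "inference"):
    r = ratio("gem16 -O 99", "gem8 -O 99", stage)
    print(f"{stage}: gem16 / gem8 = {r:.2f}x")

r = ratio("gem16 -O 1", "gem16 -O 99", "inference")
print(f"inference on gem16: -O 1 / -O 99 = {r:.2f}x")
```

With '-O 99', every stage takes roughly 2.7 to 3.0 times longer on gem16 than on gem8, including resize and normalize_transpose. If those pre-processing stages run on the CPU only (an assumption on my side), the similar ratio might point to a system-level difference between the two builds rather than to the accelerator alone.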
I also switched the memory to 2Gib DDR3 and got the same result: the model gets stuck at the same place.
What might be the problem?
Thanks!