ucb-bar / gemmini

Berkeley's Spatial Array Generator

Running ONNX Resnet18 model gets stuck with command ‘-O 99’ #324

Open Alwinnnn opened 10 months ago

Alwinnnn commented 10 months ago

Hi, I have implemented Rocket64b1gem16 on my FPGA with the default configuration and 8 GiB of DDR3. With '-O 99', the ONNX Resnet18 model sometimes runs to completion and gives the correct result, but sometimes it gets stuck. With '-O 1', the model runs every time, but inference takes longer. The Chipyard Spike simulator always runs this model correctly with both '-O 1' and '-O 99', and the program always runs correctly on Rocket64b1gem8. The results are compared below.

Below is the Rocket64b1gem16 result with '-O 99' for a run that completed. The model only occasionally runs correctly with '-O 99'.

debian@debian:~/imagenet_runner_0.7.1$ ./ort_test_gem16 -1 detection_quanV2.onnx -i images/2.jpg -x 2 -O 99
Loaded runner program
Using systolic in mode 2
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=input, type=1, num_dims=4: [1, 3, 320, 320, ]
Number of outputs = 12
Output 0 : name=299, type=1, num_dims=4: [1, 12, 20, 20, ]
Output 1 : name=301, type=1, num_dims=4: [1, 12, 10, 10, ]
Output 2 : name=303, type=1, num_dims=4: [1, 12, 5, 5, ]
Output 3 : name=305, type=1, num_dims=4: [1, 12, 3, 3, ]
Output 4 : name=307, type=1, num_dims=4: [1, 12, 2, 2, ]
Output 5 : name=309, type=1, num_dims=4: [1, 12, 1, 1, ]
Output 6 : name=300, type=1, num_dims=4: [1, 24, 20, 20, ]
Output 7 : name=302, type=1, num_dims=4: [1, 24, 10, 10, ]
Output 8 : name=304, type=1, num_dims=4: [1, 24, 5, 5, ]
Output 9 : name=306, type=1, num_dims=4: [1, 24, 3, 3, ]
Output 10 : name=308, type=1, num_dims=4: [1, 24, 2, 2, ]
Output 11 : name=310, type=1, num_dims=4: [1, 24, 1, 1, ]
Number of inputs = 1
Input 0 : name=input.1, type=1, num_dims=4: [1, 3, 256, 256, ]
Number of outputs = 1
Output 0 : name=231, type=1, num_dims=4: [1, 21, 64, 64, ]
yolox init
pose init
Loading image
Image dimensions: 256 256 3
Called into systolic conv
Using systolic pooling
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
0 11 2 0 0 0 84 140 207 264 0
resize took 0 cycles 5.487418 s
normalize_transpose took 0 cycles 3.011997 s
Done! Pre Process 1 took 0 cycles 8.499517 s
Done! Inference 1 took 0 cycles 5.220877 s
Done! Pre Process 1 took 0 cycles 1.010774 s

Below is the Rocket64b1gem16 result with '-O 99' for a run that got stuck. When the model gets stuck, it stops at the same place.

debian@debian:~/imagenet_runner_0.7.1$ ./ort_test_gem16 -1 detection_quanV2.onnx -2 pose_quanV2.onnx -i images/2.jpg -x 2 -O 99
Loaded runner program
Using systolic in mode 2
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=input, type=1, num_dims=4: [1, 3, 320, 320, ]
Number of outputs = 12
Output 0 : name=299, type=1, num_dims=4: [1, 12, 20, 20, ]
Output 1 : name=301, type=1, num_dims=4: [1, 12, 10, 10, ]
Output 2 : name=303, type=1, num_dims=4: [1, 12, 5, 5, ]
Output 3 : name=305, type=1, num_dims=4: [1, 12, 3, 3, ]
Output 4 : name=307, type=1, num_dims=4: [1, 12, 2, 2, ]
Output 5 : name=309, type=1, num_dims=4: [1, 12, 1, 1, ]
Output 6 : name=300, type=1, num_dims=4: [1, 24, 20, 20, ]
Output 7 : name=302, type=1, num_dims=4: [1, 24, 10, 10, ]
Output 8 : name=304, type=1, num_dims=4: [1, 24, 5, 5, ]
Output 9 : name=306, type=1, num_dims=4: [1, 24, 3, 3, ]
Output 10 : name=308, type=1, num_dims=4: [1, 24, 2, 2, ]
Output 11 : name=310, type=1, num_dims=4: [1, 24, 1, 1, ]
yolox init
Loading image
Image dimensions: 256 256 3
Called into systolic conv
Using systolic pooling
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic

Below is the Rocket64b1gem16 result with '-O 1'. The model runs correctly with '-O 1'.

debian@debian:~/imagenet_runner_0.7.1$ ./ort_test_gem16 -1 detection_quanV2.onnx -i images/2.jpg -x 2 -O 1
Loaded runner program
Using systolic in mode 2
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=input, type=1, num_dims=4: [1, 3, 320, 320, ]
Number of outputs = 12
Output 0 : name=299, type=1, num_dims=4: [1, 12, 20, 20, ]
Output 1 : name=301, type=1, num_dims=4: [1, 12, 10, 10, ]
Output 2 : name=303, type=1, num_dims=4: [1, 12, 5, 5, ]
Output 3 : name=305, type=1, num_dims=4: [1, 12, 3, 3, ]
Output 4 : name=307, type=1, num_dims=4: [1, 12, 2, 2, ]
Output 5 : name=309, type=1, num_dims=4: [1, 12, 1, 1, ]
Output 6 : name=300, type=1, num_dims=4: [1, 24, 20, 20, ]
Output 7 : name=302, type=1, num_dims=4: [1, 24, 10, 10, ]
Output 8 : name=304, type=1, num_dims=4: [1, 24, 5, 5, ]
Output 9 : name=306, type=1, num_dims=4: [1, 24, 3, 3, ]
Output 10 : name=308, type=1, num_dims=4: [1, 24, 2, 2, ]
Output 11 : name=310, type=1, num_dims=4: [1, 24, 1, 1, ]
yolox init
Loading image
Image dimensions: 256 256 3
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 25600, 147)
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 6400, 144)
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 6400, 144)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 6400, 144)
Called into systolic matmul!
Using accelerated matmul with dimensions (16, 6400, 144)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 16)
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 144)
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 288)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 288)
Called into systolic matmul!
Using accelerated matmul with dimensions (32, 1600, 288)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 32)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 288)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 576)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 400, 576)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 64)
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 1152)
Called into systolic add
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 100, 1152)
Called into systolic add
1x1 case!
Called into systolic matmul!
Using accelerated matmul with dimensions (256, 100, 128)
Called into systolic matmul!
Using accelerated matmul with dimensions (512, 25, 2304)
1x1 case!
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 25, 512)
Called into systolic matmul!
Using accelerated matmul with dimensions (256, 9, 1152)
1x1 case!
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 9, 256)
Called into systolic matmul!
Using accelerated matmul with dimensions (256, 4, 1152)
1x1 case!
Called into systolic matmul!
Using accelerated matmul with dimensions (64, 4, 256)
Called into systolic matmul!
Using accelerated matmul with dimensions (128, 1, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 1, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 4, 2304)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 9, 2304)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 25, 4608)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 100, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (24, 400, 576)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 1, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 4, 2304)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 9, 2304)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 25, 4608)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 100, 1152)
Called into systolic matmul!
Using accelerated matmul with dimensions (12, 400, 576)
0 11 2 0 0 0 84 140 207 264 0
resize took 0 cycles 5.440022 s
normalize_transpose took 0 cycles 2.139706 s
Done! Pre Process 1 took 0 cycles 7.579837 s
Done! Inference 1 took 0 cycles 17.962803 s
Done! Pre Process 1 took 0 cycles 1.224211 s

I also tried running this model on Rocket64b1gem8. It always runs correctly with '-O 99', and its inference time is much shorter than on gem16 (1.933709 s versus 5.220877 s), which is weird. Below is the Rocket64b1gem8 result with '-O 99'.

debian@debian:~/imagenet_runner_0.7.1$ ./ort_test_gem8 -1 detection_quanV2.onnx -i images/2.jpg -x 2 -O 99
Loaded runner program
Using systolic in mode 2
Using Onnxruntime C++ API
Number of inputs = 1
Input 0 : name=input, type=1, num_dims=4: [1, 3, 320, 320, ]
Number of outputs = 12
Output 0 : name=299, type=1, num_dims=4: [1, 12, 20, 20, ]
Output 1 : name=301, type=1, num_dims=4: [1, 12, 10, 10, ]
Output 2 : name=303, type=1, num_dims=4: [1, 12, 5, 5, ]
Output 3 : name=305, type=1, num_dims=4: [1, 12, 3, 3, ]
Output 4 : name=307, type=1, num_dims=4: [1, 12, 2, 2, ]
Output 5 : name=309, type=1, num_dims=4: [1, 12, 1, 1, ]
Output 6 : name=300, type=1, num_dims=4: [1, 24, 20, 20, ]
Output 7 : name=302, type=1, num_dims=4: [1, 24, 10, 10, ]
Output 8 : name=304, type=1, num_dims=4: [1, 24, 5, 5, ]
Output 9 : name=306, type=1, num_dims=4: [1, 24, 3, 3, ]
Output 10 : name=308, type=1, num_dims=4: [1, 24, 2, 2, ]
Output 11 : name=310, type=1, num_dims=4: [1, 24, 1, 1, ]
yolox init
Loading image
Image dimensions: 256 256 3
Called into systolic conv
Using systolic pooling
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic add
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
Called into systolic conv
0 11 2 0 0 0 84 140 207 264 0
resize took 0 cycles 1.830045 s
normalize_transpose took 0 cycles 1.073210 s
Done! Pre Process 1 took 0 cycles 2.903357 s
Done! Inference 1 took 0 cycles 1.933709 s
Done! Pre Process 1 took 0 cycles 0.445910 s
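
To make the timing gap concrete, here is a rough Python sketch that pulls the inference time out of each successful log and compares it against the gem8 run. The helper name `inference_seconds` and the regular expression are my own assumptions about the log format, based only on the "Done! Inference 1 took 0 cycles ... s" lines above; they are not part of the runner.

```python
import re

# Matches lines such as "Done! Inference 1 took 0 cycles 5.220877 s".
INFERENCE_RE = re.compile(r"Inference \d+ took \d+ cycles ([0-9.]+) s")


def inference_seconds(log_text):
    """Return the first inference time (seconds) found in a runner log, or None."""
    match = INFERENCE_RE.search(log_text)
    return float(match.group(1)) if match else None


# Timing lines copied from the three successful runs in this report.
logs = {
    "gem16 -O 99": "Done! Inference 1 took 0 cycles 5.220877 s",
    "gem16 -O 1": "Done! Inference 1 took 0 cycles 17.962803 s",
    "gem8 -O 99": "Done! Inference 1 took 0 cycles 1.933709 s",
}

times = {name: inference_seconds(text) for name, text in logs.items()}
baseline = times["gem8 -O 99"]

for name, seconds in times.items():
    # Ratio > 1 means slower than the gem8 '-O 99' run.
    print(f"{name}: {seconds:.3f} s ({seconds / baseline:.2f}x gem8 -O 99)")
```

On these numbers, gem16 with '-O 99' takes roughly 2.7 times as long as gem8 with '-O 99', and gem16 with '-O 1' roughly 9.3 times as long. Note that the cycle counter reads 0 in every log, so only the wall-clock seconds are usable here.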

I also switched to 2 GiB of DDR3 and got the same result: the model gets stuck at the same place. What might be the problem? Thanks!
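
Since the hang is intermittent, it may help to measure how often it happens and where the output stops. Below is a minimal Python sketch of a harness that reruns a command with a timeout and records the last line printed before each hang. The function names `run_once` and `repeat` and the timeout value are hypothetical choices of mine, not part of the runner or of Gemmini.

```python
import subprocess
from collections import Counter


def run_once(cmd, timeout_s):
    """Run cmd and return (status, last_output_line).

    status is "ok", "failed" (non-zero exit) or "hang" (timeout expired).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        output = result.stdout or ""
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; exc.output holds what was captured
        # so far and can be None or bytes, even when text=True was requested.
        output = exc.output or ""
        if isinstance(output, bytes):
            output = output.decode(errors="replace")
        status = "hang"
    lines = [line for line in output.splitlines() if line.strip()]
    return status, (lines[-1] if lines else "")


def repeat(cmd, runs, timeout_s):
    """Run cmd several times; return status counts and the last line of each hang."""
    counts = Counter()
    hang_lines = []
    for _ in range(runs):
        status, last_line = run_once(cmd, timeout_s)
        counts[status] += 1
        if status == "hang":
            hang_lines.append(last_line)
    return counts, hang_lines
```

For this report it could be called as `repeat(["./ort_test_gem16", "-1", "detection_quanV2.onnx", "-i", "images/2.jpg", "-x", "2", "-O", "99"], runs=20, timeout_s=120)`, where 120 s is just a guess that is comfortably above the run times in the logs. One caveat: a program writing to a pipe usually buffers its output, so the last captured line may lag behind the real hang point unless the runner flushes after each message.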