nvdla / hw

RTL, Cmodel, and testbench for NVDLA

Share of our work about an nv_small system #244

Closed: SCLUO closed this issue 2 years ago

SCLUO commented 5 years ago

Hi,

I’d like to share part of our work on an nv_small system. The system consists of an nv_small DLA controlled by a RISC-V MCU. Title: Lightweight DNN Processor Design

In this system, we set up a tool to generate DLA CFG files, following the same format as the sanity tests in the DLA master branch (similar to the input.txn in DLA v1). We extended the CFG from a single layer to many layers, including layer fusion and partitioning, and we also configured the DLA's double register files. (A brief flow is described on page 9.)
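To make the idea concrete, here is a minimal sketch of such a generator, assuming a simple `write <offset> <value>` line format and placeholder register offsets; the real sanity-test syntax and NVDLA register map differ:

```python
# Hypothetical sketch of a multi-layer CFG generator. The "write <offset> <value>"
# line format, the register offsets, and the per-group base addresses are
# placeholders, not the real sanity-test syntax or NVDLA register map.
def emit_layer(cfg, regs, group):
    # Double register files: alternate groups 0/1 so the next layer can be
    # programmed while the current one is still executing.
    base = 0x0000 if group == 0 else 0x1000  # assumed per-group offset
    for offset, value in regs:
        cfg.append("write 0x%04x 0x%08x" % (base + offset, value))

cfg = []
layers = [  # per-layer register lists, e.g. produced from a parsed NN description
    [(0x0200, 0x00000001), (0x0204, 0x000000E0)],  # layer 0 (placeholder values)
    [(0x0200, 0x00000001), (0x0204, 0x00000070)],  # layer 1, a partitioned tile
]
for i, regs in enumerate(layers):
    emit_layer(cfg, regs, group=i % 2)
print("\n".join(cfg))
```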

Retraining the weights is a must, and the retraining flow must suit the precision propagation in the DLA. Before the retrained weights are ready, a pattern generator and an outcome checker are available for basic functional tests.

There is an RTL waveform (on page 12) of a whole Tiny YOLO v1 inference, except the detection layer. The execution cycles at each layer tell some stories if you read them carefully. Some bugs currently exist in image mode, so we used direct convolution mode instead. A preliminary ASIC layout is shown at the end; it is not the final result, just a guide to the chip size.

We are with a research institute, and our project goal is to build an express solution ready for commercialization.

Suggestions and comments are welcome.

prasshantg commented 5 years ago

That is really good work @SCLUO

Do you have mAP numbers for YOLOv1 with this configuration?

SCLUO commented 5 years ago

Thanks. Our mAP numbers vary from 10% to 90% because of differences in the retraining flow or internal precision adjustments. Developing a retraining flow is quite critical to really using the accelerator. We are also considering the possibility of a more "general purpose" retraining flow, but it still diverges for now.

kouzhentao commented 5 years ago

Hi

As far as I know, PDP has an input width limitation for on-the-fly processing from SDP (128 for nv_small?). From your slides, it seems PDP is in the pipeline for YOLO. How could you do this? Thanks.

Kou

SCLUO commented 5 years ago

To fuse MACC-SDP-PDP, a large input needs to be cut (partitioned) in the width direction.
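A minimal sketch of that width partitioning, assuming the 128-pixel on-the-fly PDP limit quoted in the question above and non-overlapping pooling (kernel == stride); this illustrates the idea only, not SCLUO's actual tool:

```python
# Sketch of width-direction partitioning for a fused MACC-SDP-PDP pipeline.
# Assumes a 128-pixel on-the-fly PDP input limit (per the question above for
# nv_small) and non-overlapping pooling, so cuts only need to land on a
# pooling-stride boundary.
def width_partitions(in_width, max_width=128, pool_stride=2):
    """Yield (start, end) column ranges covering the input width."""
    start = 0
    while start < in_width:
        end = min(start + max_width, in_width)
        if end < in_width:
            end -= (end - start) % pool_stride  # snap cut to stride boundary
        yield (start, end)
        start = end

print(list(width_partitions(448)))  # e.g. Tiny YOLO v1 input width
# -> [(0, 128), (128, 256), (256, 384), (384, 448)]
```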

zhouweiscut commented 5 years ago

@SCLUO great work, but I have some questions. Because I cannot access your website, some may be very easy. Or you could send some documentation about your design to my email. Thanks.

  1. Could you describe the retraining flow in detail? Do you mean that you changed the YOLOv1 network cfg and weights shown on the official YOLO website?
  2. You mentioned "layer fusion and partition"; how does it work?
  3. Training always uses floating-point data types, but the RTL hardware only supports INT8/INT16. How do you convert FP to INT in your system?

SCLUO commented 5 years ago

"Retraining" does not change the NN model, but compress the weight from FP32 to INT8. Layer fusion is an advantage of NVDLA, while the key is to set registers correctly. Partition depends on the size of CONV buffer. There is a minimal size of weight and input data, which can be calculated by the atomic-k, atomic-c, and kernel size.

wxbbuaa2011 commented 5 years ago

Nice work, but I would like to know more details.

zhouweiscut commented 5 years ago

ok, thanks @SCLUO

nookfoo commented 5 years ago

Hi @SCLUO, are you looking to publish a paper regarding your work? I would be really interested in learning more about it.

SCLUO commented 5 years ago

@nookfoo, we have a plan to publish and will share once it is confirmed. BTW, our earlier demonstration may inspire some interesting thinking about applications: https://www.youtube.com/watch?v=sQ9oIjHF5ac Of course, the accuracy and speed still have some room for improvement. We are still working on it.

huangwei858 commented 5 years ago

@SCLUO it is good news for us that you have put so much effort into the NVDLA. We are stuck on some errors in the DLA. Would it be possible for your company to work with us? We would like to build a good relationship with your company. Please send me a private message if you see this reply. Thank you.

huangwei858 commented 5 years ago

Good job!

wxbbuaa2011 commented 5 years ago

It’s really good news; I hope to see it as soon as possible.

ghost commented 5 years ago

Test on ZCU102 board, running NN/NN_L0_1_small_fbuf:

creating new runtime context...
Emulator starting
submitting tasks...
[ 2210.896668] Enter:dla_read_network_config
[ 2210.903672] Exit:dla_read_network_config status=0
[ 2210.909643] Enter: dla_initiate_processors
[ 2210.915004] Enter: dla_submit_operation
[ 2210.920099] Prepare Convolution operation index 0 ROI 0 dep_count 1
[ 2210.927667] Enter: dla_prepare_operation
[ 2210.932896] processor:Convolution group:1, rdma_group:0 available
[ 2210.940346] Enter: dla_read_config
[ 2210.945125] Exit: dla_read_config
[ 2210.949831] Exit: dla_prepare_operation status=0
[ 2210.955870] Enter: dla_program_operation
[ 2210.961236] Program Convolution operation index 0 ROI 0 Group[1]
[ 2210.968764] no desc get due to index==-1
[ 2210.974179] no desc get due to index==-1
[ 2210.979577] no desc get due to index==-1
[ 2210.984954] no desc get due to index==-1
[ 2210.990313] no desc get due to index==-1
[ 2210.995644] Enter: dla_op_programmed
[ 2211.000630] Update dependency operation index 2 ROI 0 DEP_COUNT=2
[ 2211.008178] Update dependency operation index 1 ROI 0 DEP_COUNT=1
[ 2211.015723] enable SDP in dla_update_dependency as depdency are resolved
[ 2211.023905] Enter: dla_enable_operation
[ 2211.029220] exit dla_enable_operation without actual enable due to processor hasn't been programmed
[ 2211.039839] Exit: dla_enable_operation status=0
[ 2211.045950] Exit: dla_op_programmed
[ 2211.050997] Exit: dla_program_operation status=0
[ 2211.057176] Exit: dla_submit_operation
[ 2211.062484] Enter: dla_dequeue_operation
[ 2211.067953] Dequeue op from Convolution processor, index=2 ROI=0
[ 2211.075536] Enter: dla_submit_operation
[ 2211.080938] Prepare Convolution operation index 2 ROI 0 dep_count 1
[ 2211.088792] Enter: dla_prepare_operation
[ 2211.094292] processor:Convolution group:0, rdma_group:0 available
[ 2211.101991] Enter: dla_read_config
[ 2211.106995] Exit: dla_read_config
[ 2211.111900] Exit: dla_prepare_operation status=0
[ 2211.118105] Enter: dla_program_operation
[ 2211.123600] Program Convolution operation index 2 ROI 0 Group[0]
[ 2211.131226] no desc get due to index==-1
[ 2211.136737] no desc get due to index==-1
[ 2211.142218] no desc get due to index==-1
[ 2211.147668] no desc get due to index==-1
[ 2211.153083] no desc get due to index==-1
[ 2211.158471] Enter: dla_op_programmed
[ 2211.163499] Update dependency operation index 6 ROI 0 DEP_COUNT=3
[ 2211.171072] Update dependency operation index 3 ROI 0 DEP_COUNT=2
[ 2211.178616] Exit: dla_op_programmed
[ 2211.183552] Exit: dla_program_operation status=0
[ 2211.189625] Exit: dla_submit_operation
[ 2211.194807] Exit: dla_dequeue_operation
[ 2211.200064] Enter: dla_submit_operation
[ 2211.205327] Prepare SDP operation index 1 ROI 0 dep_count 0
[ 2211.212329] Enter: dla_prepare_operation
[ 2211.217670] processor:SDP group:0, rdma_group:1 available
[ 2211.224484] Enter: dla_read_config
[ 2211.229290] Exit: dla_read_config
[ 2211.233999] Exit: dla_prepare_operation status=0
[ 2211.240023] Enter: dla_program_operation
[ 2211.245346] Program SDP operation index 1 ROI 0 Group[0]
[ 2211.252076] no desc get due to index==-1
[ 2211.257384] no desc get due to index==-1
[ 2211.262694] no desc get due to index==-1
[ 2211.268009] no desc get due to index==-1
[ 2211.273323] Enter: dla_op_programmed
[ 2211.278305] Update dependency operation index 3 ROI 0 DEP_COUNT=1
[ 2211.285843] enable SDP in dla_update_dependency as depdency are resolved
[ 2211.293989] Enter: dla_enable_operation
[ 2211.299287] exit dla_enable_operation without actual enable due to processor hasn't been programmed
[ 2211.309866] Exit: dla_enable_operation status=0
[ 2211.315901] Exit: dla_op_programmed
[ 2211.320850] Exit: dla_program_operation status=0
[ 2211.326893] Enter: dla_enable_operation
[ 2211.332148] Enable SDP operation index 1 ROI 0
[ 2211.338022] Enter: dla_op_enabled
[ 2211.342757] Update dependency operation index 0 ROI 0 DEP_COUNT=1
[ 2211.350310] enable Convolution in dla_update_dependency as depdency are resolved
[ 2211.359193] Enter: dla_enable_operation
[ 2211.364502] Enable Convolution operation index 0 ROI 0
[ 2211.371107] Enter: dla_op_enabled
[ 2211.375865] Exit: dla_op_enabled
[ 2211.380508] Exit: dla_enable_operation status=0
[ 2211.386447] Exit: dla_op_enabled
[ 2211.391083] Exit: dla_enable_operation status=0
[ 2211.397023] Exit: dla_submit_operation
[ 2211.402167] Enter: dla_dequeue_operation
[ 2211.407468] Dequeue op from SDP processor, index=3 ROI=0
[ 2211.414162] Enter: dla_submit_operation
[ 2211.419400] Prepare SDP operation index 3 ROI 0 dep_count 0
[ 2211.426425] Enter: dla_prepare_operation
[ 2211.431824] processor:SDP group:1, rdma_group:0 available
[ 2211.438743] Enter: dla_read_config
[ 2211.443667] Exit: dla_read_config
[ 2211.448470] Exit: dla_prepare_operation status=0
[ 2211.454603] Enter: dla_program_operation
[ 2211.460027] Program SDP operation index 3 ROI 0 Group[1]
[ 2211.466866] no desc get due to index==-1
[ 2211.472307] no desc get due to index==-1
[ 2211.477739] no desc get due to index==-1
[ 2211.483155] no desc get due to index==-1
[ 2211.488544] Enter: dla_op_programmed
[ 2211.493573] Update dependency operation index 7 ROI 0 DEP_COUNT=2
[ 2211.501145] Exit: dla_op_programmed
[ 2211.506089] Exit: dla_program_operation status=0
[ 2211.512179] Enter: dla_enable_operation
[ 2211.517476] Enable SDP operation index 3 ROI 0
[ 2211.523382] Enter: dla_op_enabled
[ 2211.528145] Update dependency operation index 2 ROI 0 DEP_COUNT=1
[ 2211.535711] enable Convolution in dla_update_dependency as depdency are resolved
[ 2211.544641] Enter: dla_enable_operation
[ 2211.550034] Enable Convolution operation index 2 ROI 0
[ 2211.556769] Enter: dla_op_enabled
[ 2211.561671] Exit: dla_op_enabled
[ 2211.566466] Exit: dla_enable_operation status=0
[ 2211.572550] Exit: dla_op_enabled
[ 2211.577296] Exit: dla_enable_operation status=0
[ 2211.583307] Exit: dla_submit_operation
[ 2211.588484] Exit: dla_dequeue_operation
[ 2211.593743] Enter: dla_submit_operation
[ 2211.598996] Prepare PDP operation index 5 ROI 0 dep_count 1
[ 2211.606013] Enter: dla_prepare_operation
[ 2211.611383] processor:PDP group:1, rdma_group:1 available
[ 2211.618250] Enter: dla_read_config
[ 2211.623101] Exit: dla_read_config
[ 2211.627832] Exit: dla_prepare_operation status=0
[ 2211.633842] Enter: dla_program_operation
[ 2211.639146] Program PDP operation index 5 ROI 0 Group[1]
[ 2211.645852] group id 1 rdma id 1
[ 2211.650493] no desc get due to index==-1
[ 2211.655813] no desc get due to index==-1
[ 2211.661119] no desc get due to index==-1
[ 2211.666410] no desc get due to index==-1
[ 2211.671689] no desc get due to index==-1
[ 2211.676943] Enter: dla_op_programmed
[ 2211.681837] Update dependency operation index 11 ROI 0 DEP_COUNT=2
[ 2211.689363] Exit: dla_op_programmed
[ 2211.694182] Exit: dla_program_operation status=0
[ 2211.700123] Exit: dla_submit_operation
[ 2211.705173] Enter: dla_dequeue_operation
[ 2211.710387] Dequeue op from PDP processor, index=11 ROI=0
[ 2211.717109] Enter: dla_submit_operation
[ 2211.722266] Prepare PDP operation index 11 ROI 0 dep_count 1
[ 2211.729268] Enter: dla_prepare_operation
[ 2211.734523] processor:PDP group:0, rdma_group:0 available
[ 2211.741271] Enter: dla_read_config
[ 2211.746045] Exit: dla_read_config
[ 2211.750733] Exit: dla_prepare_operation status=0
[ 2211.756763] Enter: dla_program_operation
[ 2211.762096] Program PDP operation index 11 ROI 0 Group[0]
[ 2211.768910] group id 0 rdma id 0
[ 2211.773538] no desc get due to index==-1
[ 2211.778850] no desc get due to index==-1
[ 2211.784149] no desc get due to index==-1
[ 2211.789427] no desc get due to index==-1
[ 2211.794682] no desc get due to index==-1
[ 2211.799906] Enter: dla_op_programmed
[ 2211.804748] Update dependency operation index 22 ROI 0 DEP_COUNT=2
[ 2211.812207] Exit: dla_op_programmed
[ 2211.816961] Exit: dla_program_operation status=0
[ 2211.822846] Exit: dla_submit_operation
[ 2211.827860] Exit: dla_dequeue_operation
[ 2211.832958] Enter: dla_submit_operation
[ 2211.838041] Prepare CDP operation index 4 ROI 0 dep_count 2
[ 2211.844894] Enter: dla_prepare_operation
[ 2211.850106] processor:CDP group:1, rdma_group:1 available
[ 2211.856828] Enter: dla_read_config
[ 2211.861562] Exit: dla_read_config
[ 2211.866189] Exit: dla_prepare_operation status=0
[ 2211.872116] Enter: dla_program_operation
[ 2211.877340] Program CDP operation index 4 ROI 0 Group[1]
[ 2211.883957] Enter: dla_cdp_program
[ 2211.883960] Enter: processor_cdp_program
[ 2211.893984] Exit: processor_cdp_program
[ 2211.893985] Exit: dla_cdp_program
[ 2211.899117] no desc get due to index==-1
[ 2211.908890] no desc get due to index==-1
[ 2211.914053] no desc get due to index==-1
[ 2211.919231] no desc get due to index==-1
[ 2211.924404] no desc get due to index==-1
[ 2211.929580] Enter: dla_op_programmed
[ 2211.934423] Update dependency operation index 10 ROI 0 DEP_COUNT=3
[ 2211.941907] Exit: dla_op_programmed
[ 2211.946681] Exit: dla_program_operation status=0
[ 2211.952573] Exit: dla_submit_operation
[ 2211.957589] Enter: dla_dequeue_operation
[ 2211.962782] Dequeue op from CDP processor, index=10 ROI=0
[ 2211.969474] Enter: dla_submit_operation
[ 2211.974592] Prepare CDP operation index 10 ROI 0 dep_count 2
[ 2211.981545] Enter: dla_prepare_operation
[ 2211.986748] processor:CDP group:0, rdma_group:0 available
[ 2211.993436] Enter: dla_read_config
[ 2211.998119] Exit: dla_read_config
[ 2212.002687] Exit: dla_prepare_operation status=0
[ 2212.008579] Exit: dla_submit_operation
[ 2212.013591] Exit: dla_dequeue_operation
[ 2212.018685] Exit: dla_initiate_processors status=0
[ 2212.024728] Enter:dla_handle_events, processor:BDMA
[ 2212.030851] Exit:dla_handle_events, ret:0
[ 2212.036104] Enter:dla_handle_events, processor:Convolution
[ 2212.042876] Exit:dla_handle_events, ret:0
[ 2212.048188] Enter:dla_handle_events, processor:SDP
[ 2212.054317] Exit:dla_handle_events, ret:0
[ 2212.059688] Enter:dla_handle_events, processor:PDP
[ 2212.065855] Exit:dla_handle_events, ret:0
[ 2212.071217] Enter:dla_handle_events, processor:CDP
[ 2212.077357] Exit:dla_handle_events, ret:0
[ 2212.082717] Enter:dla_handle_events, processor:RUBIK
[ 2212.089054] Exit:dla_handle_events, ret:0
[ 2212.094454] Enter:dla_handle_events, processor:BDMA
[ 2212.100733] Exit:dla_handle_events, ret:0
[ 2212.106130] Enter:dla_handle_events, processor:Convolution
[ 2212.113015] Exit:dla_handle_events, ret:0
[ 2212.118409] Enter:dla_handle_events, processor:SDP
[ 2212.124584] Exit:dla_handle_events, ret:0
[ 2212.129985] Enter:dla_handle_events, processor:PDP
[ 2212.136181] Exit:dla_handle_events, ret:0
[ 2212.141594] Enter:dla_handle_events, processor:CDP
[ 2212.147788] Exit:dla_handle_events, ret:0
[ 2212.153183] Enter:dla_handle_events, processor:RUBIK
[ 2212.159561] Exit:dla_handle_events, ret:0

It hangs here and can't keep running.

What could be the cause of this problem? Has anyone else run into it, and how can it be fixed?

wxbbuaa2011 commented 5 years ago

@SCLUO Which parts are open source? For example, the compiler?

SCLUO commented 5 years ago

@huangwei858 and @wxbbuaa2011 Currently our tools, compiler, and retraining flows are confidential. I know many people want more information and open source. We will discuss internally and also wait for the release of the official NVIDIA compiler (another issue says it will come in Q1 2019), and then we will make a decision.
Thanks for asking; if you would like to know more, we can discuss it through email: scluo@itri.org.tw

wxbbuaa2011 commented 5 years ago

@SCLUO Thank you.

SCLUO commented 5 years ago

An update on our status: please take a look at our presentation at the recent VLSI-DAT 2019. https://www.slideshare.net/ShienChunLuo/customization-of-a-deep-learning-accelerator-based-on-nvdla

Here we improved the precision (quantization-aware training) and the speed (successfully porting nv_small_256 instead of nv_small_64).
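For readers unfamiliar with the INT8 step, a generic symmetric quantization looks like the sketch below (an illustration only, not ITRI's confidential flow); quantization-aware training additionally applies the same rounding in the forward pass so the network learns to tolerate it:

```python
import numpy as np

# Generic symmetric per-tensor INT8 quantization; an illustration only,
# not ITRI's retraining flow.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def fake_quant(w):
    # The "fake quant" op used during quantization-aware training:
    # quantize, then dequantize, so the forward pass sees INT8 rounding.
    q, scale = quantize_int8(w)
    return q.astype(np.float32) * scale

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
q, s = quantize_int8(w)
print(s, float(np.abs(fake_quant(w) - w).max()))
```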

wxbbuaa2011 commented 5 years ago

@SCLUO Nice work! What is the name of your article?

SCLUO commented 5 years ago

We found that many people have quite a lot of issues building NVDLA on FPGAs. So we are discussing with our managers whether to open some "quick" evaluation kits, such as our YOLO demo or classification demo. Then you could just copy the image onto an SD card, boot up a ZCU102, and enjoy the results.

Also, the RTL of nv_small here is almost fine, meaning it needs only some slight revisions before it can be implemented on an FPGA. For real applications, the main effort remains compiling the NN and retraining for 8-bit quantized IFMs and weights. So the synthesizable RTL for the ZCU102 may also be releasable, but the compile and quantize parts may still be packed into some kind of binary executable.

However, the above-mentioned items are under discussion, and I am collecting open comments here. Are you interested in a quick evaluation, or is a bare demo still not much help?

wxbbuaa2011 commented 5 years ago

@SCLUO Nice work, I am very much looking forward to your open-source version that can run on the FPGA. It would be even better if an executable compiler could run other networks.

wxbbuaa2011 commented 5 years ago

I really want to keep an eye on the progress of this project.

wxbbuaa2011 commented 5 years ago

Recently, the open-source compilers have become able to support the small NVDLA configurations.

SCLUO commented 4 years ago

Soon, we will share the FPGA implementation package on GitHub.

wxbbuaa2011 commented 4 years ago

@SCLUO I'm looking forward to seeing your research work.

SCLUO commented 4 years ago

We have opened the express implementation of nv_small_64, with RTL ready for FPGA synthesis. We also provide a prebuilt bitstream, a few test patterns, and two standard CNNs (ResNet-50 and Tiny YOLO) running with bare-metal code and a GUI interface, in case you just want to see a demo on the FPGA. https://github.com/SCLUO/ITRI-OpenDLA

prasshantg commented 4 years ago

@SCLUO thank you for sharing the work!!

wxbbuaa2011 commented 4 years ago

@SCLUO Thank you for sharing the work!

SCLUO commented 4 years ago

We also revised the performance estimation Excel sheets from here and merged them with Netron and Netscope. A free executable trial is available here. If you wonder about the accuracy: there is roughly a 10% difference when we compare the aggressive FPS estimates to RTL simulations. https://github.com/SCLUO/Open-DLA-Performance-Profiler
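For a sense of what such a profiler computes, here is a first-order, idealized cycle estimate (a sketch only; the actual sheets also model DMA traffic and pipeline stalls):

```python
# First-order cycle estimate for one convolution layer on an NVDLA-like MAC
# array; a rough sketch of what a performance profiler computes, not the
# actual spreadsheet model (which also accounts for DMA and stalls).
def conv_cycles(out_h, out_w, out_c, in_c, kh, kw, atomic_c=8, atomic_k=8):
    macs = out_h * out_w * out_c * in_c * kh * kw
    return macs / (atomic_c * atomic_k)  # MACs issued per clock, ideally

cycles = conv_cycles(112, 112, 16, 3, 3, 3)
print("%.0f cycles, ~%.1f layer runs/s at 100 MHz" % (cycles, 1e8 / cycles))
```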

wxbbuaa2011 commented 4 years ago

@SCLUO Thank you for your great work!

timzhang32 commented 2 years ago

Hi,

I'm currently implementing YOLOv1 on ZCU102, too. Since you guys already succeeded, may I ask you a technical question regarding the implementation? In the original prototxt, YOLO has a PReLU layer as follows:

layer {
  name: "relu1"
  type: "ReLU"
  bottom: "scale1"
  top: "relu1"
  relu_param { negative_slope: 0.1 }
}

It can be parsed by the compiler, but on NVDLA, it's automatically executed as a normal ReLU layer. I wonder how you guys dealt with this. Did you have the same issue? Is there a way to activate PReLU in sdp.c in the KMD?
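In other words, the behavioral difference is just the negative slope; a tiny NumPy illustration of what I expect versus what actually runs:

```python
import numpy as np

x = np.array([-2.0, -0.5, 1.0])
relu  = np.maximum(x, 0.0)           # what NVDLA currently executes
leaky = np.where(x > 0, x, 0.1 * x)  # what the prototxt asks for (slope 0.1)
print(relu, leaky)                   # negative inputs differ
```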

Thanks.

Best Tim

SCLUO commented 2 years ago

We built a new compiler for the NVDLA. Please search for the keyword "ITRI DLA" to get some idea. I will close this issue.

SCLUO commented 2 years ago

https://sites.google.com/view/itri-icl-dla