google / CFU-Playground

Want a faster ML processor? Do it yourself! -- A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.
http://cfu-playground.rtfd.io/
Apache License 2.0

Data corruption error (CRC error) #700

Open bala122 opened 1 year ago

bala122 commented 1 year ago

Hi @tcal-x and @mithro, I'm getting the following error when uploading to the Arty A7 board with the make load command. This hasn't happened before, and I'm not sure why it has come up now.

--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[LITEX-TERM] Received firmware download request from the device.
[LITEX-TERM] Uploading /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_sparse/build/software.bin to 0x40000000 (1430104 bytes)...
[LITEX-TERM] Upload calibration... (inter-frame: 10.00us, length: 64)
[LITEX-TERM] Upload to device failed due to data corruption (CRC error)

I get the above error when running a Vivado build at 100MHz, even though it meets timing. At 75MHz it also meets timing and there is no error, although the upload is slow. The command, including the speed, is shown below.

/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/soc/bin/litex_term --speed 1843200 --kernel /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_sparse/build/software.bin /dev/ttyUSB1

I've also tried reducing the speed, as suggested in another issue. When I tried adding --safe, the word "safe" didn't appear anywhere and it didn't work.

Please let me know about this as soon as possible. I need the board urgently. Thanks a lot, Bala.

tcal-x commented 1 year ago

@bala122, yes, this is probably related to https://github.com/google/CFU-Playground/issues/657. What cache sizes are you using? I think we saw it specifically when there was no dCache.

I'm not sure if I mentioned it explicitly in the other issue, but there is no way to add --safe to the litex_term line other than cutting and pasting the entire line and adding --safe and rerunning. With --safe, the upload speed is MUCH slower.

I have not tried reducing the UART_SPEED, but that may be worth trying. You would need to set it to the same value (lower than the default 1843200) for both the bitstream/prog step and the load step.
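
To get a feel for what lowering UART_SPEED costs, here is a rough back-of-the-envelope sketch in C. It assumes 8N1 framing (10 bits on the wire per payload byte) and ignores any protocol overhead, so it only gives a lower bound on upload time. The function name min_upload_ms is made up for this illustration, and 115200 is just an example of a lower rate, not a value recommended in this thread.

```c
#include <stdint.h>

/* Lower bound on serial upload time in milliseconds.
 * Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop bit per byte,
 * and no framing/ACK overhead from the boot protocol. */
uint64_t min_upload_ms(uint64_t num_bytes, uint64_t baud)
{
    const uint64_t bits_per_byte = 10; /* assumed 8N1 */

    if (baud == 0) {
        return UINT64_MAX; /* no valid rate: treat as "never finishes" */
    }
    return (num_bytes * bits_per_byte * 1000u) / baud;
}
```

For the 1430104-byte software.bin from the log above, this gives roughly 7.8 seconds at 1843200 baud and roughly 124 seconds at 115200 baud, before any extra slowdown from --safe.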

bala122 commented 1 year ago

I'm using roughly a 32B D-cache (16B block), a 512B I-cache and a 64B L2. It wasn't working with this config. When I changed the D-cache to 64B with a 32B block, it worked. Is the small D-cache a problem? Why would that be?

bala122 commented 1 year ago

Another weird issue I've encountered is that, when working with big models, calls to the CFU stop returning the correct value in the long run. This could be a synthesis issue, but it isn't reflected in synthesis. However, when I use a print statement to check the value, it seems to work fine. For instance:

val_to_be_returned = cfu_opx(x, x, x);
val_to_be_returned_test = <something>;
//printf("Check: val test %d val returned %d", val_to_be_returned_test, val_to_be_returned);

On uncommenting the print, it works fine.

The issue also went away when I added some dummy statements before the main code. I'm guessing there is some subtle bug in VexRiscv related to CFU function calls.
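
Since adding a printf near the CFU call seems to hide the problem, one possible way to narrow it down is to keep all printing out of the loop: compare every CFU result against a software reference, record only the first mismatch and a count, and report once at the end. The sketch below is a host-side illustration under that assumption. All names in it (op_fn, check_result, check_against_reference, ref_op, flaky_op) are hypothetical and not part of CFU-Playground; on the board, the cfu argument would be a thin wrapper around the real CFU call. Whether this avoids disturbing the bug on real hardware is untested.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature shared by the CFU wrapper and its software reference. */
typedef uint32_t (*op_fn)(uint32_t a, uint32_t b);

typedef struct {
    size_t   mismatches;  /* number of calls where the results disagreed */
    size_t   first_index; /* index of first disagreement (valid if mismatches > 0) */
    uint32_t first_got;   /* value returned by the operation under test */
    uint32_t first_want;  /* value returned by the reference */
} check_result;

/* Run both functions over n input pairs without printing inside the loop. */
check_result check_against_reference(op_fn cfu, op_fn ref,
                                     const uint32_t *a, const uint32_t *b,
                                     size_t n)
{
    check_result r = {0, 0, 0, 0};

    for (size_t i = 0; i < n; i++) {
        uint32_t got = cfu(a[i], b[i]);
        uint32_t want = ref(a[i], b[i]);

        if (got != want) {
            if (r.mismatches == 0) {
                r.first_index = i;
                r.first_got = got;
                r.first_want = want;
            }
            r.mismatches++;
        }
    }
    return r;
}

/* Made-up reference operation, for demonstration only. */
uint32_t ref_op(uint32_t a, uint32_t b) { return a * b + 1u; }

/* Simulated faulty "hardware": returns 0 whenever a == 3. */
uint32_t flaky_op(uint32_t a, uint32_t b)
{
    return (a == 3u) ? 0u : ref_op(a, b);
}
```

After the loop finishes, a single printf of mismatches, first_index, first_got and first_want reports the outcome without adding code between individual CFU calls.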

tcal-x commented 1 year ago

Hi @bala122, I've been trying to reproduce the issue myself and have not been successful (I have not seen either of the software upload issues). I'm glad you found some working configurations. Yes, I think there are many variables that interact, and perhaps a bug, so that you see the issue only sometimes.