google / CFU-Playground

Want a faster ML processor? Do it yourself! -- A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.
http://cfu-playground.rtfd.io/
Apache License 2.0

VexRiscv configuration with iCache alone hangs (SFL related) #657

Open ShvetankPrakash opened 2 years ago

ShvetankPrakash commented 2 years ago

The following VexRiscv configuration hangs when run on the board:

generate+bypass:false+csrPluginConfig:mcycle+dCacheSize:0+hardwareDiv:false+iCacheSize:8192+mulDiv:false+prediction:none+safe:false+singleCycleMulDiv:false+singleCycleShift:false

All CPU params are set to false/zero except iCacheSize, which is set to 8192.
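For readers who want to check which options the string above actually turns on, here is a minimal sketch that splits it into fields. The `+`-separated `key:value` layout is inferred only from the string shown in this issue, and `parse_variant` is a hypothetical helper, not part of CFU-Playground.

```python
# Hypothetical helper: the "+"-separated key:value layout is inferred from
# the variant string quoted in this issue, not from CFU-Playground source.
VARIANT = (
    "generate+bypass:false+csrPluginConfig:mcycle+dCacheSize:0"
    "+hardwareDiv:false+iCacheSize:8192+mulDiv:false+prediction:none"
    "+safe:false+singleCycleMulDiv:false+singleCycleShift:false"
)


def parse_variant(name):
    """Split a variant string into bare tags and a dict of typed parameters."""
    tags, params = [], {}
    for field in name.split("+"):
        if ":" not in field:
            tags.append(field)  # e.g. the leading "generate"
            continue
        key, value = field.split(":", 1)
        if value in ("true", "false"):
            params[key] = value == "true"
        elif value.isdigit():
            params[key] = int(value)
        else:
            params[key] = value  # keep non-numeric settings as strings
    return tags, params


tags, params = parse_variant(VARIANT)

# Keep only the settings that are not false, zero, or "none".
enabled = {k: v for k, v in params.items() if v not in (False, 0, "none")}
print(tags)
print(enabled)
```

Besides iCacheSize, the only other setting that survives the filter is csrPluginConfig, which is a string option rather than a false/zero one.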

tcal-x commented 2 years ago

I think I reproduced this hang on the board, building directly in proj/mnv2_first/ (not in the DSE project).

I used Vivado. If this is what you were seeing, it's interesting that the same issue occurred on a different board, using a different toolchain. That would indicate that it's an issue in the design, not in the implementation.

The Vivado run met timing.

This is what I saw:

--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[LITEX-TERM] Received firmware download request from the device.
[LITEX-TERM] Uploading /media/tim/GIT/google/CFU-Playground/proj/mnv2_first/build/software.bin to 0x40000000 (2082744 bytes)...
[LITEX-TERM] Upload calibration... (inter-frame: 40.00us, length: 64)
[LITEX-TERM] Got unexpected response from device 'b' ''

So, this indicates an issue downloading the software binary over the serial link. This is litex_term on the host talking to the BIOS running on the VexRiscv on the board.

I tried connecting using litex_term in "safe mode" (note the added --safe option):

/media/tim/GIT/google/CFU-Playground/soc/bin/litex_term --safe --speed 1843200  --kernel /media/tim/GIT/google/CFU-Playground/proj/mnv2_first/build/software.bin /dev/ttyUSB1

With this, the CFU Playground software did download and boot correctly, although it was very slow.

Currently I don't have any theories about what could be causing this behavior.

ShvetankPrakash commented 2 years ago

If you build in the mnv2 proj, I think the software will be incorrect, since it does not account for the cfu=False in the variant name. However, I believe that would make the unit/golden tests fail. This hang occurs earlier than that: as you show above, it happens before the software bin is even downloaded. It is really weird that you can only reproduce the error on your board using Vivado while I get the hang using F4PGA...

tcal-x commented 2 years ago

To clarify, I only tried Vivado, and saw the hang there. I will try reproducing using Symbiflow now (my guess is that I'll see the same issue).

tcal-x commented 2 years ago

Hmm, with Symbiflow at sysclk=75MHz, over several attempts at connecting to the board and downloading the firmware, it mostly works for me, but sometimes doesn't.

Working:

--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[LITEX-TERM] Received firmware download request from the device.
[LITEX-TERM] Uploading /media/tim/GIT/google/CFU-Playground/proj/mnv2_first/build/software.bin to 0x40000000 (2082744 bytes)...
[LITEX-TERM] Upload calibration... (inter-frame: 640.00us, length: 64)
[LITEX-TERM] Upload complete (63.8KB/s).
[LITEX-TERM] Booting the device.
[LITEX-TERM] Done.
Executing booted program at 0x40000000

Not working:

--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[LITEX-TERM] Received firmware download request from the device.
[LITEX-TERM] Uploading /media/tim/GIT/google/CFU-Playground/proj/mnv2_first/build/software.bin to 0x40000000 (2082744 bytes)...
[LITEX-TERM] Upload calibration... (inter-frame: 320.00us, length: 64)
[LITEX-TERM] Upload to device failed due to data corruption (CRC error)

Notice the difference in the inter-frame period.
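As a rough illustration of why the inter-frame period matters, the sketch below estimates payload throughput for the three calibration results quoted in this thread. It assumes the link runs at the 1843200 baud shown in the --safe command, assumes 8N1 framing (10 bits on the wire per byte), and ignores frame headers and acknowledgement latency, so treat the numbers as ballpark figures only. `payload_rate` is a hypothetical helper, not a litex_term function.

```python
# Back-of-the-envelope estimate only; none of this is litex_term code.
BAUD = 1843200        # assumption: same speed as the --speed option quoted above
BITS_PER_BYTE = 10    # assumption: 8N1 framing (start + 8 data + stop)
FRAME_PAYLOAD = 64    # "length: 64" from the calibration lines


def payload_rate(inter_frame_us, baud=BAUD, payload=FRAME_PAYLOAD):
    """Return estimated payload bytes per second for a given inter-frame gap."""
    wire_time = payload * BITS_PER_BYTE / baud   # seconds spent sending one frame
    gap = inter_frame_us * 1e-6                  # idle time between frames
    return payload / (wire_time + gap)


for gap_us in (40.0, 320.0, 640.0):
    rate = payload_rate(gap_us)
    print(f"inter-frame {gap_us:6.1f} us -> roughly {rate / 1000:.1f} kB/s")
```

Under these assumptions the 640 us case lands in the same range as the 63.8KB/s reported in the working log, while shorter gaps push considerably more data per second at the device.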

This might be something that can be improved in the SFL code (I know Florent has made adjustments to it in the past).

I still don't have a theory about what it could be in this particular design that causes it to have this issue.