Open SanadaShinken opened 3 years ago
Hi,
Which board are you on? Did you check the place-and-route report?
BR Charles
Hi Charles:
I tested on the QMTech Wukong board and on our own FPGA board. Both boards use the XC7A100T FPGA; only the GPIOs differ.
About P&R: I have not checked it. Could you do me a favour and tell me how to check the P&R result?
BTW, if I enable FPU support on one CPU, the result is the same as with multiple CPUs.
BR, Sanada
Do you mean that even the BIOS is not starting with --cpu-count 2 (while it works fine with single core)?
If so, I see the same issue on the ecpix5. The log looks fine: both utilization (TRELLIS_SLICE: 16826/41820 40%) and max frequency ('$glbnet$sdrio_clk': 65.02 MHz (PASS at 50.00 MHz)). I don't have a good idea where to continue debugging.
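As a rough illustration of checking a place-and-route log for the two figures quoted above, here is a minimal Python sketch. It assumes the log contains fragments shaped like the two quoted here (a `NAME: used/total percent%` utilization entry and a `'clock': X MHz (PASS at Y MHz)` timing entry); the exact line format of a real log, the regular expressions, and the `summarize` helper name are assumptions for illustration only.

```python
import re

# Assumed shapes, taken from the two fragments quoted in this thread.
UTIL_RE = re.compile(r"(\w+):\s+(\d+)/\s*(\d+)\s+(\d+)%")
FMAX_RE = re.compile(
    r"'([^']+)':\s+([\d.]+) MHz \((PASS|FAIL) at ([\d.]+) MHz\)"
)

def summarize(log_text):
    """Return (utilization, clocks) dictionaries extracted from a log.

    utilization maps a resource name to (used, total).
    clocks maps a clock name to (achieved_mhz, target_mhz, passed).
    """
    utilization = {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in UTIL_RE.finditer(log_text)
    }
    clocks = {
        m.group(1): (float(m.group(2)), float(m.group(4)), m.group(3) == "PASS")
        for m in FMAX_RE.finditer(log_text)
    }
    return utilization, clocks

sample = (
    "TRELLIS_SLICE: 16826/41820 40%\n"
    "'$glbnet$sdrio_clk': 65.02 MHz (PASS at 50.00 MHz)\n"
)
utilization, clocks = summarize(sample)
for name, (achieved, target, passed) in clocks.items():
    print(name, achieved, target, "PASS" if passed else "FAIL")
```

A passing timing report and moderate utilization, as seen here, suggest the failure is more likely functional than a timing or fitting problem, which is why simulation is the natural next step.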
The first thing to do when stuff is broken is to check whether it runs in the LiteX simulation with a similar config (https://github.com/litex-hub/linux-on-litex-vexriscv#running-the-litex-simulation).
Can you give it a try? This is the main factor which will decide how to debug it.
Hi @jluebbe :
Yes!!! You are right. The BIOS is not booting in the 2-CPU configuration. I'll check the P&R result.
BR, Sanada
Hi @Dolu1990 :
OK, I see. I'll run some tests in simulation with different CPU counts. Many thanks.
BR, Sanada
Hi @Dolu1990 :
The log file sim_cpux4.log is the output of "./sim.py --cpu-counts 4".
The log file sim_cpux1.log is the output of "./sim.py --cpu-counts 1".
The image is the prebuilt image: https://github.com/litex-hub/linux-on-litex-vexriscv/issues/164
In sim_cpux4.log, the boot process seems to stop at "[ 0.422129] Unpacking initramfs..."
BR, Sanada
@SanadaShinken
Thanks
How long did you run the x4 sim compared to the x1 sim? The x4 is significantly slower to simulate, and the unpack takes quite a long time:
[ 0.428819] Unpacking initramfs...
[ 3.431514] Freeing initrd memory: 8192K
Anyway, I will give it a try tomorrow on an Arty A7 35T with 2 cores.
Hi, @Dolu1990 :
Many Thanks.
About 3 hours of waiting for the 4-CPU simulation, and about 1 hour for the 1-CPU simulation.
BR, Sanada
After updating the LiteX tools, I just tried:
./make.py --cpu-count 2 --local-ip=192.168.0.159 --remote-ip=192.168.0.24 --build --load --board=arty
This worked fine on my Arty A7 35T.
Can you try on your board, especially with --cpu-count 2?
Hi @Dolu1990 :
The result is the same: it still fails, with no BIOS output.
BR, Sanada
After some help by @enjoy-digital on #litex I was able to build a working dual core image. The trick was to disable the l2_size: https://github.com/jluebbe/linux-on-litex-vexriscv/commit/dd2d33f7a9b688643ac1d57dc3c1fba8490aff90
@jluebbe > "l2_size"
Can you tell me more about this? I'm not aware of why there were some bad interactions.
I don't know the actual cause either. This change was proposed by @enjoy-digital in the IRC channel (florent). There are logs here: https://freenode.irclog.whitequark.org/litex/2021-05-04#29825181;
@Dolu1990: @rdolbeau and I thought about this because it was the main difference between Arty and ECPIX-5, but I haven't investigated more yet. I'll try to understand. If this is related to the Wishbone interface + cpu_count > 1, we should also be able to reproduce the behavior on other boards.
@SanadaShinken The Qmtech Wukong works fine in SMP for me (4 VexRiscv with extra instructions) - but I don't use the L2 cache and Wishbone interface, instead I use the native litedram, as mentioned by @enjoy-digital ; my codebase is from around mid-april. in make.py my board looks like this:
class qmtech_wukong(Board):
    SPIFLASH_PAGE_SIZE = 256
    SPIFLASH_SECTOR_SIZE = 64*kB
    SPIFLASH_DUMMY_CYCLES = 7
    soc_kwargs = {
        "sys_clk_freq": 100e6,
        "with_video_framebuffer": True,
        "video_timing": "800x600@60Hz",
        "ps2kbd": True
    }
    def __init__(self):
        from litex_boards.targets import qmtech_wukong
        Board.__init__(self, qmtech_wukong.BaseSoC, soc_capabilities={
            "serial",
            "ethernet",
            "sdcard",
            "leds",
            "icap_bitstream",
        }, bitstream_ext=".bit")
Hi @rdolbeau :
Many thanks for sharing your settings. Before April 1st, the 4 cores ran very smoothly. After April 1st, the 4 cores went down. BTW, is the system bus that @rdolbeau uses AXI-Lite or not?
Before April 1st, I could build 4 CPUs + 4 FPUs + AES instructions + video framebuffer on the QMTech Wukong board. The attached picture shows the P&R result from Vivado 2020.1.
BR, Sanada
@SanadaShinken Try without the L2 then, it should work fine.
I have no idea what the system bus is; it's whatever LiteX has by default with the configuration above. I think it is still Wishbone, but the DRAM is connected differently without the L2? @enjoy-digital will know better...
Hi @rdolbeau :
Many thanks. With the L2 size set to zero, everything works.
BTW, I found a very interesting point.
The DDR3 memory chip is an MT41K512M16HA, 1 GByte.
With the L2 size set to 8192, the memory speed is about:
--=============== SoC ==================--
CPU: VexRiscv SMP-LINUX @ 100MHz
BUS: WISHBONE 32-bit @ 4GiB
CSR: 32-bit data
ROM: 64KiB
SRAM: 8KiB
L2: 8KiB
SDRAM: 1048576KiB 16-bit @ 800MT/s (CL-6 CWL-5)
--========== Initialization ============--
Write speed: 31MiB/s
Read speed: 20MiB/s
With the L2 size set to zero, the memory speed is about:
--=============== SoC ==================--
CPU: VexRiscv SMP-LINUX @ 100MHz
BUS: WISHBONE 32-bit @ 4GiB
CSR: 32-bit data
ROM: 64KiB
SRAM: 8KiB
L2: 0KiB
SDRAM: 1048576KiB 16-bit @ 800MT/s (CL-6 CWL-5)
--========== Initialization ============--
Write speed: 31MiB/s
Read speed: 26MiB/s
The write operation does not seem to be affected by the L2 cache setting, but the read operation is. Any idea?
BR, Sanada
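To put the two BIOS measurements side by side, here is a small Python sketch that computes the relative change in read and write speed when the L2 is disabled. The numbers are the ones reported above; the `relative_change` helper is only an illustrative name.

```python
# BIOS memory speeds reported in this thread, in MiB/s.
speeds = {
    "l2_8KiB": {"write": 31, "read": 20},
    "l2_0KiB": {"write": 31, "read": 26},
}

def relative_change(old, new):
    """Fractional change going from old to new (0.1 means +10%)."""
    return (new - old) / old

for op in ("write", "read"):
    change = relative_change(speeds["l2_8KiB"][op], speeds["l2_0KiB"][op])
    print(f"{op}: {change:+.0%} without L2")
```

With these figures the write speed is unchanged, while the BIOS read test is about 30% faster without the L2.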
@SanadaShinken
The write operation does not seem to be affected by the L2 cache setting, but the read operation is. Any idea?
I guess that:
For write operations, VexRiscv does not need to wait for the completion of the write, so a cache miss in the L2 due to a write will only stall the CPU through back-pressure on the DBus cmd stream, which has quite some buffers on the way from the L1 to the L2 => no stall.
For reads, basically, I guess there is some penalty to refill the L2 and then the L1, instead of directly refilling the L1.
Hi @Dolu1990 :
Thank you for your explanation. I think you are right. It seems the cost of write and read operations is not the same: with or without the cache, the write speed is the same, while for reads the extra cost reduces the speed.
BTW, is this memory access speed normal or not? I'm wondering about this performance issue.
BR, Sanada
@SanadaShinken The BIOS speed measurement might not be representative of how much bandwidth you can use, not even from one core. On my Qmtech Wukong with four 100 MHz cores sharing a single FPU and using the STREAM benchmark (FORTRAN version with reduced memory footprint but otherwise unchanged), I get for 1/2/4 threads (somewhat abbreviated output):
The total memory requirement is 45 MB
Number of Threads = 1
Function Rate (MB/s) Avg time Min time Max time
Copy: 61.3242 0.5232 0.5218 0.5242
Scale: 43.6510 0.7349 0.7331 0.7367
Add: 56.2504 0.8550 0.8533 0.8562
Triad: 48.3942 0.9936 0.9919 0.9957
(...)
Number of Threads = 2
Function Rate (MB/s) Avg time Min time Max time
Copy: 85.1240 0.3769 0.3759 0.3786
Scale: 72.1662 0.4446 0.4434 0.4465
Add: 92.8279 0.5178 0.5171 0.5188
Triad: 86.7946 0.5547 0.5530 0.5559
(...)
Number of Threads = 4
Function Rate (MB/s) Avg time Min time Max time
Copy: 93.6368 0.3461 0.3417 0.3502
Scale: 89.0219 0.3616 0.3595 0.3649
Add: 110.2635 0.4375 0.4353 0.4415
Triad: 108.0181 0.4467 0.4444 0.4507
STREAM uses double-precision values, so each load/store addresses 8 bytes, which might be more efficient than what the BIOS does. Absolute numbers may vary depending on other parameters (I have 16 KiB caches and an expanded DTLB), but you very likely need more than one core to saturate the bus/interface - I'm not sure what the bottleneck(s) is/are.
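To make the scaling in these results easier to read, here is a minimal Python sketch that derives speedup, parallel efficiency and an approximate cycle cost per Triad element from the rates above. It assumes the usual STREAM accounting (Triad moves 3 doubles, i.e. 24 bytes, per element, and MB means 10^6 bytes) and the 100 MHz core clock mentioned above; the helper names are illustrative.

```python
# Triad rates from the output above, in MB/s, keyed by thread count.
triad = {1: 48.3942, 2: 86.7946, 4: 108.0181}

def scaling(rates):
    """Map thread count to (speedup vs 1 thread, parallel efficiency)."""
    base = rates[1]
    return {n: (r / base, r / base / n) for n, r in rates.items()}

def cycles_per_element(rate_mb_s, bytes_per_element=24, clk_hz=100e6):
    """Core cycles spent per Triad element at the given bandwidth.

    Assumes STREAM's convention of 24 bytes per Triad element
    (two 8-byte loads and one 8-byte store) and 1 MB = 1e6 bytes.
    """
    elements_per_second = rate_mb_s * 1e6 / bytes_per_element
    return clk_hz / elements_per_second

for threads, (speedup, efficiency) in scaling(triad).items():
    print(f"{threads} thread(s): speedup {speedup:.2f}, efficiency {efficiency:.2f}")
print(f"single thread: {cycles_per_element(triad[1]):.1f} cycles per element")
```

Under those assumptions, two threads give roughly a 1.79x speedup and four threads roughly 2.23x, and a single thread spends on the order of 50 cycles per element, which is consistent with the remark that one core alone does not saturate the memory interface.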
@SanadaShinken
BTW, is this memory access speed normal or not? I'm wondering about this performance issue.
We would need a benchmark which unrolls the read/write loop. I guess this would make a significant difference.
Hi @rdolbeau :
Thank you very much for sharing such a valuable benchmark. This could be the basis for testing the SW code and the HW unit. Very useful!!
BR, Sanada
Hi @Dolu1990 :
It's very kind of you to point out the loop unrolling. If I have some time, I'll do some tests and share them in this project.
BR, Sanada
@Dolu1990 @SanadaShinken Adding -funroll-loops when compiling STREAM with gfortran does offer some benefits, diminishing with the number of threads: for one thread up to 10-12%, for 4 threads it's barely noticeable. At some point the shared FPU might become a bottleneck, as it's doing most of the work. For Triad the inner loop with no unrolling is:
10e7e: 221c fld fa5,0(a2)
10e80: 2198 fld fa4,0(a1)
10e82: 06a1 addi a3,a3,8
10e84: 0785 addi a5,a5,1
10e86: 72f6f7c3 fmadd.d fa5,fa3,fa5,fa4
10e8a: 05a1 addi a1,a1,8
10e8c: 0621 addi a2,a2,8
10e8e: fef6bc27 fsd fa5,-8(a3)
10e92: fee796e3 bne a5,a4,10e7e <MAIN__._omp_fn.7+0x48>
4 out of 9 instructions have to be handled by the shared FPU including the load/stores. Not sure using 2 FPUs would fit in my FPGA, but it might improve throughput for the 4-threads case.
EDIT: 2 FPUs fits, and they do offer some benefits for STREAM, Triad goes all the way up to 114 MB/s (unrolled or not).
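The "4 out of 9" figure can be reproduced by counting mnemonics in the non-unrolled loop above. The Python sketch below does that; the set of FPU-handled mnemonics simply follows the statement that loads and stores of floating-point registers go through the FPU, and the helper name is illustrative.

```python
# Mnemonics of the non-unrolled Triad loop listed above, in order.
LOOP = ["fld", "fld", "addi", "addi", "fmadd.d", "addi", "addi", "fsd", "bne"]

# Instructions handled by the (shared) FPU, per the comment above:
# floating-point loads/stores as well as the arithmetic itself.
FPU_OPS = {"fld", "fsd", "fmadd.d"}

def fpu_share(mnemonics):
    """Return (fpu_instruction_count, total_instruction_count)."""
    fpu = sum(1 for m in mnemonics if m in FPU_OPS)
    return fpu, len(mnemonics)

fpu, total = fpu_share(LOOP)
print(f"{fpu} of {total} instructions go through the FPU")
```

The same helper can be applied to any other listing in this thread by replacing `LOOP` with its mnemonics.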
For reference, with -funroll-loops it becomes:
11612: 0000b087 fld ft1,0(ra)
11616: 0005b007 fld ft0,0(a1)
1161a: 04070713 addi a4,a4,64
1161e: 04008093 addi ra,ra,64
11622: 0a07f143 fmadd.d ft2,fa5,ft0,ft1
11626: 04058593 addi a1,a1,64
1162a: 0321 addi t1,t1,8
1162c: fc273027 fsd ft2,-64(a4)
11630: fc85b187 fld ft3,-56(a1)
11634: fc80b207 fld ft4,-56(ra)
11638: 2237f2c3 fmadd.d ft5,fa5,ft3,ft4
1163c: fc573427 fsd ft5,-56(a4)
11640: fd05b307 fld ft6,-48(a1)
11644: fd00b387 fld ft7,-48(ra)
11648: 3a67f543 fmadd.d fa0,fa5,ft6,ft7
1164c: fca73827 fsd fa0,-48(a4)
11650: fd85b587 fld fa1,-40(a1)
11654: fd80b607 fld fa2,-40(ra)
11658: 62b7f843 fmadd.d fa6,fa5,fa1,fa2
1165c: fd073c27 fsd fa6,-40(a4)
11660: fe05b887 fld fa7,-32(a1)
11664: fe00be07 fld ft8,-32(ra)
11668: e317fec3 fmadd.d ft9,fa5,fa7,ft8
1166c: ffd73027 fsd ft9,-32(a4)
11670: fe85bf07 fld ft10,-24(a1)
11674: fe80bf87 fld ft11,-24(ra)
11678: fbe7f743 fmadd.d fa4,fa5,ft10,ft11
1167c: fee73427 fsd fa4,-24(a4)
11680: ff05b007 fld ft0,-16(a1)
11684: ff00b687 fld fa3,-16(ra)
11688: 6a07f0c3 fmadd.d ft1,fa5,ft0,fa3
1168c: fe173827 fsd ft1,-16(a4)
11690: ff85b107 fld ft2,-8(a1)
11694: ff80b187 fld ft3,-8(ra)
11698: 1a27f243 fmadd.d ft4,fa5,ft2,ft3
1169c: fe473c27 fsd ft4,-8(a4)
116a0: f7e319e3 bne t1,t5,11612 <MAIN__._omp_fn.7+0x134>
Obviously, on a one-instruction-per-cycle in-order core, GCC's default unrolling isn't great. Replacing the Triad assembly loop with a properly pipelined one (address computation & loop bound checking can probably be optimized more):
fld ft0,0(a1)
fld ft1,0(ra)
addi a1,a1,64
addi ra,ra,64
fld ft2,-56(a1)
fld ft3,-56(ra)
addi t1,t1,8
fld ft4,-48(a1)
fld ft5,-48(ra)
addi a4,a4,64
fld ft6,-40(a1)
fld ft7,-40(ra)
fld fa0,-32(a1)
fld fa1,-32(ra)
fld fa2,-24(a1)
fld fa3,-24(ra)
fld fa4,-16(a1)
fld ft9,-16(ra)
fld fa6,-8(a1)
fld fa7,-8(ra)
fmadd.d ft1,fa5,ft0,ft1
fmadd.d ft3,fa5,ft2,ft3
fmadd.d ft5,fa5,ft4,ft5
fmadd.d ft7,fa5,ft6,ft7
fmadd.d fa1,fa5,fa0,fa1
fmadd.d fa3,fa5,fa2,fa3
fmadd.d ft9,fa5,fa4,ft9
fmadd.d fa7,fa5,fa6,fa7
fsd ft1,-64(a4)
fsd ft3,-56(a4)
fsd ft5,-48(a4)
fsd ft7,-40(a4)
fsd fa1,-32(a4)
fsd fa3,-24(a4)
fsd ft9,-16(a4)
fsd fa7,-8(a4)
bne t1,t5,.L209
I get 122 MB/s for one thread and 192 MB/s for two (on cores with non-shared FPUs), despite a G++ instance running on another core (which is also why I don't have a 4-thread result; the SoC is a bit busy).
Now we need a proper machine description in GCC for those long-latency instructions :-)
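To illustrate why grouping the loads, the multiply-adds and the stores helps on a single-issue in-order core, here is a toy Python model. It is only a sketch: the latencies are hypothetical values chosen for illustration (not VexRiscv's real ones), address updates and the branch are ignored, and the model simply stalls an instruction until its source registers are ready.

```python
# Hypothetical result latencies in cycles, NOT measured on VexRiscv.
LATENCY = {"fld": 4, "fmadd": 8, "fsd": 1}

def cycles(program, latency=LATENCY):
    """Cycles needed to issue a program on a toy single-issue in-order core.

    Each instruction is (op, dst, srcs). An instruction issues at the
    earliest cycle at which all its sources are ready, and no earlier
    than one cycle after the previous instruction issued.
    """
    ready = {}   # register name -> cycle at which its value is available
    cycle = 0    # earliest cycle at which the next instruction may issue
    for op, dst, srcs in program:
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        if dst is not None:
            ready[dst] = start + latency[op]
        cycle = start + 1
    return cycle

def interleaved(n):
    """Load, load, fmadd, store for each element in turn (GCC-style unroll)."""
    prog = []
    for i in range(n):
        prog += [("fld", f"a{i}", []), ("fld", f"b{i}", []),
                 ("fmadd", f"r{i}", [f"a{i}", f"b{i}"]),
                 ("fsd", None, [f"r{i}"])]
    return prog

def grouped(n):
    """All loads first, then all fmadds, then all stores (pipelined loop)."""
    loads = [ins for i in range(n)
             for ins in (("fld", f"a{i}", []), ("fld", f"b{i}", []))]
    fmas = [("fmadd", f"r{i}", [f"a{i}", f"b{i}"]) for i in range(n)]
    stores = [("fsd", None, [f"r{i}"]) for i in range(n)]
    return loads + fmas + stores

print("interleaved:", cycles(interleaved(8)), "cycles")
print("grouped:    ", cycles(grouped(8)), "cycles")
```

In this model the grouped ordering hides the load and multiply-add latencies behind independent instructions, while the interleaved ordering stalls on every dependency; the real gain depends on the actual latencies and on memory behavior.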
Hi, @rdolbeau :
This is very detailed. Once my XC7A200T board is ready, I want to repeat your test on it. BTW, do the Linux kernel config and the toolchain both enable FPU support?
BR, Sanada
@Dolu1990 I seem to have stumbled upon a bug, maybe hardware, when pushing things. When running 4 threads, the optimized STREAM would produce wrong results and fail validation. After triple-checking my code, I started playing with the original and realized it would occasionally fail validation as well.
I also crashed the system once:
[57663.021112] Unable to handle kernel access to user memory without uaccess routines at virtual address 00000004
[57663.024133] Oops [#1]
[57663.024576] CPU: 0 PID: 4713 Comm: stream_unrolled Not tainted 5.12.0-171771-gc8543a1ea224-dirty #3
[57663.026866] epc : pick_next_task_fair+0x13e/0x316
[57663.028116] ra : pick_next_task_fair+0x108/0x316
[57663.029320] epc : c00302ac ra : c0030276 sp : c28f7ee0
[57663.030591] gp : c0602de8 tp : c1501a00 t0 : 02e7a225
[57663.031826] t1 : 00000001 t2 : 0016e360 s0 : c28f7f20
[57663.033205] s1 : cfdd6080 a0 : c08b4300 a1 : 6f6c626b
[57663.034616] a2 : 00000000 a3 : 00000000 a4 : ffffffff
[57663.036044] a5 : c0877254 a6 : 00000016 a7 : 00000000
[57663.037437] s2 : c0877240 s3 : cfdd60c0 s4 : 000002f0
[57663.038797] s5 : aafd0498 s6 : c0877240 s7 : c1501d8c
[57663.040133] s8 : 00000000 s9 : 35250420 s10: 34b27460
[57663.041590] s11: 352508e0 t3 : 0000036c t4 : 02d7b621
[57663.042971] t5 : 02e23612 t6 : c057c1d0
[57663.043812] status: 00000100 badaddr: 00000004 cause: 0000000f
[57663.045505] Call Trace:
[57663.045972] [<c00302ac>] pick_next_task_fair+0x13e/0x316
[57663.047068] [<c04521fe>] __schedule+0xb6/0x404
[57663.048152] [<c0452582>] schedule+0x36/0xa0
[57663.049248] [<c000203e>] ret_from_exception+0x0/0xc
After the reboot one of my first attempts led to this on the console (but not a kernel crash):
buildroot login: [ 74.324988] BUG: scheduling while atomic: stream_unrolled/177/0xffff0000
[ 74.326656] CPU: 0 PID: 177 Comm: stream_unrolled Not tainted 5.12.0-171771-gc8543a1ea224-dirty #3
[ 74.328846] Call Trace:
[ 74.329325] [<c00033f6>] walk_stackframe+0x0/0xca
[ 74.330360] [<c044d570>] dump_backtrace+0x38/0x46
[ 74.331361] [<c044d58c>] show_stack+0xe/0x16
[ 74.332264] [<c04514ce>] dump_stack+0x6c/0x8a
[ 74.333265] [<c00257b6>] __schedule_bug+0x56/0x66
[ 74.334272] [<c04524ba>] __schedule+0x372/0x404
[ 74.335411] [<c0452582>] schedule+0x36/0xa0
[ 74.336480] [<c000203e>] ret_from_exception+0x0/0xc
As it's parallelized using OpenMP, it relies on atomic instructions for various synchronizations in libgomp (mostly amoswap and amoadd). The final reduction can get messed up if synchronization fails (it also relies on cache coherency to work properly so that results from one core can be read by another).
With fewer than 4 threads, it seems to always work, but the work distribution is different and might hide/mitigate the issue.
Could there be some unreliability of the atomics when I push the caches/memory to the limit like this ? (I get about 220 MB/s from the Triad).
EDIT: after another crash & reboot, the optimized version now validates every time despite it being the exact same binary... weird. Maybe something that was corrupted in the kernel leading to non-working synchronizations? Maybe adding the PmpPlugin would help.
This is very detailed. Once my XC7A200T board is ready, I want to repeat your test on it. BTW, do the Linux kernel config and the toolchain both enable FPU support?
In the current linux-on-litex-vexriscv, no, you need to enable things by hand mostly - the default configs assume you're running without FPU (or compressed instructions).
You need to enable the FPU in the bitstream with --with-fpu (and optionally set the number of FPUs with --cpu-per-fpu, as they are shared). This should change other parameters to make sure all requirements are met (bus widths, ...).
Then edit the buildroot config from the repo (buildroot/configs/litex_vexriscv_defconfig), where you enable BR2_RISCV_ISA_CUSTOM_RVF and BR2_RISCV_ISA_CUSTOM_RVD (and optionally BR2_RISCV_ISA_CUSTOM_RVC for compressed instructions), and use BR2_RISCV_ABI_ILP32D=y (and BR2_RISCV_ABI_ILP32=n) so you get the hard-float ABI instead of the soft-float one. You probably also need a more capable cross-compiler (I use a native one but that's another story), by enabling e.g.:
BR2_GCC_VERSION_10_X=y
BR2_GCC_ENABLE_OPENMP=y
BR2_INSTALL_LIBSTDCPP=y
BR2_TOOLCHAIN_BUILDROOT_CXX=y
BR2_TOOLCHAIN_BUILDROOT_FORTRAN=y
Then enable the FPU in the Linux kernel (buildroot/board/litex_vexriscv/linux.config); the first line depends on whether you selected C in buildroot:
CONFIG_RISCV_ISA_C=y
CONFIG_FPU=y
CONFIG_SMP=y
Then rebuild the buildroot from scratch.
You probably need a matching OpenSBI as well, so I suggest to recompile it as well.
I think that should do it and enable you to run hard-float binaries on the SoC; just make sure the DTB/OpenSBI/kernel Image/buildroot all matches.
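Since several config files have to agree, a quick check can help before a long rebuild. The Python sketch below scans config text for symbols set to `=y` and reports which of the ones named above are missing. The symbol lists come from this comment; the helper names and the idea of checking the files this way are assumptions for illustration, not part of the repository.

```python
# Symbols named in the steps above (hard-float buildroot and kernel setup).
REQUIRED_BUILDROOT = [
    "BR2_RISCV_ISA_CUSTOM_RVF",
    "BR2_RISCV_ISA_CUSTOM_RVD",
    "BR2_RISCV_ABI_ILP32D",
]
REQUIRED_KERNEL = ["CONFIG_FPU", "CONFIG_SMP"]

def enabled_symbols(config_text):
    """Return the set of symbols set to 'y' in a Kconfig-style text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments such as '# FOO is not set'
        key, value = line.split("=", 1)
        if value.strip() == "y":
            enabled.add(key.strip())
    return enabled

def missing(config_text, required):
    """List the required symbols that are not enabled in config_text."""
    enabled = enabled_symbols(config_text)
    return [symbol for symbol in required if symbol not in enabled]

example = """
BR2_RISCV_ISA_CUSTOM_RVF=y
# BR2_RISCV_ISA_CUSTOM_RVD is not set
BR2_RISCV_ABI_ILP32D=y
"""
print(missing(example, REQUIRED_BUILDROOT))
```

In practice the text would be read from the two files mentioned above (for example with `open(path).read()`), once for the buildroot defconfig and once for the kernel config.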
@rdolbeau
I seem to have stumbled upon a bug, maybe hardware
That's scary ^^
Could there be some unreliability of the atomics when I push the caches/memory to the limit like this ?
This could be a hardware bug. Here is, for example, a bug we already had in the past (and fixed, unless it isn't totally fixed):
1) Each CPU needs to know when the writes it has issued are visible to the other CPUs.
2) litedram does not provide an "active feedback" when a write is done; so far we assume that when the write request leaves the litedram slave interface it is visible (is it really?).
This kind of stuff can lead to memory fences not being properly applied, leading to memory consistency issues, especially when things get quite busy.
So far I'm not aware of any issues, but you may have triggered one.
Then the questions are :
Also a few things about this kind of cases to help debugging :
Maybe we should write some self-testing code heavily testing atomics in the Linux user space. That's not something I have much of so far XD
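As a sketch of what the structure of such a self-checking test could look like, here is a small Python example: several threads increment a shared counter under a lock, and the final value is compared with the expected total. This is only an illustration of the self-checking pattern. Python's interpreter lock and its lock implementation mean it does not exercise amoswap/amoadd directly the way libgomp does; a real test would more likely be written in C with atomic builtins. All names are hypothetical.

```python
import threading

def stress(num_threads=4, increments=10_000):
    """Run a lock-protected counter stress test.

    Returns (observed, expected). A mismatch would indicate that the
    synchronization failed somewhere below this code.
    """
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, num_threads * increments

observed, expected = stress()
print("OK" if observed == expected else f"MISMATCH: {observed} != {expected}")
```

Running such a test in a loop for a long time, with as many threads as cores, is the kind of usage that could help tell a transient software corruption from a reproducible synchronization failure.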
What is BUG: scheduling while atomic ?
@Dolu1990 As I mentioned in an edit, after a reboot the issue has disappeared despite the fact it was there during the two previous boots (unchanged FPGA configuration/bitstream). I have been able to run the benchmark multiple times with no crash or internal failures since.
The only "issue" was that performance on optimized Triad quad-threads would drop from ~220 MB/s to ~173 MB/s at some minor changes in OpenMP configuration (i.e. simply setting the variable that enables showing all OpenMP variables for debug, which doesn't affect execution in any way, triggered the drop). However, the OpenMP runtime in GCC is very new on RV32, so that's probably just teething issues (I tried recompiling the LLVM runtime but it doesn't support RV32). That's a purely software concern.
I can't tell for the first boot where I saw the issue, but the second one had an early error with the kernel error referenced above (BUG: scheduling while atomic), which might have been the cause of all the subsequent issues rather than a symptom - OpenMP relies on kernel locks for some stuff; if those are corrupted somehow it can cause such weird transient failures.
Would having the PMPs enabled help protect the kernel from rogue accesses to memory? (I don't see PmpPlugin as an option for the LiteX SoC.) I don't even know if the PMPs would be used by the current Linux kernel...
Sorry, I've yet to try the Saxon SoC so I can't tell.
As of now I wouldn't classify it as a hardware bug; it could easily be a software bug that corrupted the kernel somehow (HW bugs don't disappear between reboots, kernel corruption usually does). If it comes back I'll try to characterize it, but it's tough as I can't get GDB to be useful (I didn't manage to enable enough breakpoint support to get the 'watch' command to work).
Edit: forgot to say: BUG: scheduling while atomic is a kernel message and I've no idea of the cause...
The only "issue" was that performance on optimized Triad quad-threads would drop from ~220 MB/s to ~173 MB at some minor
To me, it can be explained by some bad luck with the alignment of instructions/data ending up in more cache thrashing. VexRiscv is quite subject to that, as it has a low number of ways.
OpenMP rely on kernel locks for some stuff
<3
Would having the PMPs enabled helps protect the kernel from rogue accesses to memory ? (I don't see PmpPlugin has an option for the Litex SoC). I don't even know if the PMPs would be used by current Linux kernel...
The only thing the PMP would protect in our case is the OpenSBI code/data from being accessed by the kernel. So the use case of the PMP for us is very, very limited (it can be useful in other cases / security / multiple supervisors running).
Linux will never configure the PMP, that's something done at the machine mode level in opensbi as far as i have seen.
@rdolbeau
Sorry, I've yet to try the Saxon SoC so I can't tell.
No worries ^^ It's just a good thing to have another platform to cross-check whether bugs persist. Still, it's quite a lot of work to set up both flows and cross-check. That's just in case one day we have something really bad happening.
HW bug don't disappear between reboots
Yes they can, if your karma is too bad ^^ So far, for me, the worst hardware bug randomly took up to 1 hour to appear, so this kind of thing won't necessarily appear in a recurrent manner. But right, normally they take 30 seconds at most.
If it comes back I'll try to characterize it but it's tough as I can't get GDB to be useful
Having a software GDB (no JTAG involved)? I would like to try that once. I made some changes recently in the DebugPlugin to allow this to work properly, but it needs ebreak support enabled in the CsrPlugin config; did you activate it?
@rdolbeau I don't know if that is something interesting for you, but I implemented a USB host OHCI controller. It works well on Linux, and X11 is also running well:
It isn't ported on litex so far, but it is in the plan :)
forgot to say: BUG: scheduling while atomic is a kernel message and I've no idea of the cause...
This is reported by the kernel if schedule() is called from atomic context, which is either a kernel bug or caused by corruption. Atomic context is, for example, an IRQ handler or a section where interrupts/preemption are disabled.
@jluebbe Thanks, it pushes toward the idea of kernel corruption rather than hardware bug.
@Dolu1990 Yes, software GDB. I tried adding ebreak to the CSR plugin and some hardware breakpoints, but 'watch' still didn't work. I may easily have messed it up, I don't really understand how it all works. Also GDB may be buggy; the version compiled by Buildroot has some weird warnings but can do some things, while the version I've recompiled from the B-enabled sources is useless (it won't even initialize properly).
Also, nice X11 :-) I've got it working as well (with WindowMaker rather than TWM :-) ) on the LiteX FB, using my PS/2 controller, but I only have a keyboard and no mouse (yet, an ex-colleague has one in his attic...). USB host would be great, but doesn't that require a PHY of some sort? Could it work with any FPGA board? (The main reason to use PS/2 was that it's a fairly simple 2-pin protocol and there's a Pmod with the right connector.)
@rdolbeau
WindowMaker
Hooo i will have to try that :D I was looking at packages to try those last days XD
USB host would be great, but doesn't that requires a PHY of some sort
If you stay at USB 1.1 (1.5 + 12 Mbps) you only really need 2 outputs of the FPGA + 2. No PHY involved. I made this PMOD to get 4 ports: https://github.com/Dolu1990/pmod_usb_host_x4/blob/main/pmods.pdf
It adds a current limitation + clamping diodes, but both are not sooooo important ^^
Could it work with any FPGA board
Yes, as far as i can see. Also, the CPU usage overhead is pretty low.
main reason to use PS/2 was, it's a fairly simple 2-pins protocol and there's a Pmod with the right connector
Right, for USB host, I guess there is no PMOD that you can order commercially, as this is too specific.
@Dolu1990 Do you plan to make a batch of those Pmods? Because now I want one :-) USB 1.1 is vastly fast enough for keyboard/mouse and probably even some basic USB storage. And with a powered USB hub, you don't even have to worry about the power supply of the board. It's way better than PS/2 :-)
Hi, @rdolbeau , @Dolu1990:
For USB host on my FPGA board, I use a USB3300 (ULPI interface) for USB 2.0 and an FT602Q for USB 3.0. Is this design right or not?
BTW, for @Dolu1990's 4-port USB PMOD, is the OHCI controller implemented in Migen, or does it integrate a Verilog code base?
BR, Sanada
The only "issue" was that performance on optimized Triad quad-threads would drop from ~220 MB/s to ~173 MB at some minor
To me, it can be explained by some bad luck with alignement of instruction/data ending up into more cash trashing. VexRiscv is quite subject to that, as it has a low number of way.
OpenMP rely on kernel locks for some stuff
<3
Would having the PMPs enabled helps protect the kernel from rogue accesses to memory ? (I don't see PmpPlugin has an option for the Litex SoC). I don't even know if the PMPs would be used by current Linux kernel...
The only thing the PMP would protect in our cases, is avoiding opensbi code/data being accessed by the kernel. So the usecase of the PMP for us is very very limited (can be usefull in others cases / security / multiple supervisor running)
Linux will never configure the PMP, that's something done at the machine mode level in opensbi as far as i have seen.
@rdolbeau
Sorry, I've yet to try the Saxon SoC so I can't tell.
No worries ^^ That's just a good thing to have another platform to cross check if bugs perssists. Still quite a lot of work to setup both flow and cross check. That's just in case of one day we have something realy bad happening.
HW bug don't disappear between reboots
Yes they can if your karma is too bad ^^ So far, to me, the worst hardware bug was taking randomly up to 1 hour to appears, so this kind of things won't necessarly appear in a recurent manner. But right, normaly, they take maximum 30 seconds.
If it comes back I'll try to characterize it but it's tough as I can't get GDB to be useful
A software GDB (no JTAG involved)? I would like to try that once. I recently made some changes in the DebugPlugin to allow this to work properly, but it needs ebreak support enabled in the CsrPlugin config. Did you activate it?
@rdolbeau I don't know if that is something interesting for you, but I implemented a USB host OHCI controller. It works well on Linux, and X11 is also running well.
It isn't ported to LiteX so far, but it is planned :)
Hi, @Dolu1990 :
For X window on ARTY, would you like to share the rootfs or provide some tutorial to show how to make the GUI work?
BR, Sanada
It is very detailed. Once my XC7A200T board is ready, I want to repeat your test on my board. BTW, do the Linux kernel config and toolchain both enable FPU support?
In the current linux-on-litex-vexriscv, no: you mostly need to enable things by hand, as the default configs assume you're running without an FPU (or compressed instructions).
You need to enable the FPU in the bitstream with --with-fpu (and optionally the number of FPUs with --cpu-per-fpu, as they are shared). This should change other parameters to make sure all requirements are met (bus widths, ...).
Then, in the buildroot config from the repo (buildroot/configs/litex_vexriscv_defconfig), enable BR2_RISCV_ISA_CUSTOM_RVF and BR2_RISCV_ISA_CUSTOM_RVD (and optionally BR2_RISCV_ISA_CUSTOM_RVC for compressed instructions), and use BR2_RISCV_ABI_ILP32D=y (and BR2_RISCV_ABI_ILP32=n) so you get the hard-float ABI instead of the soft-float one. You probably also need a more capable cross-compiler (I use a native one, but that's another story) by enabling e.g.:
BR2_GCC_VERSION_10_X=y
BR2_GCC_ENABLE_OPENMP=y
BR2_INSTALL_LIBSTDCPP=y
BR2_TOOLCHAIN_BUILDROOT_CXX=y
BR2_TOOLCHAIN_BUILDROOT_FORTRAN=y
Then enable the FPU in the Linux kernel config (buildroot/board/litex_vexriscv/linux.config); the first line depends on whether you selected C in buildroot:
CONFIG_RISCV_ISA_C=y
CONFIG_FPU=y
CONFIG_SMP=y
Then rebuild the buildroot from scratch.
You probably need a matching OpenSBI too, so I suggest recompiling it as well.
I think that should do it and enable you to run hard-float binaries on the SoC; just make sure the DTB/OpenSBI/kernel Image/buildroot all match.
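As a quick sanity check before rebuilding, something like the following could confirm that the options listed above are actually set in both files. This is only a sketch: the helper name check_opts is made up for illustration and is not part of linux-on-litex-vexriscv, and the paths in the example calls are the ones mentioned above, assumed to be relative to your checkout.

```shell
# Hypothetical helper: report which "OPTION=y" lines are missing from a
# Kconfig-style file. Returns non-zero if at least one is missing.
# Usage: check_opts FILE OPTION...
check_opts() {
    file=$1
    shift
    missing=0
    for opt in "$@"; do
        # -x matches whole lines only, so "# OPT is not set" does not count.
        if ! grep -qx "${opt}=y" "$file"; then
            echo "missing: ${opt}=y in ${file}"
            missing=1
        fi
    done
    return "$missing"
}

# Example calls (adjust the paths to your checkout):
#   check_opts buildroot/configs/litex_vexriscv_defconfig \
#       BR2_RISCV_ISA_CUSTOM_RVF BR2_RISCV_ISA_CUSTOM_RVD BR2_RISCV_ABI_ILP32D
#   check_opts buildroot/board/litex_vexriscv/linux.config CONFIG_FPU CONFIG_SMP
```

It only checks the config files, not what ended up in the built images, so a from-scratch rebuild is still needed afterwards.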
Hi, @rdolbeau :
Thank you!! The bitstream, Linux kernel, rootfs and toolchain are all FPU-enabled, but I have to check that the settings are the same as the ones you mentioned. As for OpenSBI, I didn't rebuild it with the FPU enabled. I think following your suggestion and rebuilding it is the safe option.
BR, Sanada
@rdolbeau
Do you plan to make a batch of those Pmods?
So my plan with that Pmod is to have a proper way to test the USB stuff. I didn't really plan to produce batches of it, as I'm not really great at logistics XD. Have you already ordered PCBs / parts and assembled them in the past?
even some basic USB storage.
So far, here is my compatibility list:
WindowMaker
I didn't find it as a buildroot package, so I assume you built it on the side, right?
I'm quite a noob when it comes to software build flows, so I tried:
git clone https://github.com/window-maker/wmaker
cd wmaker
export PATH=PATH_TO_GCC_TOOLCHAIN:$PATH
./configure --host=riscv32-buildroot-linux-gnu --prefix=PATH_TO_ROOT/usr/local
make
make install
But "make install" ends up with:
/media/data/open/SaxonSoc/artyA7SmpUsb/buildroot-build/host/bin/../lib/gcc/riscv32-buildroot-linux-gnu/10.2.0/../../../../riscv32-buildroot-linux-gnu/bin/ld: skipping incompatible /lib32/libc.so.6 when searching for /lib32/libc.so.6
/media/data/open/SaxonSoc/artyA7SmpUsb/buildroot-build/host/bin/../lib/gcc/riscv32-buildroot-linux-gnu/10.2.0/../../../../riscv32-buildroot-linux-gnu/bin/ld: cannot find /lib32/libc.so.6
/media/data/open/SaxonSoc/artyA7SmpUsb/buildroot-build/host/bin/../lib/gcc/riscv32-buildroot-linux-gnu/10.2.0/../../../../riscv32-buildroot-linux-gnu/bin/ld: cannot find /usr/lib/libc_nonshared.a
/media/data/open/SaxonSoc/artyA7SmpUsb/buildroot-build/host/bin/../lib/gcc/riscv32-buildroot-linux-gnu/10.2.0/../../../../riscv32-buildroot-linux-gnu/bin/ld: cannot find /lib/ld-linux-riscv32-ilp32d.so.1
collect2: error: ld returned 1 exit status
libtool: error: error: relink 'libWINGs.la' with the above command before installing it
Did you have a similar issue? It seems like it tries to link against the libraries of my own PC instead of the target's XD I will try to do a proper buildroot package XD
@SanadaShinken
For USB host on my FPGA board, I use a USB3300 (ULPI interface) for USB 2.0 and an FT602Q for USB 3.0. Is this design right or not?
So, currently, I skipped some design phases to get the USB running without any PHY. The design I did does not support an external PHY yet, but this should be possible in the future.
For X window on ARTY, would you like to share the rootfs or provide some tutorial to show how to make the GUI work?
Yes, my end goal is to document that. But so far I'm trying to stabilise the config. I guess having a proper doc explaining / pointing to all the different configs and their interactions would be useful, instead of swimming in an ocean of unknown stuff XD
@SanadaShinken
would you like to share the rootfs
I can do that too, just give me a few days to stabilise things ^^
Have you already ordered PCBs / parts and assembled them in the past?
PCB yes, I made myself an adapter to plug an FPGA in the SBus slot of a 90s SPARCstation - but it was a fairly simple design. Then I used SeeedStudio PCBA, they assembled/soldered all of it as I couldn't solder properly to save my life even back when my eyes were still working properly.
I just added some part numbers in KiCad (no idea what the ferrite should be or if the big capacitors are ceramic or tantalum or ... but for estimating the price it doesn't matter), and Seeed says about 280+ euros for 5 boards... assembly is expensive :-(
I didn't find it as a buildroot package, so I assume you built it on the side, right?
Yes, and not just it :-) I self-hosted almost all the dependencies (thanks https://www.linuxfromscratch.org !), including Perl, Python, cmake, Xorg, ... using a B-enabled compiler (cross-compiled, as is the kernel itself). Roughly 1% of all instructions in binaries and libraries are from B; sh1add/sh2add/sh3add (they do a+(b<<n) with n=1,2,3, very useful for address computations; together they form Zba, a subset of B) see a lot of use.
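To make the a+(b<<n) description concrete, here is a minimal model of the operation in shell arithmetic, wrapped to 32 bits as on an RV32 core. The function name shnadd and its argument order are illustrative only; this models the arithmetic, not the instruction encoding or operand order.

```shell
# Model of shNadd as described above: a + (b << n), truncated to 32 bits.
# Usage: shnadd N A B   (N is 1, 2 or 3)
shnadd() {
    echo $(( ($2 + ($3 << $1)) & 0xFFFFFFFF ))
}

# Address of element 5 in an array of 4-byte words starting at 4096 (0x1000):
shnadd 2 4096 5    # prints 4116
```

Without the instruction, the same address computation takes a shift followed by an add, which is why it shows up so often in compiled code.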
I'm quite a noob when it comes to software build flows, so I tried: (...) Did you have a similar issue?
You need a cross-environment with the proper libraries, which is always complicated. That's why buildroot or yocto exist: to hide the mess... I avoid cross-compiling as much as I can, except for bootstrapping a native environment as there's no other option.
So I did not have the issue because I cheated: Litex/VexRiscv compiled its own stuff :-) That's why I've filled the FPGA with cores.
My compiler was cross-compiled using the Buildroot cross-compiler with
../gcc-10.2.0/configure --prefix=/usr/local --host=riscv32-buildroot-linux-gnu --target=riscv32-buildroot-linux-gnu --with-build-sysroot=/mnt --disable-multilib --enable-languages=c,c++,fortran
(the micro-SD card root was mounted on /mnt). Then the binutils in a similar way. From there it just took the poor SoC a lot (and lot and lot) of time. And a small 20x20x10mm heatsink on top of the FPGA just in case :-) Quite a few packages were already in the buildroot, but most of them were recompiled anyway with the 'proper' B-enabled compiler.
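For the wmaker "make install" failure quoted earlier in the thread, one thing that might be worth trying is to configure with the prefix the program will have on the target and stage the files with DESTDIR, instead of putting the rootfs path into --prefix. This is an untested assumption, not something confirmed in this thread: the idea is that libtool's relink step then no longer searches a rootfs copy outside the toolchain sysroot. The names run, cross_install and DRY_RUN below are made up for the sketch, and it assumes the cross toolchain is already on PATH and ./configure exists.

```shell
# Sketch only. Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: cross_install HOST_TRIPLET STAGING_DIR
cross_install() {
    host=$1
    staging=$2
    # --prefix is the path as seen on the target, not on the build machine.
    run ./configure --host="$host" --prefix=/usr/local || return 1
    run make || return 1
    # DESTDIR prepends the staging root at install time only.
    run make install DESTDIR="$staging"
}

# Example: DRY_RUN=1 cross_install riscv32-buildroot-linux-gnu PATH_TO_ROOT
```

DESTDIR is the standard Automake staging mechanism, so the installed files land under STAGING_DIR/usr/local while the binaries still believe they live in /usr/local. A proper buildroot package does essentially the same thing behind the scenes.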
@rdolbeau
Here is what I ordered as components from Mouser (it does not include resistors above 27 ohm). The quantities are over-provisioned. I have enough material in stock to mount a second board and ship it hidden in a chocolate box. usbx4order.txt
The only "down-side" of the current design is that you have to provide 5V to the Pmod via the 4-pin connector.
and Seeed says about 280+ euros for 5 boards... assembly is expensive :-(
I never tried external assembly, so I really had no idea of the cost.
Ahh thanks, I didn't know those resources :D
So I did not have the issue because I cheated: Litex/VexRiscv compiled its own stuff :-) That's why I've filled the FPGA with cores.
OMG, I had a good laugh XD I will try to push stuff as a buildroot package ^^
My compiler was cross-compiled using the Buildroot cross-compiler with
That's something I also wanted to try!
@rdolbeau Here is what I ordered as components from Mouser
Thanks for the BOM. I'm guessing the long 26p header and the 47uF capacitor are for something else? (They don't appear on the schematic.)
The only "down-side" of the current design is that you have to provide 5V to the Pmod via the 4-pin connector.
So does the PS/2 mod, not sure many keyboards would work with just 3.3V. Fortunately the Wukong exposes the 5V input on one of the headers.
I never tried external assembly, so I really had no idea of the cost.
For very small volume of small product it's not worth it - even naked PCBs.
For this particular Pmod, PCB is 34.9€ for 5 or 10 (!), and 62.26€ for 100... Adding assembly is >260€ for 5 (Seeed has a couple of references missing so it would be a bit more really), >420€ for 10, and >2000€ for 100. It seems those big 150uF capacitors are expensive, as are the connectors. There's about 65€ of fixed costs, and above a threshold some fees are removed, so mid-volume pricing is more reasonable - but ~25€ per item shipped is probably still too much.
For a very small volume, you need to assemble it yourself. Or you need to make enough that you can amortize the fixed costs. Guess I'll stick with PS/2 for now - unless you manage to convince Digilent to put the Pmod in production :-)
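To see how the fixed costs amortize, the quoted totals can be turned into per-board figures. The sketch below works in integer cents to stay within plain shell arithmetic; per_unit is an illustrative helper, and since the totals above are lower bounds (">"), so are the results.

```shell
# Usage: per_unit TOTAL_IN_CENTS QUANTITY  -> prints euros per board
per_unit() {
    cents=$(( $1 / $2 ))
    printf '%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

per_unit 26000 5      # assembled, 5 boards   (>260 EUR total): 52.00
per_unit 42000 10     # assembled, 10 boards  (>420 EUR total): 42.00
per_unit 200000 100   # assembled, 100 boards (>2000 EUR total): 20.00
```

Shipping is not included in these numbers, which is consistent with the ~25€ per item shipped estimate for the largest batch.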
Thanks for the BOM. I'm guessing the long 26p header and the 47uF capacitor are for something else? (They don't appear on the schematic.)
Oh right, that was some provisioning, and the 47 uF was a mistake XD You just need something to solder onto the little 4-pin, 2.54 mm pitch connector.
Fortunately the Wukong exposes the 5V input on one of the headers.
Same for the ArtyA7. That's not ideal, but it was either this or adding some step-up chip. I looked for some affordable ones, but they were all in packages that are a pain to solder, except one which was out of stock until 2022 XD. So I just gave up on that.
not sure many keyboards would work with just 3.3V.
I guess it won't work.
For a very small volume, you need to assemble it yourself. Or you need to make enough that you can amortize the fixed costs. unless you manage to convince Digilent to put the Pmod in production :-)
Right XD
Hi, @Dolu1990 :
Many thanks. I'll try what you described for building WindowMaker. If I get a workable flow, I'll share it.
BR, Sanada
@SanadaShinken There it is : https://drive.google.com/file/d/1Ujr5UWIy7ArFIWtd7HkI6QyaRoG1Z2Ki/view?usp=sharing
Generated from : https://github.com/SpinalHDL/buildroot-spinal-saxon/blob/usb/configs/saxon_arty_a7_35_defconfig
Dear Sir:
Does anyone else have the VexRiscv SMP boot failure problem? Code bases newer than 2021-04-01 seem to fail to boot if the CPU count is > 1.
If the CPU count is set to 1, boot works very well: it launches the Linux kernel and rootfs.
Where should I look? Many thanks.
BR, Sanada