litex-hub / linux-on-litex-vexriscv

Linux on LiteX-VexRiscv
BSD 2-Clause "Simplified" License

about vexriscv smp boot up failed #229

Open SanadaShinken opened 3 years ago

SanadaShinken commented 3 years ago

Dear Sir:

Does anyone else have the VexRiscv SMP boot-up failure problem? Code bases newer than 2021-04-01 seem to fail to boot if the CPU count is > 1.

With the CPU count set to 1, boot-up works fine: it launches the Linux kernel and the rootfs.

Where should I look? Many thanks.

BR, Sanada

Dolu1990 commented 3 years ago

Hi,

Which board are you on? Did you check the place-and-route report?

BR Charles

SanadaShinken commented 3 years ago

Hi Charles:

I tested on the QMTech Wukong board and on our own FPGA board. The FPGA on both boards is an XC7A100T; only the GPIOs are different.

I have not checked P&R. Could you do me a favour and tell me how to check the P&R result?

BTW, if I enable FPU support on one CPU, the result is the same as with multiple CPUs.

BR, Sanada

jluebbe commented 3 years ago

Do you mean that even the BIOS is not starting with --cpu-count 2 (while it works fine with single core)?

If so, I see the same issue on the ecpix5. The log looks fine: both utilization (TRELLIS_SLICE: 16826/41820 40%) and max frequency ('$glbnet$sdrio_clk': 65.02 MHz (PASS at 50.00 MHz)). I don't have a good idea where to continue debugging.
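For anyone wondering what to look for in a place-and-route log, the two figures quoted above can be extracted mechanically. This is only a minimal Python sketch: it assumes the log contains fragments shaped exactly like the ones in this comment, and the regular expressions and the name `check_pnr` are illustrative, not part of LiteX or any P&R tool.

```python
import re

# Fragments copied from the comment above; a real log has more text around them.
LOG = """
TRELLIS_SLICE: 16826/41820 40%
'$glbnet$sdrio_clk': 65.02 MHz (PASS at 50.00 MHz)
"""

# Illustrative patterns, written only against the two fragments above.
UTIL_RE = re.compile(r"(\w+):\s+(\d+)/(\d+)\s+(\d+)%")
FMAX_RE = re.compile(r"'([^']+)':\s+([\d.]+) MHz \((PASS|FAIL) at ([\d.]+) MHz\)")

def check_pnr(log):
    """Return (utilization, clocks) extracted from a P&R log excerpt."""
    util = {m.group(1): int(m.group(2)) / int(m.group(3))
            for m in UTIL_RE.finditer(log)}
    clocks = {m.group(1): (float(m.group(2)), float(m.group(4)), m.group(3) == "PASS")
              for m in FMAX_RE.finditer(log)}
    return util, clocks

util, clocks = check_pnr(LOG)
for name, frac in util.items():
    print(f"{name}: {frac:.1%} used")
for name, (fmax, target, ok) in clocks.items():
    status = "PASS" if ok else "FAIL"
    print(f"{name}: {fmax} MHz vs {target} MHz target, margin {fmax - target:.2f} MHz, {status}")
```

A comfortable margin on both numbers, as here, suggests the failure is probably not a plain timing or capacity problem, which is consistent with where the thread goes next.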

Dolu1990 commented 3 years ago

The first thing to do when stuff is broken is to check whether it runs in the LiteX simulation with a similar config (https://github.com/litex-hub/linux-on-litex-vexriscv#running-the-litex-simulation).

Can you give it a try? This is the main factor that will decide how to debug it.

SanadaShinken commented 3 years ago

Hi @jluebbe :

Yes!!! You are right. The BIOS does not boot in the 2-CPU configuration. I'll check the P&R result.

BR, Sanada

SanadaShinken commented 3 years ago

Hi @Dolu1990 :

OK, I see. I'll run some tests in simulation with different CPU counts. Many thanks.

BR, Sanada

SanadaShinken commented 3 years ago

Hi @Dolu1990 :

The log file sim_cpux4.log is the output of "./sim.py --cpu-counts 4": sim_cpux4.log

The log file sim_cpux1.log is the output of "./sim.py --cpu-counts 1": sim_cpux1.log

The image is the prebuilt image from https://github.com/litex-hub/linux-on-litex-vexriscv/issues/164

In sim_cpux4.log, the loading process seems to stop at "[ 0.422129] Unpacking initramfs..."

BR, Sanada

Dolu1990 commented 3 years ago

@SanadaShinken

Thanks

How long did you run the x4 sim compared to the x1 sim? The thing is that the x4 is significantly slower to simulate, and the unpack takes quite a long time:

[ 0.428819] Unpacking initramfs...
[ 3.431514] Freeing initrd memory: 8192K

Anyway, I will give it a try tomorrow on an Arty A7 35T with 2 cores.

SanadaShinken commented 3 years ago

Hi, @Dolu1990 :

Many Thanks.

I waited about 3 hours for the 4-CPU simulation and about 1 hour for the 1-CPU simulation.

BR, Sanada

Dolu1990 commented 3 years ago

After updating the LiteX tools, I just tried:

./make.py --cpu-count 2 --local-ip=192.168.0.159 --remote-ip=192.168.0.24 --build --load --board=arty

This worked fine on my Arty A7 35T.

Can you try on your board, especially with --cpu-count 2?

SanadaShinken commented 3 years ago

Hi @Dolu1990 :

The result is the same: it still fails, with no BIOS output.

BR, Sanada

jluebbe commented 3 years ago

After some help from @enjoy-digital on #litex, I was able to build a working dual-core image. The trick was to disable the L2 (l2_size): https://github.com/jluebbe/linux-on-litex-vexriscv/commit/dd2d33f7a9b688643ac1d57dc3c1fba8490aff90

Dolu1990 commented 3 years ago

@jluebbe > "l2_size"

Can you tell me more about this? I'm not aware of why there would be bad interactions.

jluebbe commented 3 years ago

I don't know the actual cause either. This change was proposed by @enjoy-digital in the IRC channel (florent). There are logs here: https://freenode.irclog.whitequark.org/litex/2021-05-04#29825181;

enjoy-digital commented 3 years ago

@Dolu1990: @rdolbeau and I thought about this because it was the main difference between Arty and ECPIX-5, but I haven't investigated further yet. I'll try to understand. If this is related to the Wishbone interface + cpu_count > 1, we should also be able to reproduce the behavior on other boards.

rdolbeau commented 3 years ago

@SanadaShinken The QMTech Wukong works fine in SMP for me (4 VexRiscv with extra instructions), but I don't use the L2 cache and Wishbone interface; instead I use the native litedram, as mentioned by @enjoy-digital. My codebase is from around mid-April. In make.py my board looks like this:

class qmtech_wukong(Board):
    SPIFLASH_PAGE_SIZE    = 256
    SPIFLASH_SECTOR_SIZE  = 64*kB
    SPIFLASH_DUMMY_CYCLES = 7
    soc_kwargs = {
        "sys_clk_freq": 100e6,
        "with_video_framebuffer": True,
        "video_timing": "800x600@60Hz",
        "ps2kbd": True
    }
    def __init__(self):
        from litex_boards.targets import qmtech_wukong
        Board.__init__(self, qmtech_wukong.BaseSoC, soc_capabilities={
            "serial",
            "ethernet",
            "sdcard",
            "leds",
            "icap_bitstream",
        }, bitstream_ext=".bit")

SanadaShinken commented 3 years ago

Hi @rdolbeau :

Thank you very much for sharing your settings. Before April 1st, the 4 cores ran very smoothly; after April 1st, the 4 cores stopped working. BTW, is the system bus that you use AXI-Lite or not?

Before April 1st, I could build 4 CPUs + 4 FPUs + AES instructions + video framebuffer on the QMTech Wukong board. The attached picture shows the P&R result from Vivado 2020.1.

photo

BR, Sanada

rdolbeau commented 3 years ago

@SanadaShinken Try without the L2 then, it should work fine.

I have no idea what the system bus is; it's whatever LiteX has by default with the configuration above. I think it is still Wishbone, but the DRAM is connected differently without the L2? @enjoy-digital will know better...

SanadaShinken commented 3 years ago

Hi @rdolbeau :

Many thanks. With the L2 size set to zero, everything works.

BTW, I found a very interesting point.

The DDR3 memory chip is an MT41K512M16HA, 1 GByte.

With the L2 size set to 8192, the memory speed is:

--=============== SoC ==================--
CPU: VexRiscv SMP-LINUX @ 100MHz
BUS: WISHBONE 32-bit @ 4GiB
CSR: 32-bit data
ROM: 64KiB
SRAM: 8KiB
L2: 8KiB
SDRAM: 1048576KiB 16-bit @ 800MT/s (CL-6 CWL-5)
--========== Initialization ============--
Write speed: 31MiB/s
Read speed: 20MiB/s

With the L2 size set to zero, the memory speed is:

--=============== SoC ==================--
CPU: VexRiscv SMP-LINUX @ 100MHz
BUS: WISHBONE 32-bit @ 4GiB
CSR: 32-bit data
ROM: 64KiB
SRAM: 8KiB
L2: 0KiB
SDRAM: 1048576KiB 16-bit @ 800MT/s (CL-6 CWL-5)
--========== Initialization ============--
Write speed: 31MiB/s
Read speed: 26MiB/s

The write operation does not seem to be affected by the L2 cache setting, but the read operation is. Any idea?

BR, Sanada
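To put a number on the observation above, the relative change can be computed directly from the two BIOS reports. A minimal Python sketch using only the four figures quoted in this comment (the dictionary keys are just labels chosen here):

```python
# BIOS memory-test figures quoted above, in MiB/s.
speeds = {
    "l2_8KiB": {"write": 31, "read": 20},
    "l2_0KiB": {"write": 31, "read": 26},
}

def relative_change(with_l2, without_l2):
    """Fractional change when the L2 is enabled, relative to no L2."""
    return (with_l2 - without_l2) / without_l2

read_delta = relative_change(speeds["l2_8KiB"]["read"], speeds["l2_0KiB"]["read"])
write_delta = relative_change(speeds["l2_8KiB"]["write"], speeds["l2_0KiB"]["write"])
print(f"read: {read_delta:+.1%}, write: {write_delta:+.1%}")
```

In this BIOS test, enabling the 8 KiB L2 costs roughly a quarter of the read speed and leaves the write speed unchanged.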

Dolu1990 commented 3 years ago

@SanadaShinken

The write operation does not seem to be affected by the L2 cache setting, but the read operation is. Any idea?

I guess that:

For write operations, VexRiscv does not need to wait for the completion of the write, so a cache miss in the L2 caused by a write will only stall the CPU through back-pressure on the DBus cmd stream, which has quite a few buffers on the way from the L1 to the L2 => no stall.

For reads, basically, I guess there is some penalty to refill the L2 and then the L1, instead of directly refilling the L1.

SanadaShinken commented 3 years ago

Hi @Dolu1990 :

Thank you for your explanation. I think you are right. It seems the cost of a write and of a read is not the same: with or without the cache, the write speed is the same, while for reads the extra cost reduces the speed.

BTW, is this memory access speed normal or not? I am wondering about this performance issue.

BR, Sanada

rdolbeau commented 3 years ago

@SanadaShinken The BIOS speed measurement might not be representative of how much bandwidth you can use, not even from one core. On my QMTech Wukong with four 100 MHz cores sharing a single FPU and using the STREAM benchmark (FORTRAN version with reduced memory footprint but otherwise unchanged), I get for 1/2/4 threads (somewhat abbreviated output):

The total memory requirement is   45 MB
Number of Threads =            1
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:               61.3242            0.5232            0.5218            0.5242
Scale:              43.6510            0.7349            0.7331            0.7367
Add:                56.2504            0.8550            0.8533            0.8562
Triad:              48.3942            0.9936            0.9919            0.9957
(...)
Number of Threads =            2
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:               85.1240            0.3769            0.3759            0.3786
Scale:              72.1662            0.4446            0.4434            0.4465
Add:                92.8279            0.5178            0.5171            0.5188
Triad:              86.7946            0.5547            0.5530            0.5559
(...)
Number of Threads =            4
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:               93.6368            0.3461            0.3417            0.3502
Scale:              89.0219            0.3616            0.3595            0.3649
Add:               110.2635            0.4375            0.4353            0.4415
Triad:             108.0181            0.4467            0.4444            0.4507

STREAM uses double-precision values, so each load/store addresses 8 bytes, which might be more efficient than what the BIOS does. Absolute numbers may vary depending on other parameters (I have 16 KiB caches and an expanded DTLB), but you very likely need more than one core to saturate the bus/interface - I'm not sure what the bottleneck(s) is/are.
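As a cross-check on how to read the table, STREAM derives each rate from the bytes a kernel moves divided by its minimum time: Copy and Scale touch two arrays, Add and Triad touch three. The Python sketch below recomputes the 1-thread rates and the Triad scaling. The array length of 2,000,000 elements is an assumption (it is not stated in the thread); it was picked because three such arrays of doubles are consistent with the "45 MB" line.

```python
N = 2_000_000          # assumed array length in elements; not stated in the thread
BYTES_PER_WORD = 8     # double precision, as noted above

# Arrays touched per element by each STREAM kernel.
ARRAYS = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# (reported rate in MB/s, min time in s) for 1 thread, from the output above.
ONE_THREAD = {
    "Copy": (61.3242, 0.5218),
    "Scale": (43.6510, 0.7331),
    "Add": (56.2504, 0.8533),
    "Triad": (48.3942, 0.9919),
}

# Reported Triad rates (MB/s) by thread count, from the output above.
TRIAD_RATE = {1: 48.3942, 2: 86.7946, 4: 108.0181}

def stream_rate(kernel, min_time):
    """Rate in MB/s (1 MB = 1e6 bytes) for one kernel under the assumed N."""
    return ARRAYS[kernel] * BYTES_PER_WORD * N / min_time / 1e6

for kernel, (reported, min_time) in ONE_THREAD.items():
    print(f"{kernel}: recomputed {stream_rate(kernel, min_time):.2f} MB/s, "
          f"reported {reported:.2f} MB/s")

for threads, rate in TRIAD_RATE.items():
    print(f"Triad, {threads} thread(s): {rate / TRIAD_RATE[1]:.2f}x the 1-thread rate")
```

Under that assumption the recomputed rates agree with the reported ones to within rounding of the printed times, and the scaling lines show that Triad gains much less from going from 2 to 4 threads than from 1 to 2.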

Dolu1990 commented 3 years ago

@SanadaShinken

BTW, the memory access speed is normal or not ? I'm very wonder for this performance issue.

We would need a benchmark that unrolls the read/write loop. I guess this would make a significant difference.

SanadaShinken commented 3 years ago

Hi @rdolbeau :

Thank you very much for sharing such a valuable benchmark. It could be the basis for testing the SW code and the HW unit. Very useful!!

BR, Sanada

SanadaShinken commented 3 years ago

Hi @Dolu1990 :

It's very kind of you to point out loop unrolling. If I have some time, I'll do some tests and share them in this project.

BR, Sanada

rdolbeau commented 3 years ago

@Dolu1990 @SanadaShinken Adding -funroll-loops when compiling STREAM with gfortran does offer some benefits, diminishing with the number of threads: for one thread, up to 10-12%; for 4 threads, it's barely noticeable. At some point the shared FPU might become a bottleneck, as it's doing most of the work. For Triad the inner loop with no unrolling is:

   10e7e:       221c                    fld     fa5,0(a2)
   10e80:       2198                    fld     fa4,0(a1)
   10e82:       06a1                    addi    a3,a3,8
   10e84:       0785                    addi    a5,a5,1
   10e86:       72f6f7c3                fmadd.d fa5,fa3,fa5,fa4
   10e8a:       05a1                    addi    a1,a1,8
   10e8c:       0621                    addi    a2,a2,8
   10e8e:       fef6bc27                fsd     fa5,-8(a3)
   10e92:       fee796e3                bne     a5,a4,10e7e <MAIN__._omp_fn.7+0x48>

4 out of 9 instructions have to be handled by the shared FPU including the load/stores. Not sure using 2 FPUs would fit in my FPGA, but it might improve throughput for the 4-threads case.
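The "4 out of 9" count can be reproduced from the listing above. A small Python sketch; the mnemonic list is transcribed by hand from the non-unrolled loop, and treating `fld`, `fsd` and `fmadd.d` as the FPU-handled set follows the statement in this comment rather than any VexRiscv documentation:

```python
# Instructions this loop sends to the FPU, per the comment above.
FPU_MNEMONICS = {"fld", "fsd", "fmadd.d"}

# Mnemonics of the non-unrolled Triad inner loop, in listing order.
LOOP = ["fld", "fld", "addi", "addi", "fmadd.d", "addi", "addi", "fsd", "bne"]

fpu = sum(1 for mnemonic in LOOP if mnemonic in FPU_MNEMONICS)
print(f"{fpu} of {len(LOOP)} instructions ({fpu / len(LOOP):.0%}) go through the FPU")

# Each iteration reads two doubles and writes one.
bytes_per_iter = 3 * 8
print(f"{bytes_per_iter / len(LOOP):.2f} bytes moved per instruction")
```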

EDIT: 2 FPUs fit, and they do offer some benefits for STREAM; Triad goes all the way up to 114 MB/s (unrolled or not).

For reference with -funroll-loops it becomes:

   11612:       0000b087                fld     ft1,0(ra)
   11616:       0005b007                fld     ft0,0(a1)
   1161a:       04070713                addi    a4,a4,64
   1161e:       04008093                addi    ra,ra,64
   11622:       0a07f143                fmadd.d ft2,fa5,ft0,ft1
   11626:       04058593                addi    a1,a1,64
   1162a:       0321                    addi    t1,t1,8
   1162c:       fc273027                fsd     ft2,-64(a4)
   11630:       fc85b187                fld     ft3,-56(a1)
   11634:       fc80b207                fld     ft4,-56(ra)
   11638:       2237f2c3                fmadd.d ft5,fa5,ft3,ft4
   1163c:       fc573427                fsd     ft5,-56(a4)
   11640:       fd05b307                fld     ft6,-48(a1)
   11644:       fd00b387                fld     ft7,-48(ra)
   11648:       3a67f543                fmadd.d fa0,fa5,ft6,ft7
   1164c:       fca73827                fsd     fa0,-48(a4)
   11650:       fd85b587                fld     fa1,-40(a1)
   11654:       fd80b607                fld     fa2,-40(ra)
   11658:       62b7f843                fmadd.d fa6,fa5,fa1,fa2
   1165c:       fd073c27                fsd     fa6,-40(a4)
   11660:       fe05b887                fld     fa7,-32(a1)
   11664:       fe00be07                fld     ft8,-32(ra)
   11668:       e317fec3                fmadd.d ft9,fa5,fa7,ft8
   1166c:       ffd73027                fsd     ft9,-32(a4)
   11670:       fe85bf07                fld     ft10,-24(a1)
   11674:       fe80bf87                fld     ft11,-24(ra)
   11678:       fbe7f743                fmadd.d fa4,fa5,ft10,ft11
   1167c:       fee73427                fsd     fa4,-24(a4)
   11680:       ff05b007                fld     ft0,-16(a1)
   11684:       ff00b687                fld     fa3,-16(ra)
   11688:       6a07f0c3                fmadd.d ft1,fa5,ft0,fa3
   1168c:       fe173827                fsd     ft1,-16(a4)
   11690:       ff85b107                fld     ft2,-8(a1)
   11694:       ff80b187                fld     ft3,-8(ra)
   11698:       1a27f243                fmadd.d ft4,fa5,ft2,ft3
   1169c:       fe473c27                fsd     ft4,-8(a4)
   116a0:       f7e319e3                bne     t1,t5,11612 <MAIN__._omp_fn.7+0x134>

rdolbeau commented 3 years ago

Obviously, on a one-instruction-per-cycle in-order core, GCC's default unrolling isn't great. Replacing the Triad assembly loop by a properly pipelined one (address computation & loop bound checking can probably be optimized more):

        fld     ft0,0(a1)
        fld     ft1,0(ra)
        addi    a1,a1,64
        addi    ra,ra,64
        fld     ft2,-56(a1)
        fld     ft3,-56(ra)
        addi    t1,t1,8
        fld     ft4,-48(a1)
        fld     ft5,-48(ra)
        addi    a4,a4,64
        fld     ft6,-40(a1)
        fld     ft7,-40(ra)
        fld     fa0,-32(a1)
        fld     fa1,-32(ra)
        fld     fa2,-24(a1)
        fld     fa3,-24(ra)
        fld     fa4,-16(a1)
        fld     ft9,-16(ra)
        fld     fa6,-8(a1)
        fld     fa7,-8(ra)
        fmadd.d ft1,fa5,ft0,ft1
        fmadd.d ft3,fa5,ft2,ft3
        fmadd.d ft5,fa5,ft4,ft5
        fmadd.d ft7,fa5,ft6,ft7
        fmadd.d fa1,fa5,fa0,fa1
        fmadd.d fa3,fa5,fa2,fa3
        fmadd.d ft9,fa5,fa4,ft9
        fmadd.d fa7,fa5,fa6,fa7
        fsd     ft1,-64(a4)
        fsd     ft3,-56(a4)
        fsd     ft5,-48(a4)
        fsd     ft7,-40(a4)
        fsd     fa1,-32(a4)
        fsd     fa3,-24(a4)
        fsd     ft9,-16(a4)
        fsd     fa7,-8(a4)
        bne     t1,t5,.L209

I get 122 MB/s for one thread and 192 MB/s for two (on cores with non-shared FPUs), despite a G++ instance running on another core (which is also why I don't have a 4-thread result: the SoC is a bit busy).

Now we need a proper machine description in GCC for those long-latency instructions :-)
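A rough way to see how far the hand-pipelined loop is from the instruction-issue limit is to compare the measured rate with an idealized ceiling. The Python sketch below assumes a 100 MHz core clock (taken from the "100 MHz" cores mentioned earlier in the thread, which may not hold for this particular build) and an ideal one instruction per cycle with no stalls; both are simplifications, not measurements.

```python
CLOCK_HZ = 100e6             # assumed core clock, from the "100 MHz" cores mentioned earlier
INSTRUCTIONS = 37            # 16 fld + 8 fmadd.d + 8 fsd + 4 addi + 1 bne in the loop above
BYTES_PER_ITER = 8 * 3 * 8   # 8 elements per iteration, 3 doubles each

# Ceiling if every instruction retired in one cycle with no stalls.
issue_bound = BYTES_PER_ITER * CLOCK_HZ / INSTRUCTIONS / 1e6
print(f"issue-bound ceiling: {issue_bound:.0f} MB/s")

measured = 122.0             # MB/s, one thread, from the comment above
cycles_per_iter = BYTES_PER_ITER / (measured * 1e6) * CLOCK_HZ
print(f"measured: about {cycles_per_iter:.0f} cycles per iteration, "
      f"{cycles_per_iter / INSTRUCTIONS:.1f} cycles per instruction")
```

Under these assumptions the loop spends several cycles per instruction on average, which suggests that FPU latency and memory stalls, rather than instruction count, dominate the one-thread result.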

SanadaShinken commented 3 years ago

Hi, @rdolbeau :

It is very detailed. Once my XC7A200T board is ready, I want to repeat your test on it. BTW, do the Linux kernel config and the toolchain both enable FPU support?

BR, Sanada

rdolbeau commented 3 years ago

@Dolu1990 I seem to have stumbled upon a bug, maybe hardware, when pushing things. When running 4 threads, the optimized STREAM would produce wrong results and fail validation. After triple-checking my code, I started playing with the original and realized it would occasionally fail validation as well.

I also crashed the system once:

[57663.021112] Unable to handle kernel access to user memory without uaccess routines at virtual address 00000004
[57663.024133] Oops [#1]
[57663.024576] CPU: 0 PID: 4713 Comm: stream_unrolled Not tainted 5.12.0-171771-gc8543a1ea224-dirty #3
[57663.026866] epc : pick_next_task_fair+0x13e/0x316
[57663.028116]  ra : pick_next_task_fair+0x108/0x316
[57663.029320] epc : c00302ac ra : c0030276 sp : c28f7ee0
[57663.030591]  gp : c0602de8 tp : c1501a00 t0 : 02e7a225
[57663.031826]  t1 : 00000001 t2 : 0016e360 s0 : c28f7f20
[57663.033205]  s1 : cfdd6080 a0 : c08b4300 a1 : 6f6c626b
[57663.034616]  a2 : 00000000 a3 : 00000000 a4 : ffffffff
[57663.036044]  a5 : c0877254 a6 : 00000016 a7 : 00000000
[57663.037437]  s2 : c0877240 s3 : cfdd60c0 s4 : 000002f0
[57663.038797]  s5 : aafd0498 s6 : c0877240 s7 : c1501d8c
[57663.040133]  s8 : 00000000 s9 : 35250420 s10: 34b27460
[57663.041590]  s11: 352508e0 t3 : 0000036c t4 : 02d7b621
[57663.042971]  t5 : 02e23612 t6 : c057c1d0
[57663.043812] status: 00000100 badaddr: 00000004 cause: 0000000f
[57663.045505] Call Trace:
[57663.045972] [<c00302ac>] pick_next_task_fair+0x13e/0x316
[57663.047068] [<c04521fe>] __schedule+0xb6/0x404
[57663.048152] [<c0452582>] schedule+0x36/0xa0
[57663.049248] [<c000203e>] ret_from_exception+0x0/0xc

After the reboot one of my first attempts led to this on the console (but not a kernel crash):

buildroot login: [   74.324988] BUG: scheduling while atomic: stream_unrolled/177/0xffff0000                                                                                        
[   74.326656] CPU: 0 PID: 177 Comm: stream_unrolled Not tainted 5.12.0-171771-gc8543a1ea224-dirty #3                                                                               
[   74.328846] Call Trace:                                                                
[   74.329325] [<c00033f6>] walk_stackframe+0x0/0xca                                      
[   74.330360] [<c044d570>] dump_backtrace+0x38/0x46                                      
[   74.331361] [<c044d58c>] show_stack+0xe/0x16                                           
[   74.332264] [<c04514ce>] dump_stack+0x6c/0x8a                                          
[   74.333265] [<c00257b6>] __schedule_bug+0x56/0x66                                      
[   74.334272] [<c04524ba>] __schedule+0x372/0x404                                        
[   74.335411] [<c0452582>] schedule+0x36/0xa0                                            
[   74.336480] [<c000203e>] ret_from_exception+0x0/0xc   

As it's parallelized using OpenMP, it relies on atomic instructions for various synchronizations in libgomp (mostly amoswap and amoadd). The final reduction can get messed up if synchronization fails (it also relies on cache coherency to work properly so that results from one core can be read by another).

With fewer than 4 threads, it seems to always work, but the work distribution is different and might hide/mitigate the issue.

Could there be some unreliability of the atomics when I push the caches/memory to the limit like this ? (I get about 220 MB/s from the Triad).

EDIT: after another crash & reboot, the optimized version now validates every time despite it being the exact same binary... weird. Maybe something that was corrupted in the kernel leading to non-working synchronizations? Maybe adding the PmpPlugin would help.

rdolbeau commented 3 years ago

It is very detailed. Once my XC7A200T board is ready, I want to repeat your test on it. BTW, do the Linux kernel config and the toolchain both enable FPU support?

In the current linux-on-litex-vexriscv, no, you need to enable things by hand mostly - the default configs assume you're running without FPU (or compressed instructions).

You need to enable the FPU in the bitstream with --with-fpu (and optionally set the number of FPUs with --cpu-per-fpu, as they are shared). This should change other parameters to make sure all requirements are met (bus widths, ...).

Then edit the buildroot config from the repo (buildroot/configs/litex_vexriscv_defconfig): enable BR2_RISCV_ISA_CUSTOM_RVF and BR2_RISCV_ISA_CUSTOM_RVD (and optionally BR2_RISCV_ISA_CUSTOM_RVC for compressed instructions), and use BR2_RISCV_ABI_ILP32D=y (and BR2_RISCV_ABI_ILP32=n) so you get the hard-float ABI instead of the soft-float one. You probably also need a more capable cross-compiler (I use a native one but that's another story), by enabling e.g.:

BR2_GCC_VERSION_10_X=y
BR2_GCC_ENABLE_OPENMP=y
BR2_INSTALL_LIBSTDCPP=y
BR2_TOOLCHAIN_BUILDROOT_CXX=y
BR2_TOOLCHAIN_BUILDROOT_FORTRAN=y

Then enable the FPU in the Linux kernel (buildroot/board/litex_vexriscv/linux.config); the first line depends on whether you selected C in buildroot:

CONFIG_RISCV_ISA_C=y
CONFIG_FPU=y
CONFIG_SMP=y

Then rebuild the buildroot from scratch.

You probably need a matching OpenSBI too, so I suggest recompiling it as well.

I think that should do it and enable you to run hard-float binaries on the SoC; just make sure the DTB/OpenSBI/kernel Image/buildroot all match.
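Since a mismatch between these files is easy to miss, a quick scripted check can help. The following is only a Python sketch: the option names are the ones listed above, but the helper names (`parse_config`, `missing`) and the sample text are hypothetical, and in practice you would read the two files from the paths given above instead of the inline sample.

```python
# Options named in the steps above and the values they should have.
REQUIRED_BUILDROOT = {
    "BR2_RISCV_ISA_CUSTOM_RVF": "y",
    "BR2_RISCV_ISA_CUSTOM_RVD": "y",
    "BR2_RISCV_ABI_ILP32D": "y",
}
REQUIRED_KERNEL = {"CONFIG_FPU": "y", "CONFIG_SMP": "y"}

def parse_config(text):
    """Parse KEY=value lines; '# KEY is not set' is recorded as 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# ") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options

def missing(text, required):
    """Return the required options that are absent or have another value."""
    options = parse_config(text)
    return sorted(key for key, value in required.items() if options.get(key) != value)

# Hypothetical excerpt, for demonstration only.
sample_defconfig = """
BR2_RISCV_ISA_CUSTOM_RVF=y
# BR2_RISCV_ISA_CUSTOM_RVD is not set
BR2_RISCV_ABI_ILP32D=y
"""
print(missing(sample_defconfig, REQUIRED_BUILDROOT))
```

The same `missing` call with `REQUIRED_KERNEL` can be pointed at the kernel config text.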

Dolu1990 commented 3 years ago

@rdolbeau

I seem to have stumbled upon a bug, maybe hardware

That's scary ^^

Could there be some unreliability of the atomics when I push the caches/memory to the limit like this ?

This could be a hardware bug. Here is an example of a bug we already had in the past (and fixed, unless it isn't totally fixed):

1) Each CPU needs to know when the writes it has issued are visible to the other CPUs.
2) litedram does not provide "active feedback" when a write is done; so far we assume that when the write request leaves the litedram slave interface it is visible (is it really?).

This kind of thing can lead to memory fences not being properly applied, leading to memory consistency issues, especially when things get quite busy.

So far I'm not aware of any issues, but you may have triggered one.

Then the questions are :

Also a few things about this kind of cases to help debugging :

Maybe we should write some self-checking code that heavily tests atomics in Linux user space. That's not something I have much of so far XD

What is BUG: scheduling while atomic ?
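As a starting point for the self-checking test suggested above, here is a minimal Python sketch of the pattern: several threads hammer a shared counter under a lock and the final value is validated. It is only a stand-in. Python's `threading.Lock` reaches the hardware atomics indirectly, through the C library and the kernel, and the interpreter serializes most of the work, so a real stress test would be written in C or assembly and use the AMO instructions directly. The thread and iteration counts are arbitrary.

```python
import threading

THREADS = 4
INCREMENTS = 20_000

def run_counter_test(threads=THREADS, increments=INCREMENTS):
    """Increment a shared counter from several threads and return the final value."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:          # every update goes through the lock
                counter += 1

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return counter

expected = THREADS * INCREMENTS
got = run_counter_test()
print("PASS" if got == expected else f"FAIL: expected {expected}, got {got}")
```

Running such a check in a loop while the memory system is under load is one way to tell a synchronization failure apart from a benchmark bug.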

rdolbeau commented 3 years ago

@Dolu1990 As I mentioned in an edit, after a reboot the issue has disappeared despite the fact it was there during the two previous boots (unchanged FPGA configuration/bitstream). I was able to run the benchmark multiple times with no crash or internal failures since.

The only "issue" was that performance on optimized Triad quad-threads would drop from ~220 MB/s to ~173 MB/s with some minor changes in OpenMP configuration (i.e. simply setting the variable that enables showing all OpenMP variables for debug, which doesn't affect execution in any way, triggered the drop). However, the OpenMP runtime in GCC is very new on RV32, so that's probably just teething issues (I tried recompiling the LLVM runtime but it doesn't support RV32). That's a purely software concern.

I can't tell for the first boot where I saw the issue, but the second one had an early error with the kernel message referenced above (BUG: scheduling while atomic), which might have been the cause of all the subsequent issues rather than a symptom - OpenMP relies on kernel locks for some stuff; if those are corrupted somehow, it can cause such weird transient failures.

Would having the PMPs enabled help protect the kernel from rogue accesses to memory? (I don't see PmpPlugin as an option for the LiteX SoC.) I don't even know if the PMPs would be used by the current Linux kernel...

Sorry, I've yet to try the Saxon SoC so I can't tell.

As of now I wouldn't classify it as a hardware bug; it could easily be a software bug that corrupted the kernel somehow (HW bugs don't disappear between reboots, kernel corruption usually does). If it comes back I'll try to characterize it, but it's tough as I can't get GDB to be useful (I didn't manage to enable enough breakpoint support to get the 'watch' command to work).

Edit: forgot to say: BUG: scheduling while atomic is a kernel message and I've no idea of the cause...

Dolu1990 commented 3 years ago

The only "issue" was that performance on optimized Triad quad-threads would drop from ~220 MB/s to ~173 MB at some minor

To me, it can be explained by some bad luck with the alignment of instructions/data ending up in more cache thrashing. VexRiscv is quite subject to that, as its caches have a low number of ways.
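As a rough illustration of the thrashing effect described above, here is a toy Python model of a set-associative cache with LRU replacement, fed with a Triad-like access pattern (read b[i], read c[i], write a[i]). The cache size, line size, and array addresses are made up for the example; they are not VexRiscv's actual cache configuration, and the model ignores everything except hit/miss counting.

```python
from collections import OrderedDict

def count_misses(addresses, cache_bytes=4096, line_bytes=32, ways=1):
    """Count misses in a toy set-associative cache with LRU replacement."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets      # which set the line maps to
        tag = line // num_sets       # identifies the line within the set
        entries = sets[index]
        if tag in entries:
            entries.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(entries) >= ways:
                entries.popitem(last=False)  # evict the least recently used line
            entries[tag] = True
    return misses

def triad_like_trace(base_a, base_b, base_c, n=64, elem=8):
    """Interleaved accesses b[i], c[i], a[i], as in a Triad loop body."""
    trace = []
    for i in range(n):
        off = i * elem
        trace += [base_b + off, base_c + off, base_a + off]
    return trace

# Hypothetical placements: "aligned" puts all three arrays on the same sets,
# "shifted" moves b and c by a few cache lines.
aligned = triad_like_trace(0x10000, 0x20000, 0x30000)
shifted = triad_like_trace(0x10000, 0x20040, 0x30080)

for ways in (1, 2, 4):
    print(ways, count_misses(aligned, ways=ways), count_misses(shifted, ways=ways))
```

In this toy setup the aligned placement misses on all 192 accesses with 1 or 2 ways and only 48 times with 4 ways, while the shifted placement misses 48 times in every case. A small change in where data lands can therefore matter a lot when associativity is low.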

OpenMP rely on kernel locks for some stuff

<3

Would having the PMPs enabled helps protect the kernel from rogue accesses to memory ? (I don't see PmpPlugin has an option for the Litex SoC). I don't even know if the PMPs would be used by current Linux kernel...

The only thing the PMP would protect in our case is OpenSBI code/data from being accessed by the kernel. So the use case of the PMP for us is very, very limited (it can be useful in other cases: security, multiple supervisors running).

Linux will never configure the PMP; that's something done at the machine-mode level in OpenSBI, as far as I have seen.

@rdolbeau

Sorry, I've yet to try the Saxon SoC so I can't tell.

No worries ^^ It's just a good thing to have another platform to cross-check whether bugs persist. It is still quite a lot of work to set up both flows and cross-check. That's just in case one day we have something really bad happening.

HW bug don't disappear between reboots

Yes they can, if your karma is too bad ^^ So far, for me, the worst hardware bug randomly took up to 1 hour to appear, so this kind of thing won't necessarily appear in a recurrent manner. But right, normally, they take 30 seconds at most.

If it comes back I'll try to characterize it but it's tough as I can't get GDB to be useful

Having a software GDB? (no JTAG involved?) I would like to try that once. I did some changes recently in the DebugPlugin to allow this to work properly, but it needs ebreak support enabled in the CsrPlugin config; did you activate it?

@rdolbeau I don't know if that's something interesting for you, but I implemented a USB host OHCI controller. It works well on Linux, X11 is also running well : screenshot

It isn't ported to LiteX so far, but it is in the plan :)

jluebbe commented 3 years ago

forgot to say: BUG: scheduling while atomic is a kernel message and I've no idea of the cause...

This is reported by the kernel if schedule() is called from atomic context, which is either a kernel bug or caused by corruption. Atomic context is, for example, inside an IRQ handler or while interrupts/preemption are disabled.
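To make that check concrete, here is a deliberately simplified Python model. It is not kernel code: the class name and the single counter are stand-ins for the kernel's per-CPU preemption state, and the real check covers more conditions. It only shows why a corrupted counter can produce the same message as a genuine bug.

```python
class ToyCpu:
    """Toy stand-in for per-CPU scheduler state (hypothetical, not kernel code)."""

    def __init__(self):
        self.preempt_count = 0  # non-zero means "atomic context" in this model

    def preempt_disable(self):
        self.preempt_count += 1

    def preempt_enable(self):
        self.preempt_count -= 1

    def schedule(self):
        # The scheduler refuses to switch tasks while in atomic context.
        if self.preempt_count != 0:
            return "BUG: scheduling while atomic"
        return "ok"


cpu = ToyCpu()
print(cpu.schedule())          # normal context

cpu.preempt_disable()
print(cpu.schedule())          # genuine bug: schedule() inside an atomic section
cpu.preempt_enable()

cpu.preempt_count = 0x100      # simulated memory corruption of the counter
print(cpu.schedule())          # same message, although the calling code is fine
```

The last case is the one relevant to this thread: if memory holding such state gets corrupted, the report shows up even though no code path actually misbehaved.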

rdolbeau commented 3 years ago

@jluebbe Thanks, it pushes toward the idea of kernel corruption rather than hardware bug.

@Dolu1990 Yes, software GDB. Tried adding ebreak to the CSR plugin and some hardware breakpoints, but 'watch' still didn't work. I may easily have messed it up, I don't really understand how that all works. Also GDB may be buggy; the version compiled by Buildroot has some weird warnings but can do some things, while the version I've recompiled from the B-enabled sources is useless, it won't even initialize properly.

Also, nice X11 :-) I've got it working as well (with WindowMaker rather than TWM :-) ) on the Litex FB, using my PS/2 controller, but I only have a keyboard, no mouse (yet, an ex-colleague has one in his attic...). USB host would be great, but doesn't that require a PHY of some sort? Could it work with any FPGA board? (main reason to use PS/2 was, it's a fairly simple 2-pin protocol and there's a Pmod with the right connector).

Dolu1990 commented 3 years ago

@rdolbeau

WindowMaker

Hooo I will have to try that :D I was looking for packages to try these last few days XD

USB host would be great, but doesn't that requires a PHY of some sort

If you stay at USB 1.1 (1.5 + 12 Mbps) you only really need 2 outputs of the FPGA + 2. No PHY involved. I made that PMOD to get 4 ports : https://github.com/Dolu1990/pmod_usb_host_x4/blob/main/pmods.pdf

It adds current limiting + clamping diodes, but both are not sooooo important ^^

Could it work with any FPGA board

Yes, as far as i can see. Also, the CPU usage overhead is pretty low.

main reason to use PS/2 was, it's a fairly simple 2-pins protocol and there's a Pmod with the right connector

Right, for USB host, I guess there is no PMOD that you can order commercially, as this is too specific.

rdolbeau commented 3 years ago

@Dolu1990 Do you plan to make a batch of those Pmods? Because now I want one :-) USB 1.1 is vastly fast enough for keyboard/mouse and probably even some basic USB storage. And with a powered USB hub, you don't even have to worry about the power supply of the board. It's way better than PS/2 :-)

SanadaShinken commented 3 years ago

Hi, @rdolbeau , @Dolu1990:

for USB host, in my FPGA board, I use USB3300, ULPI interface, for USB2.0. FT602Q for USB3.0. Is this design right or not?

BTW, for @Dolu1990's 4-port USB PMOD, is the OHCI implemented using Migen, or by integrating a Verilog code base?

BR, Sanada

SanadaShinken commented 3 years ago

@rdolbeau I don't know if that something interresting for you, but i implemented USB host OHCI controller. That works well on linux, x11 is also running well : screenshot

It isn't ported on litex so far, but it is in the plan :)

Hi, @Dolu1990 :

For X window on ARTY, would you like to share the rootfs or provide some tutorial to show how to make the GUI work?

BR, Sanada

SanadaShinken commented 3 years ago

It is very detailed. After my XC7A200T board is ready, I want to repeat your test on my board. BTW, do the Linux kernel config and the toolchain both enable FPU support?

In the current linux-on-litex-vexriscv, no, you mostly need to enable things by hand: the default configs assume you're running without FPU (or compressed instructions).

You need to enable the FPU in the bitstream with --with-fpu (and optionally set the number of FPUs with --cpu-per-fpu, as they are shared). This should change other parameters to make sure all requirements are met (bus widths, ...).

Then enable it in the buildroot config from the repo (buildroot/configs/litex_vexriscv_defconfig): enable BR2_RISCV_ISA_CUSTOM_RVF and BR2_RISCV_ISA_CUSTOM_RVD (and optionally BR2_RISCV_ISA_CUSTOM_RVC for compressed instructions), and use BR2_RISCV_ABI_ILP32D=y (and BR2_RISCV_ABI_ILP32=n) so you get the hard-float ABI instead of the soft-float one. You probably also need a more capable cross-compiler (I use a native one but that's another story), by enabling e.g.:

BR2_GCC_VERSION_10_X=y
BR2_GCC_ENABLE_OPENMP=y
BR2_INSTALL_LIBSTDCPP=y
BR2_TOOLCHAIN_BUILDROOT_CXX=y
BR2_TOOLCHAIN_BUILDROOT_FORTRAN=y

Then enable the FPU in the Linux kernel (buildroot/board/litex_vexriscv/linux.config); the first line depends on whether you selected C in buildroot:

CONFIG_RISCV_ISA_C=y
CONFIG_FPU=y
CONFIG_SMP=y

Then rebuild the buildroot from scratch.

You probably need a matching OpenSBI as well, so I suggest recompiling it too.

I think that should do it and enable you to run hard-float binaries on the SoC; just make sure the DTB/OpenSBI/kernel Image/buildroot all match.
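Since the whole chain has to agree, one way to catch a mismatch early is to check the config files against the options listed above. This is only a sketch: the helper names are invented for the example, the option lists are copied from this comment (not an exhaustive set), and treating an absent symbol as "n" is an assumption.

```python
def parse_config(text):
    """Return {symbol: value}; '# SYMBOL is not set' lines map to 'n'."""
    options = {}
    suffix = " is not set"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options

def mismatches(text, wanted):
    """Return {symbol: (found, wanted)} for every option that differs."""
    have = parse_config(text)
    return {k: (have.get(k, "n"), v)       # assumption: absent symbol means 'n'
            for k, v in wanted.items() if have.get(k, "n") != v}

# Options taken from the instructions above (hard-float build).
BUILDROOT_WANTED = {
    "BR2_RISCV_ISA_CUSTOM_RVF": "y",
    "BR2_RISCV_ISA_CUSTOM_RVD": "y",
    "BR2_RISCV_ABI_ILP32D": "y",
}
KERNEL_WANTED = {"CONFIG_FPU": "y", "CONFIG_SMP": "y"}

# Made-up defconfig excerpt for demonstration.
sample = """\
BR2_RISCV_ISA_CUSTOM_RVF=y
# BR2_RISCV_ISA_CUSTOM_RVD is not set
BR2_RISCV_ABI_ILP32D=y
"""
print(mismatches(sample, BUILDROOT_WANTED))
```

In practice the text would be read from buildroot/configs/litex_vexriscv_defconfig and buildroot/board/litex_vexriscv/linux.config; an empty result means the listed options are all set as described.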

Hi, @rdolbeau :

Thank you!! My bitstream, Linux kernel, rootfs and toolchain are all FPU-enabled, but I have to check that the settings are the same as the ones you mentioned. For OpenSBI, I didn't rebuild it with FPU enabled. I think following your suggestion to rebuild it is the safe option.

BR, Sanada

Dolu1990 commented 3 years ago

@rdolbeau

Do you plan to make a batch of those Pmods?

So my plan with that Pmod is to have a proper way to test the USB stuff. I didn't really plan to produce batches of it, as I'm not really great at logistics XD. Have you already ordered PCBs / parts and assembled them in the past?

even some basic USB storage.

So, so far, there is my compatibility list :

WindowMaker

I didn't find it as a buildroot package, so I assume you built it separately, right?

I'm quite a noob when it comes to software build flows, so I tried :

git clone https://github.com/window-maker/wmaker
cd wmaker
export PATH=PATH_TO_GCC_TOOLCHAIN:$PATH
./configure --host=riscv32-buildroot-linux-gnu --prefix=PATH_TO_ROOT/usr/local
make 
make install

But "make install" ends up with :

/media/data/open/SaxonSoc/artyA7SmpUsb/buildroot-build/host/bin/../lib/gcc/riscv32-buildroot-linux-gnu/10.2.0/../../../../riscv32-buildroot-linux-gnu/bin/ld: skipping incompatible /lib32/libc.so.6 when searching for /lib32/libc.so.6
/media/data/open/SaxonSoc/artyA7SmpUsb/buildroot-build/host/bin/../lib/gcc/riscv32-buildroot-linux-gnu/10.2.0/../../../../riscv32-buildroot-linux-gnu/bin/ld: cannot find /lib32/libc.so.6
/media/data/open/SaxonSoc/artyA7SmpUsb/buildroot-build/host/bin/../lib/gcc/riscv32-buildroot-linux-gnu/10.2.0/../../../../riscv32-buildroot-linux-gnu/bin/ld: cannot find /usr/lib/libc_nonshared.a
/media/data/open/SaxonSoc/artyA7SmpUsb/buildroot-build/host/bin/../lib/gcc/riscv32-buildroot-linux-gnu/10.2.0/../../../../riscv32-buildroot-linux-gnu/bin/ld: cannot find /lib/ld-linux-riscv32-ilp32d.so.1
collect2: error: ld returned 1 exit status
libtool:   error: error: relink 'libWINGs.la' with the above command before installing it

Did you have a similar issue? Seems like it tries to link the host binary by looking at my own PC's libraries XD I will try to do a proper buildroot package XD

@SanadaShinken

for USB host, in my FPGA board, I use USB3300, ULPI interface, for USB2.0. FT602Q for USB3.0. Is this design right or not?

So, currently, I skipped some design phases to get USB running without any PHY. The design I did does not support an external PHY yet, but this should be possible in the future.

For X window on ARTY, would you like to share the rootfs or provide some tutorial to show how to make the GUI work?

Yes, my end goal is to document that. But so far I'm trying to stabilise the config. I guess having a proper doc explaining / pointing to all the different configs and their interactions would be useful, instead of swimming in an ocean of unknown stuff XD

Dolu1990 commented 3 years ago

@SanadaShinken

would you like to share the rootfs

I can do that too, just give me a few days to stabilise things ^^

rdolbeau commented 3 years ago

Did you already ordered PCB / parts and mounted them by the past ?

PCB yes, I made myself an adapter to plug an FPGA into the SBus slot of a 90s SPARCstation, but it was a fairly simple design. Then I used SeeedStudio PCBA; they assembled/soldered all of it, as I couldn't solder properly to save my life even back when my eyes were still working properly.

I just added some part numbers in KiCad (no idea what the ferrite should be or if the big capacitors are ceramic or tantalum or ... but for estimating the price it doesn't matter), and Seeed says about 280+ euros for 5 boards... assembly is expensive :-(

I didn't found it as a buildroot package, so i assum you builded asside right ?

Yes, and not just it :-) I self-hosted almost all the dependencies (thanks https://www.linuxfromscratch.org !), including Perl, Python, cmake, Xorg, ... using a B-enabled compiler (cross-compiled, as is the kernel itself). Roughly 1% of all instructions in binaries and libraries are from B; sh1add/sh2add/sh3add (they do a+(b<<n) with n=1,2,3, very useful for address computations; together they form Zba, a subset of B) see a lot of use.
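For readers unfamiliar with those instructions, here is a small Python model of what sh1add/sh2add/sh3add compute on RV32 (rd = rs2 + (rs1 << n), truncated to 32 bits). The function names mirror the mnemonics; the base address and index in the usage lines are arbitrary values chosen for illustration.

```python
MASK32 = 0xFFFFFFFF  # RV32 registers are 32 bits wide

def shnadd(rs1, rs2, n):
    """Model of the Zba shift-and-add: rd = rs2 + (rs1 << n), wrapped to 32 bits."""
    return (rs2 + (rs1 << n)) & MASK32

def sh1add(rs1, rs2):
    return shnadd(rs1, rs2, 1)   # index * 2 + base (16-bit elements)

def sh2add(rs1, rs2):
    return shnadd(rs1, rs2, 2)   # index * 4 + base (32-bit elements)

def sh3add(rs1, rs2):
    return shnadd(rs1, rs2, 3)   # index * 8 + base (64-bit elements)

# Address of element 5 in an array of 4-byte ints at a made-up base address.
base = 0x80001000
print(hex(sh2add(5, base)))      # 0x80001014
# Same index into an array of doubles (8-byte elements).
print(hex(sh3add(5, base)))      # 0x80001028
```

Without Zba the same address computation takes a separate shift and an add, which is why these show up so often in compiled code that indexes arrays.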

I'm quite a noob when it is about software build flow. so i tried : (...) Did you had a similar issue ?

You need a cross-environment with the proper libraries, which is always complicated. That's why buildroot or yocto exist: to hide the mess... I avoid cross-compiling as much as I can, except for bootstrapping a native environment as there's no other option.

So I did not have the issue because I cheated: Litex/VexRiscv compiled its own stuff :-) That's why I've filled the FPGA with cores. My compiler was cross-compiled using the Buildroot cross-compiler with ../gcc-10.2.0/configure --prefix=/usr/local --host=riscv32-buildroot-linux-gnu --target=riscv32-buildroot-linux-gnu --with-build-sysroot=/mnt --disable-multilib --enable-languages=c,c++,fortran, the micro-sd card root was mounted on /mnt. Then the binutils in a similar way. From there it just took the poor SoC a lot (and lot and lot) of time. And a small 20x20x10mm heatsink on top of the FPGA just in case :-) Quite a few packages were already in the buildroot, but most of them were recompiled anyway with the 'proper' B-enabled compiler.

Dolu1990 commented 3 years ago

@rdolbeau

Here is what I ordered as components from Mouser (it does not include resistors above 27 ohm). The quantities are over-provisioned. I have enough material in stock to mount a second board and ship it hidden in a chocolate box. usbx4order.txt

The only "down-side" of the actual design is that you have to provide 5V to the pmod via the 4 pins connector.

and Seeed says about 280+ euros for 5 boards... assembly is expensive :-(

I never tried external assembly, I really had no idea of the cost.

https://www.linuxfromscratch.org

Ahh thanks, I didn't know about those resources :D

So I did not have the issue because I cheated: Litex/VexRiscv compiled its own stuff :-) That's why I've filled the FPGA with cores.

OMG I had a good laugh XD I will try to push stuff as a buildroot package ^^

My compiler was cross-compiled using the Buildroot cross-compiler with

That's something i also wanted to try !

rdolbeau commented 3 years ago

@rdolbeau There is what i ordered as components to mouser

Thanks for the BOM. I'm guessing the long 26p header and the 47uF capa are for something else? (they don't appear on the schematic).

The only "down-side" of the actual design is that you have to provide 5V to the pmod via the 4 pins connector.

So does the PS/2 Pmod, not sure many keyboards would work with just 3.3V. Fortunately the Wukong exposes the 5V input on one of the headers.

I never tried external assembly, realy had no idea of the cost.

For very small volume of small product it's not worth it - even naked PCBs.

For this particular Pmod, PCB is 34.9€ for 5 or 10 (!), and 62.26€ for 100... Adding assembly is >260€ for 5 (Seeed has a couple of references missing, so it would really be a bit more), >420€ for 10, and >2000€ for 100. It seems those big 150uF capacitors are expensive, as are the connectors. There's about 65€ of fixed cost, and above a threshold some fees are removed, so mid-volume pricing is more reasonable, but ~25€ per item shipped is probably still too much.
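Putting those quotes on a per-board basis makes the volume effect easier to see. The sketch below only divides the figures quoted above by the quantity; it treats the "with assembly" numbers as totals for the batch (an assumption), they are lower bounds since the quotes were given as ">", and shipping is not included.

```python
# Figures quoted above, in euros, keyed by batch size.
assembled_total_eur = {5: 260.0, 10: 420.0, 100: 2000.0}   # lower bounds
pcb_only_total_eur = {5: 34.9, 10: 34.9, 100: 62.26}

def per_unit(total_by_qty):
    """Divide each batch total by its quantity, rounded to cents."""
    return {qty: round(total / qty, 2) for qty, total in total_by_qty.items()}

assembled = per_unit(assembled_total_eur)
bare = per_unit(pcb_only_total_eur)

for qty in sorted(assembled_total_eur):
    print(f"{qty:>3} boards: at least {assembled[qty]:.2f} EUR each assembled, "
          f"{bare[qty]:.2f} EUR each as bare PCB")
```

That gives at least 52, 42 and 20 euros per assembled board for 5, 10 and 100 boards respectively, which is consistent with the ~25€ per item shipped estimate at the 100-board level.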

For a very small volume, you need to assemble it yourself. Or you need to make enough that you can amortize the fixed costs. Guess I'll stick with PS/2 for now - unless you manage to convince Digilent to put the Pmod in production :-)

Dolu1990 commented 3 years ago

Thanks for the BOM. I'm guessing the long 26p header and the 47uF capa are for something else? (they don't appear on the schematic).

Hoo right, that was some provision, and the 47 uF was a mistake XD You just need something to solder on the little 4-pin, 2.54 mm pitch connector

Fortunately the Wukong expose the 5V input on one of the headers.

Same for ArtyA7. That's not ideal, but it was either this or adding some step-up chip. I looked for some affordable ones, but they were all in some shitty package to solder, except one which was out of stock until 2022 XD. So I just gave up on that.

not sure many keyboards would work with just 3.3V.

I guess it won't work.

For a very small volume, you need to assemble it yourself. Or you need to make enough that you can amortize the fixed costs. unless you manage to convince Digilent to put the Pmod in production :-)

Right XD

SanadaShinken commented 3 years ago

@SanadaShinken

would you like to share the rootfs

i can do that too, let me just a few days to stabilise things ^^

Hi, @Dolu1990 :

Many thanks. I'll try what you described for building WindowMaker. If I get a workable flow, I'll share it.

BR, Sanada

Dolu1990 commented 3 years ago

@SanadaShinken There it is : https://drive.google.com/file/d/1Ujr5UWIy7ArFIWtd7HkI6QyaRoG1Z2Ki/view?usp=sharing

Generated from : https://github.com/SpinalHDL/buildroot-spinal-saxon/blob/usb/configs/saxon_arty_a7_35_defconfig