Closed lsf37 closed 2 years ago
Instead of "too often", I should say "almost always". I think I have seen it pass once, but some other board had a problem for that test run.
Just confirming that this is indeed related to the build environment. The tests pass fine for the same configuration if the image was built with the old docker images.
So, it looks like something might actually be properly broken in SMP on RISCV: I've upgraded to gcc-11 temporarily, and now we're getting a consistent failure (timeout) on HIFIVE_debug_SMP_gcc_64 (release and verification builds succeed).
We also have SCHED0021 failing on HIFIVE_release_SMP_MCS_gcc_64, not clear if that is related or not.
Here is an example run.
The docker container with gcc 11.1.0 for riscv64 is trustworthysystems/sel4-riscv:latest (sha256:20bf07826ac0a1c81f9a620d21023ff0fe84e1300b03d6f804836da3cfcd1c75). It is pushed to Docker Hub, so you can pull it down, but I haven't updated the docker file repo yet, because this is still in testing and the -riscv images are not otherwise used any more.
I think this issue is because on SMP there is a chance that the kernel thinks the HART ID for HART 1 is actually HART 0 on the hifive because of this line: https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-riscv/crt0.S#L92 which should be CONFIG_FIRST_HART_ID instead of 0, and I think that the comment should refer to a0 instead of a1.
Very nice find!
> I think this issue is because on SMP there is a chance that the kernel thinks the HART ID for HART 1 is actually HART 0 on the hifive because of this line: https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-riscv/crt0.S#L92 which should be CONFIG_FIRST_HART_ID instead of 0, and I think that the comment should refer to a0 instead of a1.
This is a bit wrong. The test failure is caused by the elfloader thinking that one of the cores has hart ID 0, when the valid range of IDs that can run in S-mode is 1-4 inclusive. At some point the ID gets lost:
Platform Name : SiFive Freedom U540
Platform Features : timer,mfdeleg
Platform HART Count : 4
Boot HART ID : 2
Boot HART ISA : rv64imafdcsu
BOOT HART Features : pmp,scounteren,mcounteren
BOOT HART PMP Count : 16
Firmware Base : 0x80000000
Firmware Size : 100 KB
Runtime SBI Version : 0.2
MIDELEG : 0x0000000000000222
MEDELEG : 0x000000000000b109
PMP0 : 0x0000000080000000-0x000000008001ffff (A)
PMP1 : 0x0000000000000000-0x0000007fffffffff (A,R,W,X)
ELF-loader started on (HART 1) (NODES 4)
paddr=[80200000..80625047]
Looking for DTB in CPIO archive...found at 8021dd58.
Loaded DTB from 8021dd58.
paddr=[84022000..84024fff]
ELF-loading image 'kernel' to 84000000
paddr=[84000000..84021fff]
vaddr=[ffffffff84000000..ffffffff84021fff]
virt_entry=ffffffff84000000
ELF-loading image 'sel4test-driver' to 84025000
paddr=[84025000..8444bfff]
vaddr=[10000..436fff]
virt_entry=1c3be
Main entry hart_id:1
Secondary entry hart_id:0 core_id:1
Secondary entry hart_id:4 core_id:2
Hart ID 1 core ID 0
Hart ID 0 core ID 1
Hart ID 4 core ID 2
Secondary entry hart_id:3 core_id:3
Hart ID 3 core ID 3
Enabling MMU and paging
And when it works:
Platform Name : SiFive Freedom U540
Platform Features : timer,mfdeleg
Platform HART Count : 4
Boot HART ID : 1
Boot HART ISA : rv64imafdcsu
BOOT HART Features : pmp,scounteren,mcounteren
BOOT HART PMP Count : 16
Firmware Base : 0x80000000
Firmware Size : 100 KB
Runtime SBI Version : 0.2
MIDELEG : 0x0000000000000222
MEDELEG : 0x000000000000b109
PMP0 : 0x0000000080000000-0x000000008001ffff (A)
PMP1 : 0x0000000000000000-0x0000007fffffffff (A,R,W,X)
ELF-loader started on (HART 1) (NODES 4)
paddr=[80200000..805fb047]
Looking for DTB in CPIO archive...found at 80219e00.
Loaded DTB from 80219e00.
paddr=[8401e000..84020fff]
ELF-loading image 'kernel' to 84000000
paddr=[84000000..8401dfff]
vaddr=[ffffffff84000000..ffffffff8401dfff]
virt_entry=ffffffff84000000
ELF-loading image 'sel4test-driver' to 84021000
paddr=[84021000..84427fff]
vaddr=[10000..416fff]
virt_entry=1bede
Main entry hart_id:1
Hart ID 1 core ID 0
Secondary entry hart_id:4 core_id:3
Secondary entry hart_id:2 core_id:1
Secondary entry hart_id:3 core_id:2
Hart ID 4 core ID 3
Hart ID 2 core ID 1
Hart ID 3 core ID 2
Enabling MMU and paging
This corruption is due to register s0 getting overwritten, likely during the call to clear_bss with no stack pointer set. So when the hart ID of the boot core is then restored to a0 from s0, it has become 0.
https://github.com/seL4/seL4_tools/pull/135 solves this issue, as it sets the stack pointer before clear_bss is called.
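For spotting this kind of corruption in captured boot logs, a small checker could flag any hart ID outside the 1-4 range mentioned above. This is only a hypothetical sketch and not part of the elfloader or the test harness: the function name `find_bad_hart_ids` is made up, and the regex assumes the "Main entry hart_id:N" and "Secondary entry hart_id:N core_id:M" line formats seen in the logs in this thread.

```python
import re

# Assumption: harts that can run in S-mode on the HiFive are 1..4 inclusive,
# as stated earlier in this thread.
VALID_HART_IDS = range(1, 5)

# Matches "Main entry hart_id:1" and "Secondary entry hart_id:0 core_id:1".
ENTRY_RE = re.compile(r"^(Main|Secondary) entry hart_id:(\d+)")


def find_bad_hart_ids(log_text):
    """Return (kind, hart_id) pairs whose hart ID is outside VALID_HART_IDS."""
    bad = []
    for line in log_text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m and int(m.group(2)) not in VALID_HART_IDS:
            bad.append((m.group(1), int(m.group(2))))
    return bad


# Entry lines copied from the failing run above.
failing = """Main entry hart_id:1
Secondary entry hart_id:0 core_id:1
Secondary entry hart_id:4 core_id:2
Secondary entry hart_id:3 core_id:3"""

print(find_bad_hart_ids(failing))  # [('Secondary', 0)]
```

Run against the entry lines of the working log, the same function returns an empty list.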
There are actually more issues with the multicore boot when switching the hart. I've tried to fix them in the last commit in https://github.com/seL4/seL4_tools/pull/132 and will put this on top of https://github.com/seL4/seL4_tools/pull/135, because the fix for the stack setup comes in handy there too.
I moved the changes to resolve early boot issues into their own PR: https://github.com/seL4/seL4_tools/pull/136
This should now be resolved.
Yes, tests on the hifive seem to be running smoothly now again. Thanks for figuring that one out!
The config HIFIVE_verification_SMP_MCS_gcc_64 now seems to fail the test IPC0001 very often by hanging at that test. This may be a result of the new build environment, which downgrades riscv-gcc from version 10 to version 8 (as opposed to the upgrade from 8 to 10 on all other platforms).
The other configurations on hifive pass and this configuration passes on other boards.
For a sample, see https://github.com/seL4/seL4/runs/4494651972?check_suite_focus=true#step:4:12004