seL4 / sel4test

Test suite for seL4.
http://sel4.systems

hifive on MCS+SMP too often fails IPC0001 #64

Closed lsf37 closed 2 years ago

lsf37 commented 2 years ago

The config HIFIVE_verification_SMP_MCS_gcc_64 now seems to fail the test IPC0001 very often, hanging at that test.

This may be a result of the new build environment, which downgrades riscv-gcc from version 10 to version 8 (as opposed to the upgrade from 8 to 10 on all other platforms).

The other configurations on hifive pass, and this configuration passes on other boards.

For a sample, see https://github.com/seL4/seL4/runs/4494651972?check_suite_focus=true#step:4:12004

lsf37 commented 2 years ago

Instead of "too often", I should say "almost always". I think I have seen it pass once, but some other board had a problem for that test run.

lsf37 commented 2 years ago

Just confirming that this is indeed related to the build environment. The tests pass fine for the same configuration if the image was built with the old docker images.

lsf37 commented 2 years ago

So, it looks like something might actually be properly broken in SMP on RISCV: I've upgraded to gcc-11 temporarily, and now we're getting a consistent failure (timeout) on HIFIVE_debug_SMP_gcc_64 (release and verification builds succeed).

We also have SCHED0021 failing on HIFIVE_release_SMP_MCS_gcc_64; it is not clear whether that is related.

Here is an example run.

The docker container with gcc 11.1.0 for riscv64 is trustworthysystems/sel4-riscv:latest (sha256:20bf07826ac0a1c81f9a620d21023ff0fe84e1300b03d6f804836da3cfcd1c75). It is pushed to docker hub, so you can pull it down, but I haven't updated the docker file repo yet, because this is still in testing and the -riscv images are not otherwise used any more.

kent-mcleod commented 2 years ago

I think this issue occurs because on SMP there is a chance that the kernel thinks the HART ID for HART 1 is actually HART 0 on the hifive, because of this line: https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-riscv/crt0.S#L92. That 0 should be CONFIG_FIRST_HART_ID instead, and I think that the comment should refer to a0 instead of a1.
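
Illustratively, the suspected pattern is something like this (a hedged sketch, not the actual crt0.S source; the label name primary_hart is made up):

```asm
/* Sketch only: OpenSBI hands over with the hart ID in a0.
 * Comparing against a literal 0 misidentifies harts on the hifive,
 * where the S-mode hart IDs start at CONFIG_FIRST_HART_ID (1). */
li   t0, CONFIG_FIRST_HART_ID   /* was effectively: li t0, 0 */
beq  a0, t0, primary_hart       /* a0, not a1, holds the hart ID */
```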

lsf37 commented 2 years ago

Very nice find!

kent-mcleod commented 2 years ago

> I think this issue occurs because on SMP there is a chance that the kernel thinks the HART ID for HART 1 is actually HART 0 on the hifive, because of this line: https://github.com/seL4/seL4_tools/blob/master/elfloader-tool/src/arch-riscv/crt0.S#L92. That 0 should be CONFIG_FIRST_HART_ID instead, and I think that the comment should refer to a0 instead of a1.

This is a bit wrong. The cause of the test failure is that the elfloader thinks one of the cores has hart ID 0, even though the valid range of IDs that can run in S-mode is 1-4 inclusive. At some point the ID gets lost:

Platform Name       : SiFive Freedom U540
Platform Features   : timer,mfdeleg
Platform HART Count : 4
Boot HART ID        : 2
Boot HART ISA       : rv64imafdcsu
BOOT HART Features  : pmp,scounteren,mcounteren
BOOT HART PMP Count : 16
Firmware Base       : 0x80000000
Firmware Size       : 100 KB
Runtime SBI Version : 0.2

MIDELEG : 0x0000000000000222
MEDELEG : 0x000000000000b109
PMP0    : 0x0000000080000000-0x000000008001ffff (A)
PMP1    : 0x0000000000000000-0x0000007fffffffff (A,R,W,X)
ELF-loader started on (HART 1) (NODES 4)
  paddr=[80200000..80625047]
Looking for DTB in CPIO archive...found at 8021dd58.
Loaded DTB from 8021dd58.
   paddr=[84022000..84024fff]
ELF-loading image 'kernel' to 84000000
  paddr=[84000000..84021fff]
  vaddr=[ffffffff84000000..ffffffff84021fff]
  virt_entry=ffffffff84000000
ELF-loading image 'sel4test-driver' to 84025000
  paddr=[84025000..8444bfff]
  vaddr=[10000..436fff]
  virt_entry=1c3be
Main entry hart_id:1
Secondary entry hart_id:0 core_id:1
Secondary entry hart_id:4 core_id:2
Hart ID 1 core ID 0
Hart ID 0 core ID 1
Hart ID 4 core ID 2
Secondary entry hart_id:3 core_id:3
Hart ID 3 core ID 3
Enabling MMU and paging

And when it works:

Platform Name       : SiFive Freedom U540
Platform Features   : timer,mfdeleg
Platform HART Count : 4
Boot HART ID        : 1
Boot HART ISA       : rv64imafdcsu
BOOT HART Features  : pmp,scounteren,mcounteren
BOOT HART PMP Count : 16
Firmware Base       : 0x80000000
Firmware Size       : 100 KB
Runtime SBI Version : 0.2

MIDELEG : 0x0000000000000222
MEDELEG : 0x000000000000b109
PMP0    : 0x0000000080000000-0x000000008001ffff (A)
PMP1    : 0x0000000000000000-0x0000007fffffffff (A,R,W,X)
ELF-loader started on (HART 1) (NODES 4)
  paddr=[80200000..805fb047]
Looking for DTB in CPIO archive...found at 80219e00.
Loaded DTB from 80219e00.
   paddr=[8401e000..84020fff]
ELF-loading image 'kernel' to 84000000
  paddr=[84000000..8401dfff]
  vaddr=[ffffffff84000000..ffffffff8401dfff]
  virt_entry=ffffffff84000000
ELF-loading image 'sel4test-driver' to 84021000
  paddr=[84021000..84427fff]
  vaddr=[10000..416fff]
  virt_entry=1bede
Main entry hart_id:1
Hart ID 1 core ID 0
Secondary entry hart_id:4 core_id:3
Secondary entry hart_id:2 core_id:1
Secondary entry hart_id:3 core_id:2
Hart ID 4 core ID 3
Hart ID 2 core ID 1
Hart ID 3 core ID 2
Enabling MMU and paging

This corruption is due to register s0 getting overwritten, likely during the call to clear_bss, which runs with no stack pointer set. So when the hart ID of the boot core is then restored to a0 from s0, it has become 0.
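
In other words, the sequence is roughly this (a hedged sketch of the mechanism, not the literal crt0.S code):

```asm
mv   s0, a0        /* stash hart ID (a0) in callee-saved s0        */
call clear_bss     /* C code may spill/reload s0 via the stack;    */
                   /* with sp not yet set up, the reload is bogus  */
mv   a0, s0        /* a0 comes back as 0, not the real hart ID     */
```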

https://github.com/seL4/seL4_tools/pull/135 solves this issue, as it sets the stack pointer before clear_bss is called.
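
The essential ordering change is this (sketch only; the stack symbol name elfloader_stack_top is hypothetical):

```asm
la   sp, elfloader_stack_top   /* valid stack before any C call    */
mv   s0, a0                    /* hart ID now survives the call    */
call clear_bss                 /* callee-saved spills are safe now */
mv   a0, s0
```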

axel-h commented 2 years ago

There are actually more issues with the multicore boot when switching the hart. I've tried to fix them in the last commit in https://github.com/seL4/seL4_tools/pull/132 and will put this on top of https://github.com/seL4/seL4_tools/pull/135, because the fix for the stack setup comes in handy there as well.

kent-mcleod commented 2 years ago

I moved the changes that resolve the early boot issues into their own PR: https://github.com/seL4/seL4_tools/pull/136

kent-mcleod commented 2 years ago

This should now be resolved.

lsf37 commented 2 years ago

Yes, tests on the hifive seem to be running smoothly again now. Thanks for figuring that one out!