ucb-bar / chipyard

An Agile RISC-V SoC Design Framework with in-order cores, out-of-order cores, accelerators, and more
https://chipyard.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Cannot boot linux with RocketChip+Vector Config #2103

Open franktaTian opened 4 days ago

franktaTian commented 4 days ago

Background Work

Chipyard Version and Hash

Release: 1.13.0 Hash: 86ec78

OS Setup

Linux i7700 5.4.0-198-generic #218-Ubuntu SMP Fri Sep 27 20:18:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
LSB Version: core-11.1.0ubuntu2-noarch:printing-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal


Current Behavior

I added "new saturn.rocket.WithRocketVectorUnit(256, 64, VectorParams.refParams) ++" to FireSimRocketConfig, then built the bitstream and the Linux image (using the FireMarshal that ships with this Chipyard version), following the FireSim guide. When I try to boot the Linux kernel, it panics. When I revert FireSimRocketConfig, everything works fine with the same Linux kernel.

Expected Behavior

Boot Linux correctly with rocket vector added.

Other Information

`[ 26.138205] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 26.159213] Oops [#1]
[ 26.164645] Modules linked in:
[ 26.171809] CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-00004-g67bc4513761f-dirty #32
[ 26.190212] Hardware name: ucb-bar,chipyard (DT)
[ 26.200331] Workqueue: events_unbound async_run_entry_fn
[ 26.212504] epc : 0x0
[ 26.217911] ra : __vm_enough_memory+0x2e/0x136
[ 26.228327] epc : 0000000000000000 ra : ffffffff801512b6 sp : ffffffc8001a38e0
[ 26.243939] gp : ffffffff852f26f8 tp : ffffffd880186c00 t0 : ffffffff84d6cd48
[ 26.259554] t1 : 0000000000000001 t2 : 0000000000000000 s0 : ffffffc8001a3920
[ 26.275144] s1 : 0000000000000001 a0 : ffffffff8532ac40 a1 : 0000000000000001
[ 26.290730] a2 : 000000000007b39f a3 : ffffffff85212b70 a4 : 8000000000000000
[ 26.306341] a5 : ffffffff85212b70 a6 : 0000000000000000 a7 : ffffffff85290c78
[ 26.321937] s2 : 0000000000000000 s3 : 0000000000000001 s4 : 0000000000000000
[ 26.337521] s5 : ffffffff852f22bc s6 : 0000000000000000 s7 : 0000000000000000
[ 26.353104] s8 : ffffffffffffffff s9 : 0000000000000003 s10: 0000000000000000
[ 26.368667] s11: 0000000000000fff t3 : ffffffffffffffff t4 : ffffffffffffffff
[ 26.384284] t5 : ffffffffffffffff t6 : 000000000000ffff
[ 26.395815] status: 0000000200000120 badaddr: 0000000000000000 cause: 000000000000000c
[ 26.412959] Code: Unable to access instruction at 0xffffffffffffffec.
[ 26.428124] ---[ end trace 0000000000000000 ]---`

franktaTian commented 4 days ago

I think the new load/store mechanism introduced with the vector unit causes this problem. Have you ever booted Linux successfully with a configuration that includes the Saturn vector unit?

jerryz123 commented 4 days ago

I will investigate. This worked on an FPGA prototype, but FireSim likely exposed some other bug.

jerryz123 commented 4 days ago

I struggle to see how the kernel panic report would indicate any problem due to vectors... it reports a fetch page fault, and the vector support made no modifications to the frontend. Additionally, there is no vector code in the kernel by default, so it seems unlikely that errant vector instructions would have corrupted something.

I will attempt to reproduce
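As a reading aid for the reports in this thread: the `cause:` field printed by the kernel is the RISC-V `scause` value, and `0xc` is the standard code for an instruction (fetch) page fault. The sketch below decodes it with a hand-written subset of the exception codes from the privileged specification. The function and table names are illustrative, not part of any Chipyard or kernel tooling.

```python
# Subset of the standard RISC-V synchronous exception codes
# (scause/mcause with the interrupt bit clear).
EXCEPTION_CAUSES = {
    0x0: "instruction address misaligned",
    0x1: "instruction access fault",
    0x2: "illegal instruction",
    0x3: "breakpoint",
    0x4: "load address misaligned",
    0x5: "load access fault",
    0x6: "store/AMO address misaligned",
    0x7: "store/AMO access fault",
    0x8: "environment call from U-mode",
    0x9: "environment call from S-mode",
    0xC: "instruction page fault",
    0xD: "load page fault",
    0xF: "store/AMO page fault",
}


def decode_cause(cause_hex, xlen=64):
    """Decode a hex 'cause:' field as printed in a RISC-V kernel oops."""
    value = int(cause_hex, 16)
    # The top bit of the register distinguishes interrupts from exceptions.
    is_interrupt = bool(value >> (xlen - 1))
    code = value & ((1 << (xlen - 1)) - 1)
    if is_interrupt:
        return f"interrupt {code}"
    return EXCEPTION_CAUSES.get(code, f"reserved/unknown exception {code}")


# Values copied from the traces in this issue.
print(decode_cause("000000000000000c"))  # boot-time oops
print(decode_cause("0000000000000002"))  # later poweroff oops
```

The first call corresponds to the fetch page fault discussed above; the second corresponds to the "Oops - illegal instruction" trace reported later in the thread.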

franktaTian commented 3 days ago

Yes, it is strange. I also know there is no vector code in the kernel by default.

franktaTian commented 2 days ago

Hi, I reverted the modification in TargetConfigs.scala and instead generated Rocket+Vector by adding the following recipe to build_recipes.yaml:

    alveo_u250_firesim_rocket_singlecore_vector_no_nic:
        PLATFORM: xilinx_alveo_u250
        TARGET_PROJECT: firesim
        TARGET_PROJECT_MAKEFRAG: ../../generators/firechip/chip/src/main/makefrag/firesim
        DESIGN: FireSim
        TARGET_CONFIG: WithDefaultFireSimBridges_WithFireSimConfigTweaks_chipyard.REFV256D128RocketConfig
        PLATFORM_CONFIG: BaseXilinxAlveoU250Config
        deploy_quintuplet: null
        platform_config_args:
            fpga_frequency: 60
            build_strategy: TIMING
        post_build_hook: null
        metasim_customruntimeconfig: null
        bit_builder_recipe: bit-builder-recipes/xilinx_alveo_u250.yaml

Everything works fine: the Linux kernel boots without a panic.

franktaTian commented 2 days ago

But when I add another recipe as follows:

    alveo_u250_firesim_rocket_singlecore_vector_clock_crossing_no_nic:
        PLATFORM: xilinx_alveo_u250
        TARGET_PROJECT: firesim
        TARGET_PROJECT_MAKEFRAG: ../../generators/firechip/chip/src/main/makefrag/firesim
        DESIGN: FireSim
        TARGET_CONFIG: WithDefaultFireSimBridges_WithFireSimTestChipConfigTweaks_chipyard.REFV256D128RocketConfig
        PLATFORM_CONFIG: BaseXilinxAlveoU250Config
        deploy_quintuplet: null
        platform_config_args:
            fpga_frequency: 60
            build_strategy: TIMING
        post_build_hook: null
        metasim_customruntimeconfig: null
        bit_builder_recipe: bit-builder-recipes/xilinx_alveo_u250.yaml

and build the bitstream successfully, the same Linux image reports a fault during boot, but still reaches the login prompt:

`running /etc/init.d/S10mdev
Starting mdev: OK
[ 0.830316] find[81]: unhandled signal 11 code 0x1 at 0xffffffff80060004
[ 0.830362] CPU: 0 PID: 81 Comm: find Tainted: G O 6.6.0-00004-g67bc4513761f #2
[ 0.830386] Hardware name: ucb-bar,chipyard (DT)
[ 0.830400] epc : ffffffff80060004 ra : 00000000000bf26c sp : 0000003fd9c545f0
[ 0.830506] gp : 00000000001be3f8 tp : 00000000001c5760 t0 : 0000000000000002
[ 0.830732] t1 : 62616c732f6c656e t2 : 00000000001de6a0 s0 : 0000003fd9c549b0
[ 0.830958] s1 : 0000000000000001 a0 : 0000000000000000 a1 : 00000000001de680
[ 0.831184] a2 : 0000003fd9c545f0 a3 : 0000000000000100 a4 : 0000000000000000
[ 0.831410] a5 : fffffffffffff000 a6 : 62616c732f6c656e a7 : 000000000000004f
[ 0.831636] s2 : 00000000001de680 s3 : 0000003fd9c545f0 s4 : 0000003fadac8010
[ 0.831862] s5 : 0000000000000001 s6 : 00000000001de680 s7 : 0000000000010248
[ 0.832088] s8 : 0000002ae58185c0 s9 : 0000002ae5821cd0 s10: 0000002ae5824460
[ 0.832314] s11: 0000002ae5809bc8 t3 : 2f2f2f2f2f2f2f2f t4 : 0000003fd9c54630
[ 0.832540] t5 : 0000000000000001 t6 : 0000000000000000
[ 0.832706] status: 8000000200006020 badaddr: ffffffff80060004 cause: 000000000000000c
running /etc/init.d/S40network
Starting network: OK
running /etc/init.d/S99run
running /etc/init.d/S40network
Starting network: OK
running /etc/init.d/S99run
launching firemarshal workload run/command
firemarshal workload run/command done
Welcome to Buildroot
buildroot login: root
# cat /proc/cpuinfo
processor : 0
hart : 0
isa : rv64imafdcbv_zicntr_zicsr_zifencei_zihpm_zba_zbb_zbs
mmu : sv39
uarch : sifive,rocket0
mvendorid : 0x0
marchid : 0x1
mimpid : 0x20181004
#`

Any help?

jerryz123 commented 2 days ago

Thanks for investigating. This points at a bug in the multi-clock handling (The difference between TestChipConfigTweaks and ConfigTweaks is that the "test chip" variant adds CDCs and simulates multi-clock in firesim).

I suspect just a base Rocket with multi-clock will also fail. I can investigate this specifically.
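To make the comparison concrete: the two recipes posted above are identical except for one fragment of `TARGET_CONFIG`. The sketch below assumes the usual FireSim convention that `TARGET_CONFIG` is an underscore-separated list of config class names. The helper names are illustrative and not part of FireSim.

```python
def config_fragments(target_config):
    """Split a TARGET_CONFIG string into its underscore-separated fragments."""
    return target_config.split("_")


def differing_fragments(a, b):
    """Return (a_fragment, b_fragment) pairs that differ at the same position.

    Assumes both strings have the same number of fragments; zip() would
    silently ignore any extra trailing fragments otherwise.
    """
    return [(x, y) for x, y in zip(config_fragments(a), config_fragments(b)) if x != y]


# TARGET_CONFIG values copied from the two recipes in this thread.
working = "WithDefaultFireSimBridges_WithFireSimConfigTweaks_chipyard.REFV256D128RocketConfig"
failing = "WithDefaultFireSimBridges_WithFireSimTestChipConfigTweaks_chipyard.REFV256D128RocketConfig"

print(differing_fragments(working, failing))
```

The only differing pair is `WithFireSimConfigTweaks` versus `WithFireSimTestChipConfigTweaks`, which matches the explanation that the "test chip" variant is what introduces the CDCs.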

jerryz123 commented 2 days ago

It looks like the default rational crossing direction for Rocket's Rational CDCs did not match the clocking configuration in TestChipConfigTweaks.

This PR changes the default Rocket RationalCrossing to support both fast-to-slow and slow-to-fast directions: https://github.com/chipsalliance/rocket-chip/pull/3693

Alternatively, you can change WithTestChipConfigTweaks to add async CDCs to the RocketTiles.

franktaTian commented 1 day ago

OK, I will try it.

franktaTian commented 1 day ago

Yes, everything works now. I modified Configs.scala as in chipsalliance/rocket-chip#3693 and generated the bitstream again; the Linux kernel boots without a panic. I also built another version that adds async CDCs to the RocketTiles. With it the Linux kernel also boots without a panic, but it panics when I run poweroff:

`# poweroff
Stopping network: OK
Stopping mdev: stopped process in pidfile '/var/run/mdev.pid' (pid 80) OK
Stopping klogd: [ 68.403438] Oops - illegal instruction [#1]
[ 68.403454] Modules linked in: iceblk(O) icenet(O)
[ 68.403476] CPU: 0 PID: 123 Comm: rm Tainted: G O 6.6.0-00004-g67bc4513761f #2
[ 68.403486] Hardware name: ucb-bar,chipyard (DT)
[ 68.403492] epc : do_raw_spin_unlock+0x88/0x11e
[ 68.403510] ra : handle_page_fault+0x128/0x390
[ 68.403532] epc : ffffffff80060004 ra : ffffffff8000a194 sp : ffffffc800573e70
[ 68.403540] gp : ffffffff812f26f8 tp : ffffffd8814f3c00 t0 : 0000000000000040
[ 68.403548] t1 : 0000000000001000 t2 : 0000020000000000 s0 : ffffffc800573ec0
[ 68.403554] s1 : ffffffc800573ee0 a0 : 0000000000000400 a1 : 0000000000000000
[ 68.403562] a2 : ffffffd8814f3c01 a3 : 0000000000000000 a4 : ffffffd8814f4c00
[ 68.403568] a5 : 0000000000000400 a6 : 0000000000000402 a7 : 0000000000000406
[ 68.403574] s2 : 000000000000000d s3 : 0000000000000001 s4 : 0000002ad0fd3b68
[ 68.403582] s5 : ffffffd8814f3c00 s6 : ffffffd880ddcb80 s7 : 0000000000000254
[ 68.403588] s8 : 000000000000000d s9 : 0000000000000076 s10: 0000000000000003
[ 68.403596] s11: 0000002ac481ebc8 t3 : 0000000000000000 t4 : 000000000000003f
[ 68.403602] t5 : ffffffff81213058 t6 : ffffffff81213078
[ 68.403608] status: 0000000200000120 badaddr: 00000000000f0007 cause: 0000000000000002
[ 68.403616] [] do_raw_spin_unlock+0x88/0x11e
[ 68.403630] [] do_page_fault+0x1e/0x36
[ 68.403642] [] ret_from_exception+0x0/0x64
[ 68.403666] Code: 17c2 93c1 9023 00f4 60e2 6442 64a2 6105 8082 9123 (0007) 000f
[ 68.403672] ---[ end trace 0000000000000000 ]---
[ 68.403678] Kernel panic - not syncing: Fatal exception in interrupt
[ 68.407748] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---`

For the async version, I just replaced "WithRationalCDCs" with "WithAsynchronousCDCs(depth=8, sync=3)".

Anyway, we now have a working version with Rocket+Vector and (rational) clock crossings. I think clock crossings are a must in the ASIC version of the design, although in FireSim all clock inputs are driven from one host clock.
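A note on reading the poweroff trace above: `cause: 2` is an illegal-instruction exception, and `badaddr: 00000000000f0007` lines up with the `(0007) 000f` halfwords in the `Code:` line, which suggests the trap value holds the offending instruction bits. Under that assumption, the sketch below pulls out the basic fields of the 32-bit word. The opcode table is a hand-written subset of the standard RISC-V major opcodes, and the function name is illustrative.

```python
# Subset of the RISC-V major opcode map (low 7 bits of a 32-bit instruction).
MAJOR_OPCODES = {
    0x03: "LOAD",
    0x07: "LOAD-FP",
    0x23: "STORE",
    0x27: "STORE-FP",
    0x33: "OP",
    0x57: "OP-V",
    0x63: "BRANCH",
    0x73: "SYSTEM",
}


def decode_fields(insn):
    """Extract the fixed-position fields of a 32-bit RISC-V instruction word."""
    opcode = insn & 0x7F
    return {
        "opcode": opcode,
        "major": MAJOR_OPCODES.get(opcode, "other"),
        "rd": (insn >> 7) & 0x1F,
        "funct3": (insn >> 12) & 0x7,  # the 'width' field for loads
        "rs1": (insn >> 15) & 0x1F,
    }


# badaddr value from the poweroff oops, assumed to be the instruction bits.
fields = decode_fields(0x000F0007)
print(fields)
```

The word decodes to major opcode LOAD-FP with a width field of 0. In the RISC-V vector extension that encoding space is used by vector loads rather than scalar floating-point loads. This only describes the bits the core trapped on; it does not by itself explain how they came to be fetched at that address.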