Closed nmeum closed 5 years ago
I think you are correct about the race condition. One quick solution to fix this is to have a global flag indicating that cold boot is done. Any hart entering warm boot should just return without entering to wfi loop.
First of all thanks for catching this issue and reporting it.
The described race condition does exist in current sbi_hart_wait_for_coldboot() and sbi_hart_wake_coldboot_harts() implementation. Although, it is very unlikely that we will hit this race condition on real HW but on emulators/simulators we can certainly hit this race conditions because for emulators/simulator all HARTs are threads of a process under a UNIX operating system and if secondary HART threads don't get enough chance to run then boot HART (or primary HART) will call sbi_hart_wake_coldboot_harts() before secondary HARTs call sbi_hart_wait_for_coldboot()
A global flag for coldboot done protected by coldboot lock will certainly help but we need re-arrange the way coldboot lock is acquired/released in sbi_hart_wake_coldboot_harts() and sbi_hart_wait_for_coldboot().
I will send a patch to fix this but all of us will have to try it separately as well.
I have send patches to OpenSBI mailing list for this issue.
You can find patches in fixes_v1 branch at: https://github.com/avpatel/opensbi.git
I believe this issue is fixed now.
Can we close this ?
Sure, thanks for fixing it!
I guess this is probably more of question than a bug report so I hope you don't mind me opening an issue for this. While running OpenSBI with a custom riscv simulator I noticed that some of my simulated harts are stuck in
sbi_hart_wait_for_coldboot
. After briefly reading the OpenSBI code I wonder if this is due to a race condition.If I understand the terminology correctly OpenSBI differentiates between cold- and warmboot. Where coldboot means "full init" and is only performed once and warmboot means "little init".
https://github.com/riscv/opensbi/blob/a2a7763ac7651e64c3928ba0a13be9317bd48c4d/lib/sbi/sbi_init.c#L179-L182
If the current hart is booted in coldboot it invokes
sbi_hart_wake_coldboot_harts
. If it is booted in warmboot it invokessbi_hart_wait_for_coldboot
.The
sbi_hart_wait_for_coldboot
function periodically checks if the MIP register of the current hart has been set. Thesbi_hart_wake_coldboot_harts
sets the MIP register of all harts except the current one, but only if the hart has been marked as waiting in a bitmap (coldboot_wait_bitmap
) beforehand. This bitmap is set fromsbi_hart_wait_for_coldboot
.https://github.com/riscv/opensbi/blob/3f738f5897a6694b8630d3a9c6751f49c3c7d540/lib/sbi/sbi_hart.c#L355
Since coldboot is only performed once
sbi_hart_wake_coldboot_harts
is only called once. Now: If my outlined observations are correct an implicit assumption made here is that all harts booted in warmboot reached thesbi_hart_wait_for_coldboot
function (and modified the bitmap) before thesbi_hart_wake_coldboot_harts
function is invoked. If this assumption doesn't hold some harts will never return from thesbi_hart_wait_for_coldboot
function.This is assumption is what I consider to be a race condition as it doesn't not necessarily hold, right? The problem in this regard being the bitmap which has been introduced in 6f02b6938f8e607d38ee8de5a59581d554af0d3f if I remove the bitmap I don't run into this problem. However, as I said I encountered this with a custom riscv simulator and it might as well be a bug in the emulator.