riscv-software-src / opensbi

RISC-V Open Source Supervisor Binary Interface
Other
1.04k stars 514 forks source link

Race condition in sbi_hart_wake_coldboot_harts? #136

Closed nmeum closed 5 years ago

nmeum commented 5 years ago

I guess this is probably more of question than a bug report so I hope you don't mind me opening an issue for this. While running OpenSBI with a custom riscv simulator I noticed that some of my simulated harts are stuck in sbi_hart_wait_for_coldboot. After briefly reading the OpenSBI code I wonder if this is due to a race condition.

If I understand the terminology correctly OpenSBI differentiates between cold- and warmboot. Where coldboot means "full init" and is only performed once and warmboot means "little init".

https://github.com/riscv/opensbi/blob/a2a7763ac7651e64c3928ba0a13be9317bd48c4d/lib/sbi/sbi_init.c#L179-L182

If the current hart is booted in coldboot it invokes sbi_hart_wake_coldboot_harts. If it is booted in warmboot it invokes sbi_hart_wait_for_coldboot.

The sbi_hart_wait_for_coldboot function periodically checks if the MIP register of the current hart has been set. The sbi_hart_wake_coldboot_harts sets the MIP register of all harts except the current one, but only if the hart has been marked as waiting in a bitmap (coldboot_wait_bitmap) beforehand. This bitmap is set from sbi_hart_wait_for_coldboot.

https://github.com/riscv/opensbi/blob/3f738f5897a6694b8630d3a9c6751f49c3c7d540/lib/sbi/sbi_hart.c#L355

Since coldboot is only performed once sbi_hart_wake_coldboot_harts is only called once. Now: If my outlined observations are correct an implicit assumption made here is that all harts booted in warmboot reached the sbi_hart_wait_for_coldboot function (and modified the bitmap) before the sbi_hart_wake_coldboot_harts function is invoked. If this assumption doesn't hold some harts will never return from the sbi_hart_wait_for_coldboot function.

This is assumption is what I consider to be a race condition as it doesn't not necessarily hold, right? The problem in this regard being the bitmap which has been introduced in 6f02b6938f8e607d38ee8de5a59581d554af0d3f if I remove the bitmap I don't run into this problem. However, as I said I encountered this with a custom riscv simulator and it might as well be a bug in the emulator.

atishp04 commented 5 years ago

I think you are correct about the race condition. One quick solution to fix this is to have a global flag indicating that cold boot is done. Any hart entering warm boot should just return without entering to wfi loop.

avpatel commented 5 years ago

First of all thanks for catching this issue and reporting it.

The described race condition does exist in current sbi_hart_wait_for_coldboot() and sbi_hart_wake_coldboot_harts() implementation. Although, it is very unlikely that we will hit this race condition on real HW but on emulators/simulators we can certainly hit this race conditions because for emulators/simulator all HARTs are threads of a process under a UNIX operating system and if secondary HART threads don't get enough chance to run then boot HART (or primary HART) will call sbi_hart_wake_coldboot_harts() before secondary HARTs call sbi_hart_wait_for_coldboot()

A global flag for coldboot done protected by coldboot lock will certainly help but we need re-arrange the way coldboot lock is acquired/released in sbi_hart_wake_coldboot_harts() and sbi_hart_wait_for_coldboot().

I will send a patch to fix this but all of us will have to try it separately as well.

avpatel commented 5 years ago

I have send patches to OpenSBI mailing list for this issue.

You can find patches in fixes_v1 branch at: https://github.com/avpatel/opensbi.git

avpatel commented 5 years ago

I believe this issue is fixed now.

Can we close this ?

nmeum commented 5 years ago

Sure, thanks for fixing it!