riscvarchive / riscv-qemu

QEMU with RISC-V (RV64G, RV32G) Emulation Support

WFI doesn't return when an IPI is issued #132

Open atishp04 opened 6 years ago

atishp04 commented 6 years ago

According to the spec, WFI can be used to put the CPU into a low-power mode. Any enabled pending interrupt causes WFI to return, and the CPU resumes execution after it. Currently, a Linux kernel running under QEMU doesn't return from WFI when an IPI (software interrupt) is issued.

This happens for the following reason.

riscv_set_local_interrupt() raises the CPU interrupt only when mip transitions from zero to nonzero, and lowers it only when mip transitions from nonzero to zero. It may happen that both the old and the new mip values are nonzero, for example when a software interrupt is issued from Supervisor mode while another (masked) interrupt bit is already pending; in that case neither branch fires and the CPU never wakes from WFI.
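
A minimal standalone model of the old logic makes the missed edge easy to see (illustrative only, not QEMU code; the mip bit positions follow the privileged spec):

    #include <stdio.h>
    #include <stdint.h>

    #define MIP_SSIP (1u << 1)  /* supervisor software interrupt pending */
    #define MIP_MTIP (1u << 7)  /* machine timer interrupt pending */

    static uint32_t mip;

    /* Same structure as the old riscv_set_local_interrupt() */
    static void set_local_interrupt(uint32_t mask, int value)
    {
        uint32_t old_mip = mip;
        mip = (old_mip & ~mask) | (value ? mask : 0);

        if (mip && !old_mip) {
            printf("cpu_interrupt: hart wakes from WFI\n");
        } else if (!mip && old_mip) {
            printf("cpu_reset_interrupt\n");
        } else {
            printf("edge missed: old_mip=0x%x mip=0x%x\n",
                   (unsigned)old_mip, (unsigned)mip);
        }
    }

    int main(void)
    {
        mip = MIP_MTIP;                   /* a (masked) timer interrupt is pending */
        set_local_interrupt(MIP_SSIP, 1); /* IPI: old and new mip are both nonzero,
                                             so the sleeping hart is never woken */
        return 0;
    }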

The following temporary fix, suggested by @michaeljclark, seems to resolve the issue for now. This may or may not be the final fix.

--- a/target/riscv/op_helper.c
+++ b/target/riscv/op_helper.c
@@ -580,9 +580,9 @@ void riscv_set_local_interrupt(RISCVCPU *cpu, target_ulong mask, int value)
     target_ulong old_mip = cpu->env.mip;
     cpu->env.mip = (old_mip & ~mask) | (value ? mask : 0);

-    if (cpu->env.mip && !old_mip) {
+    if (cpu_has_work(CPU(cpu))) {
         cpu_interrupt(CPU(cpu), CPU_INTERRUPT_HARD);
-    } else if (!cpu->env.mip && old_mip) {
+    } else {
         cpu_reset_interrupt(CPU(cpu), CPU_INTERRUPT_HARD);
     }
 }

This is required as part of the CPU hotplug operation. During a CPU offline operation, WFI is issued on that CPU to put it into low-power mode. During the online operation, an IPI is issued to that CPU so that it returns from WFI and resumes execution.
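
On the guest side this boils down to an idle loop like the following (a sketch, not the actual kernel code; cpu_should_wake() is a hypothetical stand-in for the kernel's wake condition):

    /* Sketch of the offline idle loop: with only SSIE enabled, the hart
     * sleeps in WFI until the onlining IPI arrives. cpu_should_wake() is
     * a hypothetical placeholder, not a real kernel function. */
    extern int cpu_should_wake(void);

    static void offline_idle(void)
    {
        while (!cpu_should_wake()) {
            __asm__ volatile ("wfi");
        }
    }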

Here is the series that enables this feature. http://lists.infradead.org/pipermail/linux-riscv/2018-April/000509.html

Or the GitHub link: https://github.com/atishp04/riscv-linux/commits/cpu_hotplug_v2_working

@sorear

sorear commented 6 years ago

Using cpu_has_work from another thread doesn't seem safe.

michaeljclark commented 6 years ago

I tested the fix that I proposed on IRC, and I agree that it's not safe to reference mie from another thread. After a few hours under load I started seeing RCU stall messages.

Currently, we need to raise the interrupt even if it is masked, unless the mie CSR write path is also updated to raise and lower interrupts (atomically), i.e. when the mie interrupt-enable mask changes.
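
Handling it on the mie side would mean re-evaluating the interrupt line whenever mie is written, roughly like this (an illustrative sketch, not actual QEMU code; the helper name is made up):

    /* Illustrative sketch: re-check pending-and-enabled whenever mie changes,
     * so that unmasking an already-pending interrupt wakes a sleeping hart. */
    static void write_mie(CPURISCVState *env, target_ulong new_mie)
    {
        CPUState *cs = CPU(riscv_env_get_cpu(env));

        env->mie = new_mie;
        if (env->mip & env->mie) {
            cpu_interrupt(cs, CPU_INTERRUPT_HARD);
        } else {
            cpu_reset_interrupt(cs, CPU_INTERRUPT_HARD);
        }
    }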

@atishp04 do you have mie.SSIE enabled on the sleeping thread? There could be a case of a missed edge, i.e. cpu_interrupt(CPU(cpu), CPU_INTERRUPT_HARD) is called, but cpu_has_work returns false and the thread goes back to sleep.

IPIs must be working in general, so this is puzzling.

atishp04 commented 6 years ago

@michaeljclark: do you mean the sleeping thread from the kernel's perspective or from QEMU's?

The CPU that is about to go offline disables all interrupts except the software interrupt (SSIE stays enabled) and executes WFI. An IPI is issued from another CPU during the online operation. I believe QEMU runs every CPU in a separate thread and puts the thread to sleep on a WFI instruction. If that's the case, mie.SSIE would be enabled on the sleeping thread.

michaeljclark commented 6 years ago

You could consider trying something like this:

    /* new_mip here is the just-updated value of cpu->env.mip */
    if (new_mip) {
        cpu_interrupt(CPU(cpu), CPU_INTERRUPT_HARD);
    } else {
        cpu_reset_interrupt(CPU(cpu), CPU_INTERRUPT_HARD);
    }

I don't have time to test right now because it takes approximately 24 hours under heavy parallel load (the stalls show up after at least 2 hours) to make sure we don't get an rcu_sched self-detected stall on CPU. As @sorear mentioned, accessing env->mie from another thread is not safe.

I've actually updated the code a little to make updates truly atomic, while being careful to maintain the same logic as before (besides the tighter atomicity). The prior code made two calls to riscv_set_local_interrupt. I've updated the interface to take a mask, similar to cpu_interrupt, but with set and clear achieved by providing a value and a mask.
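
A sketch of what such a mask-based interface could look like (illustrative only, not the actual patch; the function name is made up):

    /* Illustrative sketch: update the bits of mip selected by 'mask' to the
     * corresponding bits of 'value' in one place, then raise or lower the
     * hard interrupt based on the new value, returning the old mip. */
    target_ulong riscv_cpu_update_mip(RISCVCPU *cpu, target_ulong mask,
                                      target_ulong value)
    {
        CPURISCVState *env = &cpu->env;
        target_ulong old_mip = env->mip;

        env->mip = (old_mip & ~mask) | (value & mask);

        if (env->mip) {
            cpu_interrupt(CPU(cpu), CPU_INTERRUPT_HARD);
        } else {
            cpu_reset_interrupt(CPU(cpu), CPU_INTERRUPT_HARD);
        }
        return old_mip;
    }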

The changes to the CSR code were required because SiFive has work that requires atomic CSRs. I'm working on a branch that I will be submitting to qemu-devel for review. The new CSR implementation allows model-specific CSRs which can be truly atomic; the prior mechanism could return CSR values that were different to the ones it updated, which is particularly important for CSRs that set or clear bits.

atishp04 commented 6 years ago

I have a spare machine and I can leave it running for days. @michaeljclark, can you please point me to the parallel benchmark that triggered the RCU stalls?

michaeljclark commented 6 years ago

I was able to trigger the rcu_sched self-detected stall on CPU during a gcc bootstrap on a 4-CPU Linux SMP instance running the Fedora image. The Fedora image has enough tools to perform a gcc bootstrap. You can download the Fedora image stage4-disk.img.xz and bbl here.

I used a point-to-point bridge with a private IP and masquerading (SNAT) so that the Fedora image (192.168.0.2) can reach the Internet via a bridge (192.168.0.1) on the host. This is required so that dnf can be used to install the necessary dependencies.

On the x86_64 Linux host (4 vCPU c5.xlarge AWS instance):

$ cat setup-bridge.sh 
sudo ip link add br0 type bridge
sudo ip address add 192.168.0.1/24 broadcast + dev br0
sudo ip link set br0 up
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -F
sudo iptables -F -t nat
sudo iptables -t nat -A POSTROUTING -o ens5 -j MASQUERADE
sudo iptables -A FORWARD -i ens5 -o br0 -m state \
   --state RELATED,ESTABLISHED -j ACCEPT
sudo iptables -A FORWARD -i br0 -o ens5 -j ACCEPT

$ cat start-qemu.sh 
sudo ./riscv-qemu/riscv64-softmmu/qemu-system-riscv64 \
    -m 4096M -smp cpus=4 -nographic -machine virt \
    -kernel bbl -append "root=/dev/vda ro console=ttyS0" \
    -drive file=stage4-disk.img,format=raw,id=hd0 \
    -device virtio-blk-device,drive=hd0 \
    -netdev type=tap,script=./ifup,downscript=./ifdown,id=net0 \
    -device virtio-net-device,netdev=net0

$ cat ifup 
#!/bin/sh
brctl addif br0 $1
ifconfig $1 up

$ cat ifdown 
#!/bin/sh
ifconfig $1 down
brctl delif br0 $1

$ cat /etc/hosts
127.0.0.1 localhost
192.168.0.2 qemu

Expand the Fedora image disk before using it, as the default 4GB is too small for a gcc bootstrap:

$ xz --decompress --keep stage4-disk.img.xz
# append 16GB as the 4GB image is a wee bit small
$ dd if=/dev/zero bs=1M count=16384 >> stage4-disk.img

In the Fedora RISC-V guest:

# use the extra space we added to the image
$ resize2fs /dev/vda

$ cat /etc/sysconfig/network
NETWORKING=yes
GATEWAY=192.168.0.1

$ cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.0.2
NETMASK=255.255.255.0
ONBOOT=yes

$ sudo chkconfig network on
$ sudo systemctl disable systemd-networkd-wait-online.service

# other Fedora futzing such as reboots to check networking comes up
# sudo useradd -m myuser
# add ssh authorized keys to myuser
# add myuser to `wheel` in `/etc/group` so that it can use `sudo`
# etc...

$ ssh myuser@qemu
$ sudo dnf install bc bison curl expat-devel expect flex gawk gmp-devel libtool libmpc-devel mpfr-devel texinfo zlib-devel ntpdate
$ curl -O https://ftp.gnu.org/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz
$ tar xJf gcc-7.3.0.tar.xz
$ cd gcc-7.3.0
$ ./configure --enable-bootstrap --disable-multilib # there is an RV32I multilib build failure upstream so I disabled multilib
$ make -j4

I kept the QEMU instance running in screen and checked the console for kernel messages.

The bootstrap took 22 hours. When I applied the patch I sent you via IRC, I saw an RCU stall message after 2-4 hours. I think if we take the mie masking out and try:

    if (new_mip) {
        cpu_interrupt(CPU(cpu), CPU_INTERRUPT_HARD);
    } else {
        cpu_reset_interrupt(CPU(cpu), CPU_INTERRUPT_HARD);
    }

we might have better luck... Let me know... If I get time I can try it at some point. I also measured idle CPU before and after applying patches, etc., and did some other performance testing...

atishp04 commented 6 years ago

I got an OOM:

    Out of memory: Kill process 32539 (ld) score 317 or sacrifice child
    Killed process 32539 (ld) total-vm:652140kB, anon-rss:648564kB, file-rss:56kB, shmem-rss:0kB
    oom_reaper: reaped process 32539 (ld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    collect2: fatal error: ld terminated with signal 9 [Killed]
    compilation terminated.
    make[3]: [../.././gcc/c/Make-lang.in:85: cc1] Error 1
    make[3]: Waiting for unfinished jobs....
    rm gcc.pod
    make[3]: Leaving directory '/root/gcc-7.3.0/host-riscv64-unknown-linux-gnu/gcc'
    make[2]: [Makefile:4540: all-stage1-gcc] Error 2
    make[2]: Leaving directory '/root/gcc-7.3.0'
    make[1]: [Makefile:22057: stage1-bubble] Error 2
    make[1]: Leaving directory '/root/gcc-7.3.0'
    make: *** [Makefile:929: all] Error 2

Trying with a higher memory size (16G) for QEMU.

atishp04 commented 6 years ago

@michaeljclark: the fix seems to work. The GCC bootstrap completed successfully without any RCU stalls.

Let me know if you want me to run any other workload for verification.

atishp04 commented 6 years ago

@michaeljclark Just wanted to check whether you need more testing or whether the patch can be merged.