google / syzkaller

syzkaller is an unsupervised coverage-guided kernel fuzzer
Apache License 2.0

Permission denied while trying to build in response to #syz test #3740

Closed: tytso closed this issue 1 year ago

tytso commented 1 year ago

Describe the bug

An attempt to test a patch for a bug resulted in a failure on the build bot:

kernel clean failed: failed to run ["make" "-j" "32" "ARCH=x86_64" "distclean"]: exit status 2
find: './out/bazel/output_user_root/c186719396625f4bf74deeea0ad5a464/server': Permission denied
rm: cannot remove './dist/virtio_balloon.ko': Permission denied
rm: cannot remove './dist/hci_vhci.ko': Permission denied
rm: cannot remove './dist/mt76x2u.ko': Permission denied
...

This looks like some kind of misconfiguration or state problem on the syzkaller bot? The error report from Syzkaller can be found here:

https://groups.google.com/g/syzkaller-bugs/c/d2yGCX40CBs/m/1KYXZR6zAwAJ

With the full set of errors found here:

https://syzkaller.appspot.com/x/error.txt?x=153fa6d4c80000

To Reproduce

Send a test request to syzbot as follows

To: syzbot <syzbot+9d16c39efb5fade84574@syzkaller.appspotmail.com>
Subject: Re: [syzbot] [ext4?] possible deadlock in jbd2_log_wait_commit
Message-ID: <20230308022337.GA860405@mit.edu>

#syz test git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git origin

[ Patch elided but you can find a copy of it here: https://syzkaller.appspot.com/x/patch.diff?x=1091adbcc80000 ]

Expected behavior

It should have built the kernel successfully and then tested the attached patch. If the error message contains things like "permission denied", perhaps the build bot should attempt some kind of automatic recovery --- forcibly restarting the VM, or doing an rm -rf of the scratch space as root (a rough sketch of the idea is below)? Better yet, how did those files end up unremovable in the first place? Perhaps that root cause should be addressed.
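
Something along these lines, purely as a sketch of the idea (I haven't looked at the actual syzbot build scripts; the paths are just the ones that appear in the error output above):

# Hypothetical fallback around the kernel clean step; not the real syzbot code.
if ! make -j "$(nproc)" ARCH=x86_64 distclean; then
    # Some leftover files apparently can't be removed by the build user,
    # so force-clean the scratch directories as root and retry.
    sudo rm -rf ./out ./dist
    make -j "$(nproc)" ARCH=x86_64 distclean
fi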

tytso commented 1 year ago

... and it looks like it's not a fluke. I tried another test, and that triggered another failure:

https://groups.google.com/g/syzkaller-bugs/c/d2yGCX40CBs/m/OHZXKHe0AwAJ

https://groups.google.com/g/syzkaller-bugs/c/d2yGCX40CBs/m/u8rGe4m0AwAJ

tytso commented 1 year ago

By the way, the reason I'm resorting to using the syzbot tester is that the config has a huge number of random subsystems enabled --- does the minimal repro really require BPF, Bluetooth, and nl802154 to be enabled?

And the .config provided from the syzbot dashboard page results in a kernel which isn't useful in combination with qemu --- at least, it's not compatible with kvm-xfstests. I was going to try to figure out why the config wasn't bootable with my qemu setup (nothing at all was showing up on the serial console), and then I decided life was just too short, and decided to use #syz test --- which then has been failing.

a-nogikh commented 1 year ago

Hi Ted!

You were lucky to hit a rare kind of syzbot breakage :) Thank you for reporting!

I've fixed the infrastructure problem and filed https://github.com/google/syzkaller/issues/3741 to consider ways to prevent it in the future.

the config has a huge number of random subsystems enabled

Hopefully at some point we'll start including the minimized config; we have https://github.com/google/syzkaller/issues/3199 for tracking.

to figure out why the config wasn't bootable with my qemu setup

AFAIK recent versions of qemu have been struggling with big kernel images. Should you still need to run it locally, try appending -machine pc-q35-7.1 to the qemu arguments.
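
For example, something along these lines (just an illustration; the kernel and disk image paths are placeholders for your own setup):

qemu-system-x86_64 \
  -machine pc-q35-7.1 \
  -m 2G -smp 2 \
  -kernel /path/to/bzImage \
  -append "root=/dev/sda console=ttyS0" \
  -drive file=/path/to/disk.img,format=raw \
  -nographic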

tytso commented 1 year ago

Thanks for the pro-tip about appending -machine pc-q35-7.1. I've been using the machine type "q35", which for my version of qemu (7.2.0 on Debian, package version 1:7.2+dfsg-1+b2) is aliased to pc-q35-7.2.
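
(For reference, I checked what "q35" resolves to by listing the machine types my qemu build supports:

qemu-system-x86_64 -machine help | grep -i q35

the "q35" entry there is marked as an alias of the versioned pc-q35-* machine.)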

I've confirmed that if I explicitly specify the machine type pc-q35-7.1, a kernel compiled with the syzbot config boots. If I use the machine type q35, or pc-q35-7.2, it doesn't boot; nothing shows up on the serial console. Looking at the source code, as well as the qemu documentation, I have no idea why it makes a difference, but apparently it does. :-(

Do you have any idea why it makes the difference between a successful and an unsuccessful boot? Thanks!

a-nogikh commented 1 year ago

Do you have any idea why it makes the difference between a successful and an unsuccessful boot? Thanks!

People have been pointing to this thread about a qemu bug https://lore.kernel.org/qemu-devel/da39abab9785aea2a2e7652ed6403b6268aeb31f.camel@linux.ibm.com/, but, to be honest, I did not dig deep into that.