helloSystem / ISO

helloSystem Live and installation ISO
https://github.com/helloSystem/
BSD 3-Clause "New" or "Revised" License

Experimental Live ISO: 'pkg remove' reboots the system #303

Closed probonopd closed 2 years ago

probonopd commented 3 years ago

Experimental Live ISO: pkg remove reboots the system. Tested with 12.2 and 13.0. Is this related to the use of unionfs? (Need to try without it, once we can easily monkey-patch ISOs.)

Possibly FreeBSD can be set up to send out crash dumps over the serial console, which can be viewed in QEMU?

probonopd commented 2 years ago

Same when trying to delete any file in /usr/local/ on the Live system. Probably because we are currently mounting the unionfs under the real tree.

probonopd commented 2 years ago

pkg remove leads to the same result when using

mkdir -p /tmp/unionfs/usr/local
mount -t nullfs /media/uzip/usr/local /usr/local
mount -t unionfs /tmp/unionfs/usr/local /usr/local

instead of

https://github.com/helloSystem/ISO/blob/252d853c6d7ba1bd2b9c8570953e0cfcff0d2875/overlays/boot/boot/init_script#L105-L106

However, it seems to be possible to delete files from /usr/local/bin. What might be causing this?

Someone with the knowledge to investigate a kernel crash/instant reboot is needed to look into this.

probonopd commented 2 years ago

Possibly we should make the unionfs mount first and then do the nullfs mount?

probonopd commented 2 years ago

FreeBSD can send kernel dumps over the network. This is especially useful for getting kernel dumps from machines running Live ISOs.

On the machine that should act as the server on which the dumps will be stored:

ifconfig
sudo pkg install netdumpd
mkdir -p network/dumps
sudo netdumpd -D -d ./network/dumps

On the machine under test:

ifconfig
sudo dumpon -s 192.168.0.xxx -c 192.168.0.yyy <interface>

Use the information from ifconfig to fill in the server's IP address (-s), the IP address of the machine under test (-c), and the name of its network interface (e.g., en0).
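To make the substitution less error-prone, the dumpon invocation above could be assembled by a small helper. This is only a sketch: the function name netdump_cmd, the example addresses, and the interface name em0 are placeholders, not anything taken from the helloSystem scripts. It prints the command rather than running it, so it can be reviewed first.

```shell
#!/bin/sh
# Sketch: build (but do not run) the dumpon command for netdump.
# netdump_cmd is a hypothetical helper; all arguments come from ifconfig output.
netdump_cmd() {
    server_ip=$1   # address of the machine running netdumpd (-s)
    client_ip=$2   # address of the machine under test (-c)
    iface=$3       # network interface on the machine under test
    if [ -z "$server_ip" ] || [ -z "$client_ip" ] || [ -z "$iface" ]; then
        echo "usage: netdump_cmd <server-ip> <client-ip> <interface>" >&2
        return 1
    fi
    printf 'sudo dumpon -s %s -c %s %s\n' "$server_ip" "$client_ip" "$iface"
}

# Example with placeholder values (documentation address range, assumed interface name):
netdump_cmd 192.0.2.10 192.0.2.20 em0
```

Printing instead of executing is deliberate here, since a wrong address just means the dump silently goes nowhere.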

probonopd commented 2 years ago

Looking at the resulting vmcore with strings, I see toward the end:

panic: lockmgr_xlock_hard: recursing on non recursive lockmgr
0xfffff801197ce628 @ /usr/src/sys/kern/vfs_subr.c:2974
cpuid = 0
time = 1638351901
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff80bdd114 at lockmgr_xlock_hard+0x484
#4 0xffffffff80cfc1b8 at _vn_lock+0x48
#5 0xffffffff80ce4621 at vget_finish+0x21
#6 0xffffffff80b4437d at tmpfs_alloc_vp+0x12d
#7 0xffffffff80b41d31 at tmpfs_lookup1+0x181
#8 0xffffffff80cc9d4d at vfs_cache_lookup+0xad
#9 0xffffffff80cd8120 at relookup+0x90
#10 0xffffffff82cff189 at unionfs_relookup+0xf9
#11 0xffffffff82cff31d at unionfs_relookup_for_delete+0x4d
#12 0xffffffff82d03b05 at unionfs_rmdir+0xa5
#13 0xffffffff8114d717 at VOP_RMDIR_APV+0x27
#14 0xffffffff80cf825d at kern_frmdirat+0x2ed
#15 0xffffffff8108ba8c at amd64_syscall+0x10c
#16 0xffffffff810620ce at fast_syscall_common+0xf8
Uptime: 2m48s

Full vmcore: https://github.com/helloSystem/ISO/releases/download/assets/vmcore.192.168.0.208.0.tar.bz2
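For anyone who wants to reproduce the excerpt above from a vmcore, something along these lines might work. It is a rough sketch, not a replacement for kgdb: extract_panic is a made-up helper name, and it assumes the panic record appears in the dump as plain text lines running from "panic:" to "Uptime:". If "panic:" shows up elsewhere in the dump, more than one block may be printed.

```shell
#!/bin/sh
# Sketch: print the panic record (from the "panic:" line through "Uptime:")
# found in the printable strings of a vmcore file.
# extract_panic is a hypothetical helper name.
extract_panic() {
    core=$1
    if command -v strings >/dev/null 2>&1; then
        strings -a "$core"
    else
        # Fallback when strings is not installed:
        # turn every non-printable byte into a newline.
        LC_ALL=C tr -c '[:print:]\n' '\n' < "$core"
    fi | sed -n '/^panic:/,/^Uptime:/p'
}

# Usage (the file name depends on how netdumpd stored the dump):
#   extract_panic ./network/dumps/vmcore.<client>.<n>
```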

Possibly this already gives some insight into what is going on? It clearly seems to be tripping over something unionfs-related while trying to remove a directory.

A quick search brings up https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242369 which might be related, as it is also about the combination of tmpfs and unionfs, like the helloSystem Live ISO is using.
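To see at a glance where tmpfs and unionfs meet in a backtrace like the one above, the frames can be reduced to function names and tagged by prefix. This is just an illustrative sketch; the helper name frames and the idea of tagging by the unionfs_ and tmpfs_ prefixes are assumptions for readability, not a real analysis tool.

```shell
#!/bin/sh
# Sketch: read a KDB stack backtrace on stdin and print one line per frame,
# "#N function", tagging frames whose names start with unionfs_ or tmpfs_.
# frames is a hypothetical helper name.
frames() {
    awk '/^#[0-9]+ 0x[0-9a-f]+ at / {
        fn = $4
        sub(/\+0x[0-9a-f]+$/, "", fn)   # drop the +0xNN offset
        tag = ""
        if (fn ~ /^unionfs_/) tag = " [unionfs]"
        else if (fn ~ /^tmpfs_/) tag = " [tmpfs]"
        print $1 " " fn tag
    }'
}

# Usage: paste the backtrace into a file and run
#   frames < backtrace.txt
```

Applied to the trace in this issue, the tagged frames should show unionfs_rmdir calling down through unionfs_relookup into the tmpfs lookup code, where the lock is taken a second time.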

probonopd commented 2 years ago

@michaeldexter says at https://twitter.com/michaeldexter/status/1586800937859158018

One rule of unionfs is: change the upper levels all you want but not the lower ones. Upper has been known to break too.

Is this the culprit?

https://github.com/helloSystem/ISO/blob/21f158763f809dbff8060c828db60d5c6b7907a0/overlays/boot/boot/init_script#L111

Would we be better off without -o below, and then mounting tmpfs atop?

probonopd commented 2 years ago

@Stefar77 says at https://twitter.com/Stefar77/status/1587200402512019456

It's a bit less likely to put you in the debugger (panic) instantly when the lower layer is read-only. - I think -

probonopd commented 2 years ago

@darkhelmet433 says at https://twitter.com/karinjiri/status/1587509601011830784

I still wonder about the old altroot code. It was done at an entirely different layer (vfs_lookup) and is basically unionfs-lite. It adds a second / (root) fallback. Eg: put a jail OS layer in a common directory. It was the backend of geocities hosting. Vastly simpler code.

Sounds intriguing. I like simple. Does it still exist? Can it still be built? Where does "the old altroot code" live, any pointers?

probonopd commented 2 years ago

@michaeldexter: Your observation was spot on... Let's see whether this fixes it.

probonopd commented 2 years ago

Well... halfway:

You can now pkg remove packages that you installed while in the Live session. But you still can't delete or remove files that are part of the ISO without crashing.
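Until that is fixed, one possible stopgap is to check whether a path comes from the ISO before deleting it. The sketch below assumes the read-only ISO tree is still visible under /media/uzip on the running Live system, as in the nullfs command earlier in this thread; in_iso_layer is a hypothetical helper, and the second argument exists only so the check can be pointed at a different root.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if the given absolute path also exists in the
# read-only ISO tree, i.e. deleting it would touch the lower layer.
# in_iso_layer is a hypothetical helper; /media/uzip is assumed from this thread.
in_iso_layer() {
    target=$1
    lower_root=${2:-/media/uzip}
    [ -e "${lower_root}${target}" ] || [ -L "${lower_root}${target}" ]
}

# Usage:
#   if in_iso_layer /usr/local/bin/foo; then
#       echo "part of the ISO, not deleting" >&2
#   else
#       rm /usr/local/bin/foo
#   fi
```

This only avoids triggering the panic by hand; it does not address the underlying unionfs problem.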