plougher / squashfs-tools

tools to create and extract Squashfs filesystems
GNU General Public License v2.0
758 stars 193 forks source link

mksquashfs crashes during Fedora installer image build #230

Closed AdamWill closed 1 year ago

AdamWill commented 1 year ago

I'm afraid I don't have full details on this yet (it's very late and I'm filing it in a hurry), but downstream comments indicate a release is imminent so I thought I'd better get a report in just in case it can prevent the release happening with potentially a bad bug.

Fedora downstream testing of a new squashfs-tools build that's based on current git HEAD - https://github.com/plougher/squashfs-tools/commit/36abab0ad661247498834c2e7f5e1ec476f2081d - show a consistent crash when an mksquashfs runs as part of a Fedora installer image build process. See https://bugzilla.redhat.com/show_bug.cgi?id=2178510 . Unfortunately the test didn't produce a coredump because it exceeded the default resource limits; in the morning (if nobody else has got one by then) I'll either reproduce manually or tweak the test to disable the resource limits so we get a coredump.

plougher commented 1 year ago

OK, sorry about that.

Do you know the last version that didn't produce the crash? That would help narrow down what commits caused the problem.

Alternately, if you could give me a reproducer, or test on an earlier version that would be helpful.

plougher commented 1 year ago

hmm, the errors reported -11 and 4 don't appear to be mean much. Interpreted as system call errors (errno) they're EAGAIN and EINTR. Perhaps they'd be more meaningful if I knew how to interpret them.

A web-browse shows there is an earlier recent Squashfs-tools based off git squashfs-tools-4.6-0.2.20230323git7cf6cee.fc39. Has that been known to work? There has been very little done since then except documentation (tedious but necessary) and a couple of annoying minor memory leaks showing up in latest testing.

It is very late here (I was about to go to bed), and I can't do anything more now. If there's been no developments once I get up, I'll see if I can install latest Rawhide.

plougher commented 1 year ago

Interpreting the error codes as a signal makes more sense, although there's no explicit indication in the log that a signal was delivered. 4 is SIGILL and 11 is SIGSEGV.

AdamWill commented 1 year ago

Yes, it's a signal 11 crash (negative return codes indicate signals, by convention).

The most recent build before this was actually even newer, it was squashfs-tools-4.6-0.5.20230312gitaaf011a.fc39 (i.e. https://github.com/plougher/squashfs-tools/commit/aaf011a868c786b06e74cbdaf860d45793939f35 ). That one passed tests. So the cause should be one of the six commits since then.

I will try and get a coredump today, or failing that, I'll try and bisect it.

AdamWill commented 1 year ago

Here's the backtrace

AdamWill commented 1 year ago

That makes https://github.com/plougher/squashfs-tools/commit/83b2f3a233ab7b83471dbb9cad8d99c50791cf71 look like the suspect. I'll try reverting it.

AdamWill commented 1 year ago

Could the problem be that j isn't actually available outside the for loop? i comes from outside the for loop and so is obviously still available once we've finished the loop, but j is part of the loop. I'm no C expert, but Google tells me that when you set a variable in a for loop like this, it's not available outside it: https://stackoverflow.com/questions/7880658/what-is-the-scope-of-a-while-and-for-loop

AdamWill commented 1 year ago

OTOH, I don't understand why before the change, we were doing xattr_list[i - 1].vnext = NULL; not xattr_list[i].vnext = NULL;?

edit: duh. yes I do. when e.g. i is 5, on the last iteration of the loop where we actually do the stuff, j is 4, and we set xattr_list[3].vnext to xattr_list[4]. So once we're finished, we need to set xattr_list[4].vnext - not xattr_list[5].vnext - to NULL. String indexes start at 0, i is a count from 1. Duh.

You actually did the exact opposite change - from xattr_list[j].vnext = NULL; to xattr_list[i - 1].vnext = NULL; in c5db34eef9c1d6e899e3d79e7d13672ee56c7bfc in September, to "fix out of bounds access", which sounds exactly like this bug. Maybe reverting that in this commit was simply inadvertent? You were working from an old version of this file or something?

AdamWill commented 1 year ago

231fixes this in a local test.

plougher commented 1 year ago

You actually did the exact opposite change - from xattr_list[j].vnext = NULL; to xattr_list[i - 1].vnext = NULL; in https://github.com/plougher/squashfs-tools/commit/c5db34eef9c1d6e899e3d79e7d13672ee56c7bfc in September, to "fix out of bounds access", which sounds exactly like this bug. Maybe reverting that in this commit was simply inadvertent? You were working from an old version of this file or something?

Probably something like that. When fixing small memory leaks I like to look at previous versions to understand when and why the leak occurred.

Usual problems, the release is already very late, and I keep on finding more things to do.

Proves the old proverb "more haste less speed" very true.