find: infinite loop when performing large amount of inplace file operations on btrfs

moetayuko commented 2 years ago

Background

I was building the Linux kernel in a shell environment (in fact the AOSP build system) where most utils are provided by toybox rather than GNU ones. The build infinitely stuck at a postprocessing step: https://github.com/torvalds/linux/blob/a7904a538933c525096ca2ccde1e60d0ee62c08e/kernel/gen_kheaders.sh#L80 After some investigation, I found that it's relevant to toybox's find utils.

MWE

Reproducible on a btrfs partition, but worked fine on ext4 in my experiments.

mkdir testdir && cd testdir
# only happens when the directory contains a large amount of files
# 500 was sufficient to repro on my Arch PC, but need to increase to 3000 on another Debian server
for i in {1..500}
    touch $(openssl rand -hex 12)  # generate empty test files with random name
cd ..
./toybox-x86_64 find testdir -type f -print0 | xargs -0 -n1 sed -i s/a/b/  # it hangs forever after this

Replacing ./toybox-x86_64 find with GNU findutils fixed the issue.

Environment

Toybox 0.8.6 downloaded from http://landley.net/toybox/bin
Arch Linux with 5.15.10-zen1-1-zen kernel (500 files)
Debian 11 with 5.10.0-9-amd64 kernel (3000 files)

landley commented 2 years ago

On Devuan Barista:

apt-get install btrfs-progs truncate -s 1g btrfs.img mkfs.btrfs btrfs.img mkdir sub mount -t btrfs btrfs.img sub cd sub for i in {1..500}; do touch xxxxxxxx.$i; done cd .. toybox find sub -type f -print0

Completed just fine? (Do you have a test that DOESN'T include a random number generator in hope it reproduces? I'm assuming this is a weird corner case in btrfs hashing behavior, possibly my 4.19.0-16-amd64 kernel from devuan hasn't introduced the btrfs-specific behavior yet? I can install arch in a vm, but "didn't see it" could once again mean openssl rand produced different output...)

landley commented 2 years ago

P.S. Last time I saw a filesystem-specific behavior like this, the problem was actually the block size of the underling readdir/getdents system call differing between the two environments. One granularity was triggering the filesystem bug, the other wasn't. It's not that gnu and busybox(?) had different algorithmic behavior, they were just fetching directory entry batches into different buffer sizes...

moetayuko commented 2 years ago

Thanks for the response!

Make sure find is chained with inplace file operations... Your example is reproducible on my Arch PC (5.15.10 kernel) with the following modifications:

increase 500 to 2000
add a simple inplace operation such as sed: ~/toybox-x86_64 find sub -type f -print0 | xargs -0 -n1 sed -i s/a/b/. The one-line perl script from Linux build system can be adopted as well.

p.s. the btrfs .img was stored on ext4 partition in my test

The issue is irrelevant to random file names, openssl was just used to generate unique filenames.

I was actually considered this as a btrfs bug, but I'm not sure and I don't know how to describe it when reporting to upstream since I'm not aware of the internal implementation of find

landley commented 2 years ago

On 12/20/21 10:16 AM, MoetaYuko wrote:

Thanks for the response!

Make sure find is chained with inplace file operations... Your example is reproducible on my Arch PC (5.15.10 kernel) with the following modifications:

increase 500 to 2000

add a simple inplace operation such as sed: |~/toybox-x86_64 find sub -type f -print0 | xargs -0 -n1 sed -i s/a/b/|. The one-line perl script from Linux build system can be adopted as well.

If you're modifying the directory while traversing it possibly btrfs is re-hashing the contentents in a way that causes the ongoing directory traversal to lose its place?

p.s. the btrfs .img was stored on ext4 partition in my test

Completed again for me. I'm guessing kernel version is involved. Lemme try it on a 5.15 kernel...

The issue is irrelevant to random file names, openssl was just used to generate unique filenames.

I was actually considered this as a btrfs bug, but I'm not sure and I don't know how to describe it when reporting to upstream since I'm not aware of the internal implementation of |find|

The internal implementation of toybox find is 730 lines of C:

https://github.com/landley/toybox/blob/master/toys/posix/find.c

Plus 210 lines of directory tree plumbing:

https://github.com/landley/toybox/blob/master/lib/dirtree.c

And the dirtree stuff largely boils down to:

dirtree(name, callback) { struct dirent entry; struct DIR dir; struct stat;

dir = opendir(name); while ((entry = readdir(dir))) { stat(entry->d_name, &st) callback(...) if (S_ISDIR(st.st_mode)) dirtree(entry->d_name, callback); } }

The find callback is just figuring out what to do with a given entry. Neither of these has the ability to "back up", it can only recurse endlessly if readdir() keeps returning more entries. (Or in theory if recursing into crosslinked subdirectories, but you don't even have any subbdirectories here.)

moetayuko commented 2 years ago

Completed again for me. I'm guessing kernel version is involved. Lemme try it on a 5.15 kernel...

Just keep increasing the count. 2000 completed on another machine with Debian 11 and 5.10 kernel, but reproduced after increasing to 5000

landley commented 2 years ago

I built a mkroot system and documented it here: http://lists.landley.net/pipermail/toybox-landley.net/2021-December/012756.html

Then did:

mount /dev/sda /mnt && cd /mnt echo -e '#!/bin/sh\necho -n "$1 "\nsed s/a/b/ $1' > /funky && chmod +x /funky rm -rf sub && mkdir -p sub && touch sub/{1..50000} && find sub | xargs -n1 /funky && echo

It completed. I went from 1000 to 7000 by thousands, then 14000, then 25000, then 50000...

Is there a kernel .config thing that's needed? I note btrfs has a number of kernel options, all I've switched on is its Anterior Cruciate Ligament support...

moetayuko commented 2 years ago

Weird...Reproduced on 3 different machines w/ distinct OS config for Arch w/ 5.15 kernel: http://fars.ee/2aMv config for Ubuntu 20.04 w/ 5.11 kernel: http://fars.ee/-XeR config for Debian 11 w/ 5.10 kernel: https://termbin.com/k89ov

enh-google commented 2 years ago

not reproducible, but appears to be kernel bug rather than toybox issue.

bmx666 commented 1 year ago

I have the same bug!

Ubuntu 20.04, kernel 5.14 or kernel 5.15
SSD with btrfs mounted to /work
Docker container based on Ubuntu 20.04 that binds /work/aosp as ~/working/aosp

I tried different variants, for example:

# works fine, but has limit on shell command
find $cpio_dir -type f -exec perl -pi -e 'BEGIN {undef $/;}; s/\/\*((?!SPDX).)*?\*\///smg;' {} +

# redirect to sort -z works fine too
find $cpio_dir -type f -print0 | sort -z |
    xargs -0 -P8 -n1 perl -pi -e 'BEGIN {undef $/;}; s/\/\*((?!SPDX).)*?\*\///smg;'

# stuck in infinity loop
find $cpio_dir -type f -exec perl -pi -e 'BEGIN {undef $/;}; s/\/\*((?!SPDX).)*?\*\///smg;' {} \;

# remove parallel 8 threads, stuck in infinity loop
find $cpio_dir -type f -print0 |
    xargs -0 -n1 perl -pi -e 'BEGIN {undef $/;}; s/\/\*((?!SPDX).)*?\*\///smg;'

So there are 2 cases with exec and pipeline have the same issue and in-place cause problem in toybox find version.

# use sed with in-place and dummy pattern, stuck in infinity loop
find $cpio_dir -type f -print0 |
    xargs -0 -P8 -n1 sed -i 's/__DUMMY__//g;'

bmx666 commented 1 year ago

I configure my PC environment and got 100% bug all the time:

OS - Ubuntu 20.04, kernel 5.15.0-76-generic x86_64
SSD - INTEL SSDPEKNW020T8
/etc/fstab - UUID=XXX /work btrfs defaults,autodefrag 0 2
output of mount command - /dev/nvme0n1p1 on /work type btrfs (rw,relatime,ssd,space_cache,autodefrag,subvolid=5,subvol=/)

If folder exceed 988 files - toybox find command does two iterations and then exit with success, but if folder contain more then 1384 - then it stuck in infinity loop.

landley commented 1 year ago

Hmmm, reproduces under Ubuntu 20.04 with btrfs you say...

landley commented 1 year ago

On the host I did:

sudo debootstrap chimaera newroot http://pkgmaster.devuan.org/merged sudo tar cvpJf newroot.tgz newroot truncate -s 8g blah.img

And in two different tabs, I ran: toybox netcat -s 127.0.0.1 -p 8888 -L toybox httpd . kvm -m 1536 -cdrom xubuntu-20.04.4-desktop-amd64.iso -drive format=raw,file=blah.img -boot d

(I note the ubuntu 20.04.4 ISO I had lying around has kernel "5.13.0-30-generic", not .14 or 1.5.)

When xfce finished booting in its window I clicked the "try" button to make the installer go away and opened a terminal:

sudo /bin/bash mkfs.btrfs /dev/?da mount /dev/?da /mnt cd /mnt wget http://10.0.2.2:8888/newroot.tgz tar xpvf newroot.tgz apt install git build-essential git clone https://github.com/landley/toybox cd toybox make defconfig toybox ./toybox find /mnt/newroot -type f -exec perl...

And it completed? I also did ./toybox find /mnt/newroot -type f -exec echo {} \; | wc -l which said 7362 and completed much faster.

I take it the btrfs bug wasn't present in the 5.13 kernel and got introduced more recently?

You didn't say what directory you ran your perl invocation in? It's possible that if something is writing to the btrfs directory, the hashing goes janky and the directory traversal cursor loses its place and backs up, and if the invocation keeps modifying the directory as it's traversing the traversal never ends? That would obviously be a filesystem bug, but might explain what's going on? (And why some things you shell out to have that problem, but other things don't. Find itself is behaving the same either way, you say in your bug report the behavior varies based on what child processes find launches, which... isn't find's fault?)

bmx666 commented 1 year ago

Hi @landley , it doesn't matter perl or sed, it cause error if util is doing in-place insertion.

I made a simple script

#!/bin/bash
set -e

cd /work
rm -fr tmp && mkdir tmp && cd tmp

for i in {1..2000}; do touch $i; done

# simple dump
echo '#!/bin/bash' > inplace.sh
echo 'echo FILE = $*' >> inplace.sh
echo 'sed -i "s/__DUMMY_ABC_//g" $*' >> inplace.sh
chmod a+x inplace.sh

/path/to/toybox find . -type f -exec ./inplace.sh {} \;

Could toybox find use temporary in-place file or it re-check somehow modified time or attributes of file because it was renamed and toybox decides to execute it again as new file?

according strace sed -i 's/a/b/g' /work/tmp/file it creates each time random file name like /work/tmp/sedc6yQuR then sed renames /work/tmp/sedc6yQuR into /work/tmp/file

bmx666 commented 1 year ago

Ok, I found even easy command just move file from X to tmp and back from tmp into X

echo '#!/bin/bash' > inplace.sh
echo 'mv -v "$*" tmp' >> inplace.sh
echo 'mv -v tmp "$*"' >> inplace.sh
chmod a+x inplace.sh

bmx666 commented 1 year ago

I checked GNU findutil source and I found the special property named as munged_filename and it stores original_filename, later to check visits.

bmx666 commented 1 year ago

Hi @landley could you please re-open issue?

I reproduced the same bug on Arch Linux with kernel 6.1.38-2-lts and 6.4.3-arch1-2

I made the test script to reproduce this issue

#!/bin/bash
set -e

CMD_FIND=0
CMD_MOUNT=0

TMPDIR="$(mktemp -d)"
echo "Using temporary dir = $TMPDIR"

function cleanup() {
  set +e
  [[ $CMD_FIND -ne 0 ]] && while [ ! -z "$(jobs -p)" ]; do kill $(jobs -p); sleep 1; done
  [[ $CMD_MOUNT -ne 0 ]] && cd "$TMPDIR" && sudo umount "$TMPDIR"/mount
  cd / && rm -fr "$TMPDIR"
}

trap cleanup EXIT INT HUP

mkdir -p "$TMPDIR" && cd "$TMPDIR"

git clone https://github.com/landley/toybox
cd toybox && ./configure && make defconfig toybox && cd ..

truncate -s 128MB btrfs.img
mkfs.btrfs btrfs.img

mkdir -p mount
sudo mount -o loop btrfs.img mount
CMD_MOUNT=1
sudo chmod -R o+rwx mount
cd mount

for i in {1..2000}; do touch $i; done

echo '#!/bin/bash' > ../inplace.sh
echo 'mv -v "$*" tmp' >> ../inplace.sh
echo 'mv -v tmp "$*"' >> ../inplace.sh
chmod a+x ../inplace.sh

timeout --kill-after=60 60 ../toybox/toybox find . -type f -exec ../inplace.sh {} \; && echo FINISH &
CMD_FIND=1
wait

landley commented 1 year ago

So you've confirmed that when btrfs re-hashes a directory, existing directory traversal loses its place.

Let's see what git log fs/btrfs in the kernel source has to say... that there are 2154 commits to that directory since 5.13. (I used --no-merges even!) Not the world's most stable filesystem, is it? (fs/fat has 42 over the same period.)

Hmmm, possibly btrfs fixed this bug in https://github.com/torvalds/linux/commit/8bb6898da627 although the "monotically increasing counter" mentioned in https://github.com/torvalds/linux/commit/04fc7d5123f2 sounds like "iterate over directory, delete and recreate each item you find, the new item is appended live to the end of your search results even though it didn't exist when you started the readdir(), so this NEVER ENDS" may be a design flaw instead of an implementation flaw. (Commit https://github.com/torvalds/linux/commit/723df2bcc9e1 lets me know this is complicated in btrfs...)

Right, could you try running this in a directory full of files on your btrfs and see if it ever completes?

#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  DIR *dir = opendir(".");
  struct dirent *dd;

  while ((dd = readdir(dir))) {
    printf("%s\n", dd->d_name);
    rename(dd->d_name, "TEMPFILE");
    rename("TEMPFILE", dd->d_name);
  }
  closedir(dir);
}

Because it completed just fine on my ext4 partition...

bmx666 commented 1 year ago

Hi @landley

Yes! It gets stuck in an infinite loop in all cases: direct FS on SSD or IMG FILE mounted as loop (ssd, hdd, tmpfs)

moetayuko commented 1 year ago

Hmmm, possibly btrfs fixed this bug in torvalds/linux@8bb6898da627 although the "monotically increasing counter" mentioned in torvalds/linux@04fc7d5123f2 sounds like "iterate over directory, delete and recreate each item you find, the new item is appended live to the end of your search results even though it didn't exist when you started the readdir(), so this NEVER ENDS" may be a design flaw instead of an implementation flaw.

Not true. The commits were merged in 6.2 and 6.1, but the issue persists in 6.4.3

landley commented 1 year ago

Posix doesn't forbid it ("undefined") but Linux handled this reliably until btrfs. (I very vaguely recall a bugfix about this back under 2.3 development, which is why I knew Linux did get this right because at one point they cared.)

Alas, in their old age the Linux developers have stopped caring about a bunch of things they used to in their old age. I can add a workaround option to find to read the contents of the directory into memory and then iterate over the memory version (internally it's the DIRTREE_BREADTH flag), but it changes the directory traversal order (all files and directories at this level, then descend into the subdirectories afterwards), and there's no flag for it in the debian version's man page I can find...

landley commented 1 year ago

By the way, the easy workaround is probably something like (untested):

find . -print0 > file.idx xargs -0 -n1 bash -c ' mv "$1" tmp; mv tmp "$1" ' < file.idx rm file.idx

I could do buffering inside find to workaround btrfs, but you can just do the buffering yourself.

(Still might be worth sending the test case to the btrfs devs to get them to explicitly say "we don't care what this breaks, it's on you to work around our behavior" on record and all. I've done some head scratching about "maybe -newer would... no, rename doesn't change the timestamp" and so on, but a directory traversal in btrfs is never guaranteed to end, and that's kinda creepy. I wonder what kind of denial of service attacks you could build around that in other programs...)

bmx666 commented 1 year ago

@landley, thanks a lot for your investigation! I'm really frustrated by this issue with btrfs, because Google is using and supplying built toybox as default host toolchain to build Android Kernel for other vendors like NXP and Qualcomm too. I see a lot of benefits in shipping already built binaries as it eliminates a lot of side effects, unlike when toybox was built for the host from supplied sources from external/toybox from AOSP itself, by the way toybox source in AOSP are older than the built binary version from Google (0.8.7 source vs 0.8.9 binary)

I will try to create a ticket about it on kernel.org to update gen_kheaders.sh and add sort -z after find to avoid btrfs side effect, Also I will try to create ticket on Google Tracker about it and maybe Qualcomm.

bmx666 commented 1 year ago

I made a ticket on kernel.org - https://bugzilla.kernel.org/show_bug.cgi?id=217681 and on Google Bug Tracker - https://issuetracker.google.com/issues/291804253

enh-google commented 1 year ago

by the way toybox source in AOSP are older than the built binary version from Google (0.8.7 source vs 0.8.9 binary)

by definition, that ain't true... see https://android.googlesource.com/platform/external/toybox/+/refs/heads/main/toys.h#142 (but, more likely "don't forget to repo sync your AOSP tree!").

bmx666 commented 1 year ago

by definition, that ain't true... see https://android.googlesource.com/platform/external/toybox/+/refs/heads/main/toys.h#142 (but, more likely "don't forget to repo sync your AOSP tree!").

Sorry, but no one use master branch as release delivery from Google

for br in $(git branch -a | grep -- 'origin/android13'); do git show ${br}:toys.h | grep TOYBOX_VERSION; done | sort -V -u

#define TOYBOX_VERSION "0.8.6"TOYBOX_VENDOR

even android-s branches have the latest 0.8.6

enh-google commented 1 year ago

even android-s branches have the latest 0.8.6

sure (ignoring the fact that S is two years old), but S also has the source to 0.8.6. you can't take the source from one branch and the binary from another and complain they don't match... of course they don't!

landley commented 1 year ago

Let me know what the kernel guys say. If the answer is "no", I can add a cache-and-release workaround to the toybox find source.

I want this to work for everybody. It seems pretty clearly like a filesystem bug, but if you're doing a build on that filesystem it needs to work. "Not my bug" doesn't mean I get to ignore it, because it exists and has to work.

I'm wondering if some sort of inode cacheing is sufficient? If I see the same dev+inode a second time, it's stuttering and I can stop... Except will other filesystems do some sort of "return files in alphabetical order, you renamed bcd to xyz so it shows up a second time but there's a zzz after it we haven't seen yet. Which means I don't include the entry with the same dev/inode I've already seen, but I don't get to stop there.

The problem is if I keep reading forever there's no guarantee it terminates on btrfs, and any stop before EOF is basically a heuristic? The cache-and-release approach means we won't perturb it, but we can't guarantee nobody ELSE does (hence denial of service). And if the directory is changing due to other processes editing the filesystem, endless updates aside: the larger the delay between the readdir() and acting upon it, the more likely the entry might have disappeared out from under us. That's a more classic race condition having a wider window to operate on, which is why I didn't want to change the default behavior, and instead add a new option. Maybe -breadth or some such?

landley commented 1 year ago

Closed #448 as a dupe of this, and the new action there is I poked the btrfs mailing list about it, ala https://www.spinics.net/lists/linux-btrfs/msg138634.html

If they recommend a workaround I can add it to find, but nothing I come up with on my own seems reliable in avoiding denial-of-service? (But maybe just "good enough" is the best you can hope with on btrfs? Hope nobody else using the filesystem acts maliciously?)

landley commented 1 year ago

Filipe Manana submitted a kernel fix https://lore.kernel.org/linux-btrfs/c9ceb0e15d92d0634600603b38965d9b6d986b6d.1691923900.git.fdmanana@suse.com/ and I tested it against 6.4 (it applied with offset and worked fine).

Alas spinics.net is giving me an "ERR_ADDRESS_UNREACHABLE" right now but if it works for you, checking the replies to the above link will probably find it, including my mkroot test procedure.

When a kernel comes out with this in it, remind me to close this.

bmx666 commented 1 year ago

Hi @landley, well done! Thanks a lot for your help and effort!

When a kernel comes out with this in it, remind me to close this.

Of course, I will wait a kernel release.

landley commented 1 year ago

Merged into the btrfs maintainer's tree, we just missed the -rc6 pull but there might be an -rc7 pull before the release: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next

(No, I don't know why it's in the log twice.)

bmx666 commented 1 year ago

Nice to have this fix in mainline but if it doesn't go into LTS kernels, then this issue should be open until the latest LTS in the list will be 6.4 😅

landley commented 1 year ago

The kernel fix has been merged for 6.5 https://github.com/torvalds/linux/commit/9b378f6ad48cfa195ed868db9123c09ee7ec5ea2 and backported to 6.4 https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/?id=5951039da4a87f2759f928fa921f9d166213e763

bmx666 commented 10 months ago

Hi @landley, good news, there is backported fix for kernel 5.15 -> https://lore.kernel.org/linux-btrfs/cover.1706183427.git.fdmanana@suse.com/ I hope that Ubuntu 20.04 and Ubuntu 22.04 will receive these patches soon and this issue will remain as a part of history 😄

landley commented 10 months ago

I've been cc'd on the emails about it. :)

jacky8hyf commented 4 months ago

Hi @landley ,

Thanks for watching out for the fix on btrfs.

Unfortunately I have recently seen the exact same bug on /dev/shm. Basically, I used the same repro steps starting step 2 in https://github.com/landley/toybox/issues/448 on /dev/shm, and it yields the same behavior (find produces files from 1~2000, then repeats indefinitely).

I am using Linux version 6.6.15 and Debian 13.2.0 if that helps.

I am not re-opening the bug because I think this is a bug in the filesystem as well; just FYI. BTW, it appears that busybox (v1.36.1) doens't handle this either, but GNU find ((GNU findutils) 4.9.0) does.

landley commented 4 months ago

Did you poke the kernel guys already?

landley commented 4 months ago

Ah, https://bugzilla.kernel.org/show_bug.cgi?id=219094

enh-google commented 3 months ago

https://lore.kernel.org/linux-mm/20240731153712.hffmznoujzrck2t6@quack3/T/

landley / toybox

find: infinite loop when performing large amount of inplace file operations on btrfs #306

Background

MWE

Environment